Gabi Steinberg wrote:
On balance, I think that dropping the document makes sense. Yonik is
right that ensuring tokens are useful - and indexable - is the
tokenizer's job.
StandardTokenizer, in my opinion, should behave similarly to a
person looking at a document and deciding which tokens should be
indexed. Few people would argue that a 16K block of binary data is
useful for searching, but it's reasonable to suggest that the text
around it is useful.
I know that one can add the LengthFilter to avoid this problem, but
this is not really intuitive; one does not expect the standard
tokenizer to generate tokens that IndexWriter chokes on.
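For what it's worth, a minimal sketch of that workaround, assuming the
pre-3.0 analysis API (the analyzer class name and the 128-character
cutoff are just illustrative):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LengthFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class LengthLimitedAnalyzer extends Analyzer {
  // Illustrative cutoff; DocumentsWriter's hard limit is 16K.
  private static final int MAX_TOKEN_LENGTH = 128;

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new StandardTokenizer(reader);
    // LengthFilter silently drops tokens outside [1, MAX_TOKEN_LENGTH].
    return new LengthFilter(stream, 1, MAX_TOKEN_LENGTH);
  }
}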
My vote is to:
- drop documents with tokens longer than 16K, as Mike and Yonik
suggested
- because an uninformed user would start with StandardTokenizer, I
think it should limit token size to 128 bytes, add options to change
that size and to choose between truncating or dropping longer tokens,
and in no case produce tokens longer than what IndexWriter can digest
(a rough sketch of such a filter follows this list).
I like this idea, though we probably can't do that until 3.0 so we
don't break backwards compatibility?
- perhaps come up with a clear policy on when a tokenizer should throw
an exception?
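To make that proposal concrete, here is a rough sketch of the kind of
filter being described, using the pre-3.0 TokenStream API (the class
name and the truncate/drop flag are hypothetical):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class LimitTokenSizeFilter extends TokenFilter {
  private final int maxLength;
  private final boolean truncate; // true = truncate long tokens, false = drop them

  public LimitTokenSizeFilter(TokenStream in, int maxLength, boolean truncate) {
    super(in);
    this.maxLength = maxLength;
    this.truncate = truncate;
  }

  public Token next() throws IOException {
    for (Token t = input.next(); t != null; t = input.next()) {
      String text = t.termText();
      if (text.length() <= maxLength) {
        return t;
      }
      if (truncate) {
        // Keep the first maxLength characters, preserving the original offsets.
        return new Token(text.substring(0, maxLength), t.startOffset(), t.endOffset());
      }
      // Otherwise drop the oversized token and keep reading.
    }
    return null;
  }
}

StandardTokenizer (or an analyzer wrapping it) could then apply
something like this with a default limit of 128 and a setter to change
it; note the sketch limits characters rather than bytes.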
Gabi Steinberg.
Yonik Seeley wrote:
On Dec 20, 2007 11:57 AM, Michael McCandless <[EMAIL PROTECTED]> wrote:
Yonik Seeley wrote:
On Dec 20, 2007 11:33 AM, Gabi Steinberg <[EMAIL PROTECTED]> wrote:
It might be a bit harsh to drop the document if it has a very long
token in it.
There are really two issues here.
For long tokens, one could either ignore them or generate an
exception.
I can see the argument both ways.
Me too.
On the one hand, we want indexing
to be robust/resilient, such that massive terms are quietly skipped
(maybe w/ a log to infoStream if it's set).
On the other hand, clearly there is something seriously wrong when
your analyzer is producing a single 16+ KB term, and so it would be
nice to be brittle/in-your-face so the user is forced to deal with/
correct the situation.
Also, it's really bad once these terms pollute your index. E.g.,
suddenly the TermInfos index can easily take tremendous amounts of
RAM, slow down indexing/merging/searching, etc. This is why
LUCENE-1052 was created. It's a lot better to catch this up front
than to let it pollute your index.
If we want to take the "in your face" solution, I think the cutoff
should be less than 16 KB (16 KB is just the hard limit inside DocumentsWriter).
For any exception generated while indexing a document (and passed
through to the user), it seems like that document should not end up
in the index.
I like this disposition because it means the index is in a known
state. It's bad to have partial docs in the index: it can only lead
to more confusion as people try to figure out why some terms work
for retrieving the doc but others don't.
Right... and I think that was the behavior before the indexing code
was rewritten, since the new single-doc segment was only added after
the complete document was inverted (hence any exception would prevent
it from being added).
-Yonik
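As an aside, a rough sketch of what the "robust" option looks like
from the application side, assuming the pre-3.0 IndexWriter API and
the all-or-nothing behavior discussed above (the helper class and
exception handling are hypothetical):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class SkipBadDocsIndexer {
  // Adds each document, skipping any that make addDocument() throw.
  // Relies on a failed document not being partially present in the index.
  public static int indexAll(IndexWriter writer, Iterable<Document> docs)
      throws IOException {
    int skipped = 0;
    for (Document doc : docs) {
      try {
        writer.addDocument(doc);
      } catch (RuntimeException e) {
        // E.g. the analyzer produced a term larger than the hard limit.
        System.err.println("Skipping document: " + e.getMessage());
        skipped++;
      }
    }
    return skipped;
  }
}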