Gabi Steinberg wrote:
On balance, I think that dropping the document makes sense. Yonik is
right that ensuring tokens are useful - and indexable - is the
tokenizer's job.
StandardTokenizer, in my opinion, should behave similarly to a
person looking at a document and deciding which tokens should be
indexed. Few people would argue that a 16K block of binary data is
useful for searching, but it's reasonable to suggest that the text
around it is useful.
I know that one can add the LengthFilter to avoid this problem, but
this is not really intuitive; one does not expect the standard
tokenizer to generate tokens that IndexWriter chokes on.
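For what it's worth, a minimal sketch of that workaround, assuming the
pre-3.0 analysis API (the analyzer class name and the 128-character
cutoff are just illustrative):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LengthFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class LengthLimitedAnalyzer extends Analyzer {
  // Illustrative cutoff; DocumentsWriter's hard limit is 16K.
  private static final int MAX_TOKEN_LENGTH = 128;

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new StandardTokenizer(reader);
    // LengthFilter silently drops tokens outside [1, MAX_TOKEN_LENGTH].
    return new LengthFilter(stream, 1, MAX_TOKEN_LENGTH);
  }
}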
My vote is to:
- drop documents with tokens longer than 16K, as Mike and Yonik
suggested
- because an uninformed user would start with StandardTokenizer, I
think it should limit token size to 128 bytes, add options to change
that size and to choose between truncating or dropping longer tokens,
and in no case produce tokens longer than what IndexWriter can digest
(a rough sketch of such a filter follows this list).
I like this idea, though we probably can't do that until 3.0 so we
don't break backwards compatibility?
- perhaps come up with a clear policy on when a tokenizer should throw
an exception?
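To make that proposal concrete, here is a rough sketch of the kind of
filter being described, using the pre-3.0 TokenStream API (the class
name and the truncate/drop flag are hypothetical):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class LimitTokenSizeFilter extends TokenFilter {
  private final int maxLength;
  private final boolean truncate; // true = truncate long tokens, false = drop them

  public LimitTokenSizeFilter(TokenStream in, int maxLength, boolean truncate) {
    super(in);
    this.maxLength = maxLength;
    this.truncate = truncate;
  }

  public Token next() throws IOException {
    for (Token t = input.next(); t != null; t = input.next()) {
      String text = t.termText();
      if (text.length() <= maxLength) {
        return t;
      }
      if (truncate) {
        // Keep the first maxLength characters, preserving the original offsets.
        return new Token(text.substring(0, maxLength), t.startOffset(), t.endOffset());
      }
      // Otherwise drop the oversized token and keep reading.
    }
    return null;
  }
}

StandardTokenizer (or an analyzer wrapping it) could then apply
something like this with a default limit of 128 and a setter to change
it; note the sketch limits characters rather than bytes.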
Gabi Steinberg.
Yonik Seeley wrote:
On Dec 20, 2007 11:57 AM, Michael McCandless <[EMAIL PROTECTED]> wrote:
Yonik Seeley wrote:
On Dec 20, 2007 11:33 AM, Gabi Steinberg <[EMAIL PROTECTED]> wrote:
It might be a bit harsh to drop the document if it has a very long
token in it.
There are really two issues here.
For long tokens, one could either ignore them or generate an
exception.
I can see the argument both ways.
Me too.
On the one hand, we want indexing
to be robust/resilient, such that massive terms are quietly skipped
(maybe w/ a log to infoStream if it's set).
On the other hand, clearly there is something seriously wrong when
your analyzer is producing a single 16+ KB term, and so it would be
nice to be brittle/in-your-face so the user is forced to deal with/
correct the situation.
Also, it's really bad once these terms pollute your index. E.g.,
suddenly the TermInfos index can easily take tremendous amounts of
RAM, slow down indexing/merging/searching, etc. This is why
LUCENE-1052 was created. It's a lot better to catch this up front
than to let it pollute your index.
If we want to take the "in your face" solution, I think the cutoff
should be less than 16 KB (16 KB is just the hard limit inside DocumentsWriter).
For any exception generated while indexing a document (and passed
through to the user), it seems like that document should not end up
in the index.
I like this disposition because it means the index is in a known
state. It's bad to have partial docs in the index: it can only lead
to more confusion as people try to figure out why some terms work
for retrieving the doc but others don't.
Right... and I think that was the behavior before the indexing code
was rewritten, since the new single-doc segment was only added after
the complete document was inverted (hence any exception would prevent
it from being added).
-Yonik
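As an aside, a rough sketch of what the "robust" option looks like
from the application side, assuming the pre-3.0 IndexWriter API and
the all-or-nothing behavior discussed above (the helper class and
exception handling are hypothetical):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class SkipBadDocsIndexer {
  // Adds each document, skipping any that make addDocument() throw.
  // Relies on a failed document not being partially present in the index.
  public static int indexAll(IndexWriter writer, Iterable<Document> docs)
      throws IOException {
    int skipped = 0;
    for (Document doc : docs) {
      try {
        writer.addDocument(doc);
      } catch (RuntimeException e) {
        // E.g. the analyzer produced a term larger than the hard limit.
        System.err.println("Skipping document: " + e.getMessage());
        skipped++;
      }
    }
    return skipped;
  }
}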