Yonik Seeley wrote:
On Dec 20, 2007 11:33 AM, Gabi Steinberg
<[EMAIL PROTECTED]> wrote:
It might be a bit harsh to drop the document if it has a very long
token in it.

There are really two issues here. For long tokens, one could either
ignore them or generate an exception.
I can see the argument both ways. On the one hand, we want indexing
to be robust/resilient, such that massive terms are quietly skipped
(maybe w/ a log to infoStream if it's set). On the other hand,
clearly there is something seriously wrong when your analyzer is
producing a single 16+ KB term, and so it would be nice to be
brittle/in-your-face so the user is forced to deal with/correct the
situation.
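
For the quiet-skipping option, a rough sketch of what the guard could
look like on the user side, as a custom TokenFilter (assuming the
attribute-based TokenStream API; the class name and the 255-char
cutoff are made up for illustration):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Illustrative filter that quietly drops over-long tokens before they
// ever reach the indexer.
public final class DropHugeTokensFilter extends TokenFilter {
  private static final int MAX_TOKEN_LENGTH = 255;  // illustrative cutoff

  private final CharTermAttribute termAtt =
      addAttribute(CharTermAttribute.class);

  public DropHugeTokensFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      if (termAtt.length() <= MAX_TOKEN_LENGTH) {
        return true;             // keep normal-sized tokens
      }
      // silently skip the massive token and keep pulling from the stream
    }
    return false;
  }
}
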
Also, it's really bad once these terms pollute your index. E.g.,
suddenly the TermInfos index can easily take tremendous amounts of
RAM, slow down indexing/merging/searching, etc. This is why
LUCENE-1052 was created. It's a lot better to catch this up front
than to let it pollute your index.
If we want to take the "in your face" solution, I think the cutoff
should be less than 16 KB (16 KB is just the hard limit inside DW).
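
The brittle variant could be the same filter shape, just failing fast
instead of dropping the term (the 8 K cutoff below is arbitrary,
simply something well under DW's 16 KB hard limit):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Illustrative "in your face" filter: an over-long term aborts
// indexing of the document instead of being silently dropped.
public final class FailOnHugeTokenFilter extends TokenFilter {
  private static final int CUTOFF = 8 * 1024;  // arbitrary, < 16 KB

  private final CharTermAttribute termAtt =
      addAttribute(CharTermAttribute.class);

  public FailOnHugeTokenFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    if (termAtt.length() > CUTOFF) {
      // force the user to confront the broken analyzer instead of
      // letting the term pollute the index
      throw new IllegalArgumentException(
          "term length " + termAtt.length() + " exceeds cutoff " + CUTOFF);
    }
    return true;
  }
}
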
For any exception generated while indexing a document (and passed
through to the user), it seems like that document should not be in
the index.
I like this disposition because it means the index is in a known
state. It's bad to have partial docs in the index: it can only lead
to more confusion as people try to figure out why some terms work for
retrieving the doc but others don't.
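
On the application side, that contract could look roughly like this
(a sketch only; SafeIndexer is a made-up helper, and catch-and-log is
just one possible reaction to a failed document):

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Hypothetical helper illustrating the proposed contract: if
// addDocument throws and the exception reaches the caller, the
// document is entirely absent from the index, never partially there.
final class SafeIndexer {
  static boolean tryAdd(IndexWriter writer, Document doc)
      throws IOException {
    try {
      writer.addDocument(doc);
      return true;              // fully indexed
    } catch (RuntimeException e) {
      // e.g. the analyzer blew up on a massive term: the doc should
      // not be in the index at all, so the index stays in a known state
      System.err.println("document skipped: " + e.getMessage());
      return false;
    }
  }
}
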
Mike