Yonik Seeley wrote:

On Dec 20, 2007 11:33 AM, Gabi Steinberg <[EMAIL PROTECTED]> wrote:
It might be a bit harsh to drop the document if it has a very long token
in it.

There are really two issues here.
For long tokens, one could either ignore them or generate an exception.

I can see the argument both ways. On the one hand, we want indexing to be robust/resilient, such that massive terms are quietly skipped (maybe with a log to infoStream if it's set).

On the other hand, clearly there is something seriously wrong when your analyzer is producing a single 16+ KB term, and so it would be nice to be brittle/in-your-face so the user is forced to deal with and correct the situation.

Also, it's really bad once these terms pollute your index. E.g., suddenly the TermInfos index can easily take tremendous amounts of RAM, slow down indexing/merging/searching, etc. This is why LUCENE-1052 was created. It's a lot better to catch this up front than to let it pollute your index.

If we want to take the "in your face" solution, I think the cutoff should be less than 16 KB (16 KB is just the hard limit inside DocumentsWriter).
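If instead we go the "quietly skip" route, the natural place to do it is in the analysis chain rather than deep inside the writer. Here's a minimal sketch, assuming the 2.x TokenStream API where next() returns a Token or null; MaxLengthFilter and the cutoff value are just illustrative (the stock LengthFilter already does essentially this):

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Illustrative filter: silently drops any token whose text exceeds maxLength.
// Wire it into the Analyzer's tokenStream() chain so oversized terms never
// reach IndexWriter at all.
public final class MaxLengthFilter extends TokenFilter {

  private final int maxLength;

  public MaxLengthFilter(TokenStream in, int maxLength) {
    super(in);
    this.maxLength = maxLength;
  }

  public Token next() throws IOException {
    // Pull tokens from the wrapped stream until one fits (or the stream ends).
    for (Token token = input.next(); token != null; token = input.next()) {
      if (token.termText().length() <= maxLength) {
        return token;
      }
      // else: drop the oversized token; this is also where a log line could go
    }
    return null;
  }
}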

For all exceptions generated while indexing a document (that are passed through to the user), it seems like that document should not be in the index.

I like this disposition because it means the index is in a known state. It's bad to have partial docs in the index: it can only lead to more confusion as people try to figure out why some terms work for retrieving the doc but others don't.
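It also means applications can treat addDocument as all-or-nothing. Something along these lines, assuming an analysis problem surfaces as a RuntimeException out of addDocument (tryAdd is just an illustrative helper, not an existing API):

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class IndexingHelper {
  // Illustrative helper: returns false if indexing this document failed.
  // Under the "drop the whole document" policy discussed above, no partial
  // fields or terms for it should be left behind in the index.
  static boolean tryAdd(IndexWriter writer, Document doc) throws IOException {
    try {
      writer.addDocument(doc);
      return true;
    } catch (RuntimeException e) {
      System.err.println("skipping document: " + e);
      return false;
    }
  }
}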

Mike


