Yonik Seeley wrote:

On Dec 20, 2007 11:33 AM, Gabi Steinberg <[EMAIL PROTECTED]> wrote:
It might be a bit harsh to drop the document if it has a very long token
in it.

There are really two issues here.
For long tokens, one could either ignore them or generate an exception.

I can see the argument both ways. On the one hand, we want indexing to be robust/resilient, such that massive terms are quietly skipped (maybe with a log to infoStream if it's set).

On the other hand, clearly there is something seriously wrong when your analyzer is producing a single 16+ KB term, and so it would be nice to be brittle/in-your-face so the user is forced to deal with and correct the situation.

Also, it's really bad once these terms pollute your index. E.g., suddenly the TermInfos index can easily take tremendous amounts of RAM, slow down indexing/merging/searching, etc. This is why LUCENE-1052 was created. It's a lot better to catch this up front than to let it pollute your index.

If we want to take the "in your face" solution, I think the cutoff should be less than 16 KB (16 KB is just the hard limit inside DocumentsWriter).
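If instead we go the "quietly skip" route, the natural place to do it is in the analysis chain rather than deep inside the writer. Here's a minimal sketch, assuming the 2.x TokenStream API where next() returns a Token or null; MaxLengthFilter and the cutoff value are just illustrative (the stock LengthFilter already does essentially this):

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Illustrative filter: silently drops any token whose text exceeds maxLength.
// Wire it into the Analyzer's tokenStream() chain so oversized terms never
// reach IndexWriter at all.
public final class MaxLengthFilter extends TokenFilter {

  private final int maxLength;

  public MaxLengthFilter(TokenStream in, int maxLength) {
    super(in);
    this.maxLength = maxLength;
  }

  public Token next() throws IOException {
    // Pull tokens from the wrapped stream until one fits (or the stream ends).
    for (Token token = input.next(); token != null; token = input.next()) {
      if (token.termText().length() <= maxLength) {
        return token;
      }
      // else: drop the oversized token; this is also where a log line could go
    }
    return null;
  }
}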

For all exceptions generated while indexing a document (that are passed through to the user), it seems like that document should not be in the index.

I like this disposition because it means the index is in a known state. It's bad to have partial docs in the index: it can only lead to more confusion as people try to figure out why some terms work for retrieving the doc but others don't.
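It also means applications can treat addDocument as all-or-nothing. Something along these lines, assuming an analysis problem surfaces as a RuntimeException out of addDocument (tryAdd is just an illustrative helper, not an existing API):

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class IndexingHelper {
  // Illustrative helper: returns false if indexing this document failed.
  // Under the "drop the whole document" policy discussed above, no partial
  // fields or terms for it should be left behind in the index.
  static boolean tryAdd(IndexWriter writer, Document doc) throws IOException {
    try {
      writer.addDocument(doc);
      return true;
    } catch (RuntimeException e) {
      System.err.println("skipping document: " + e);
      return false;
    }
  }
}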

Mike


