On Dec 20, 2007, at 11:57 AM, Michael McCandless wrote:


Yonik Seeley wrote:

On Dec 20, 2007 11:33 AM, Gabi Steinberg <[EMAIL PROTECTED]> wrote:
It might be a bit harsh to drop the document if it has a very long token
in it.

There are really two issues here.
For long tokens, one could either ignore them or generate an exception.

I can see the argument both ways. On the one hand, we want indexing to be robust/resilient, such that massive terms are quietly skipped (maybe w/ a log to infoStream if it's set).

This would be fine for me. In some sense, it is just like applying the LengthFilter, which also removes tokens silently, except that this would work for all analyzers. But I can see the value in the throw-the-exception case too, except I think the API should declare that the exception is being thrown. It could throw a subclass of IOException.
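
To make the two options concrete, here is a rough sketch in plain Java. It is only an illustration: TokenTooLongException and LongTokenPolicy are made-up names, not Lucene classes, and tokens are modeled as plain strings rather than a real TokenStream.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical exception type (an assumption for this sketch, not a Lucene class). */
class TokenTooLongException extends IOException {
    TokenTooLongException(String message) {
        super(message);
    }
}

/** Hypothetical helper contrasting the two policies discussed above. */
final class LongTokenPolicy {
    private LongTokenPolicy() {}

    /** Option 1: silently drop tokens longer than maxLength, in the spirit of LengthFilter. */
    static List<String> skipLong(List<String> tokens, int maxLength) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() <= maxLength) {
                kept.add(token);
            }
        }
        return kept;
    }

    /** Option 2: fail loudly, with the checked exception declared in the signature. */
    static List<String> rejectLong(List<String> tokens, int maxLength)
            throws TokenTooLongException {
        for (String token : tokens) {
            if (token.length() > maxLength) {
                throw new TokenTooLongException(
                    "token of length " + token.length() + " exceeds limit " + maxLength);
            }
        }
        return tokens;
    }
}
```

Because the hypothetical exception extends IOException, callers that already handle IOException around indexing calls would not need new catch blocks, while the throws clause still documents the failure mode.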




On the other hand, clearly there is something seriously wrong when your analyzer is producing a single 16+ KB term, and so it would be nice to be brittle/in-your-face so the user is forced to deal with/correct the situation.

Also, it's really bad once these terms pollute your index. E.g., suddenly the Terminfos index can easily take tremendous amounts of RAM, slow down indexing/merging/searching, etc. This is why LUCENE-1052 was created. It's a lot better to catch this up front than to let it pollute your index.

If we want to take the "in your face" solution, I think the cutoff should be less than 16 KB (16 KB is just the hard limit inside DW).

For all exceptions generated while indexing a document (that are
passed through to the user)
it seems like that document should not be in the index.

I like this disposition because it means the index is in a known state. It's bad to have partial docs in the index: it can only lead to more confusion as people try to figure out why some terms work for retrieving the doc but others don't.
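
As a rough illustration of that all-or-nothing behavior, here is a toy sketch in plain Java. ToyIndex and its methods are hypothetical and say nothing about how IndexWriter or DW actually implement this; the point is only that every term is validated before any posting is written, so a failed document leaves no partial entries behind.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory inverted index (hypothetical; not Lucene's IndexWriter). */
final class ToyIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private final int maxTermLength;
    private int nextDocId = 0;

    ToyIndex(int maxTermLength) {
        this.maxTermLength = maxTermLength;
    }

    /** Adds all terms of a document, or none of them if any term is too long. */
    int addDocument(List<String> terms) throws IOException {
        // Validate every term before touching the shared postings map.
        for (String term : terms) {
            if (term.length() > maxTermLength) {
                throw new IOException("term too long: " + term.length() + " chars");
            }
        }
        // Only now assign a doc id and record postings.
        int docId = nextDocId++;
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new ArrayList<>()).add(docId);
        }
        return docId;
    }

    /** Returns the ids of documents containing the term (empty if none). */
    List<Integer> docsFor(String term) {
        return postings.getOrDefault(term, new ArrayList<>());
    }
}
```

With this shape, a document that fails on an oversized term is not retrievable by any of its other terms either, which avoids the "some terms work but others don't" confusion.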

Mike



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




