On Dec 20, 2007, at 11:57 AM, Michael McCandless wrote:
Yonik Seeley wrote:
On Dec 20, 2007 11:33 AM, Gabi Steinberg
<[EMAIL PROTECTED]> wrote:
It might be a bit harsh to drop the document if it has a very long
token in it.
There are really two issues here.
For long tokens, one could either ignore them or generate an
exception.
I can see the argument both ways. On the one hand, we want indexing
to be robust/resilient, such that massive terms are quietly skipped
(maybe w/ a log to infoStream if it's set).
This would be fine for me. In some sense, it is just like applying
the LengthFilter, which removes tokens silently, too, but works for
all analyzers. But I can see the value in the throw-the-exception
case too, except I think the API should declare that the exception is
thrown. It could throw an extension of IOException.
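
To make the two options concrete, here is a minimal sketch in plain Java. It is not Lucene code: the names LongTokenPolicy, TokenTooLongException, and maxTokenLength are hypothetical, and length is measured in Java chars purely for illustration. The only thing taken from the discussion is that the exception extends IOException, so the API has to declare it.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical checked exception. Because it extends IOException,
// callers must either catch it or declare it.
class TokenTooLongException extends IOException {
    TokenTooLongException(int length, int max) {
        super("token length " + length + " exceeds limit " + max);
    }
}

// Hypothetical helper, not a Lucene class.
class LongTokenPolicy {
    // Option 1: silently drop oversized tokens, similar in spirit
    // to what LengthFilter does inside an analysis chain.
    static List<String> skipLong(List<String> tokens, int maxTokenLength) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (t.length() <= maxTokenLength) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Option 2: fail on the first oversized token so the caller
    // is forced to deal with it.
    static List<String> rejectLong(List<String> tokens, int maxTokenLength)
            throws TokenTooLongException {
        for (String t : tokens) {
            if (t.length() > maxTokenLength) {
                throw new TokenTooLongException(t.length(), maxTokenLength);
            }
        }
        return new ArrayList<>(tokens);
    }
}
```

With a limit of 4, skipLong on ["ok", "waytoolong", "fine"] keeps "ok" and "fine", while rejectLong on the same input throws.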
On the other hand, clearly there is something seriously wrong when
your analyzer is producing a single 16+ KB term, and so it would be
nice to be brittle/in-your-face so the user is forced to deal with or
correct the situation.
Also, it's really bad once these terms pollute your index. E.g.,
suddenly the Terminfos index can easily take tremendous amounts of
RAM, slow down indexing/merging/searching, etc. This is why
LUCENE-1052 was created. It's a lot better if you catch this up
front than to let it pollute your index.
If we want to take the "in your face" solution, I think the cutoff
should be less than 16 KB (16 KB is just the hard limit inside DW).
For all exceptions generated while indexing a document (that are
passed through to the user), it seems like that document should not
be in the index.
I like this disposition because it means the index is in a known
state. It's bad to have partial docs in the index: it can only lead
to more confusion as people try to figure out why some terms work
for retrieving the doc but others don't.
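
The all-or-nothing behavior described above can be sketched as follows. This is a toy in-memory index in plain Java, not how Lucene's IndexWriter or DW is implemented: AtomicToyIndex, addDocument, docsFor, and maxTermLength are all hypothetical names. The point is only that every term is validated before the shared postings are touched, so a failing document leaves no partial entries behind.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy index, for illustration only.
class AtomicToyIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private final int maxTermLength;
    private int nextDocId = 0;

    AtomicToyIndex(int maxTermLength) {
        this.maxTermLength = maxTermLength;
    }

    // Stage and validate all terms first; only then update the postings.
    // If any term is too long, nothing from this document is added.
    int addDocument(List<String> terms) throws IOException {
        Set<String> staged = new LinkedHashSet<>();
        for (String term : terms) {
            if (term.length() > maxTermLength) {
                throw new IOException("term of length " + term.length()
                        + " exceeds limit " + maxTermLength);
            }
            staged.add(term);
        }
        int docId = nextDocId++;
        for (String term : staged) {
            postings.computeIfAbsent(term, k -> new ArrayList<>()).add(docId);
        }
        return docId;
    }

    // Returns the ids of documents containing the term (empty if none).
    List<Integer> docsFor(String term) {
        return postings.getOrDefault(term, new ArrayList<>());
    }
}
```

In this sketch, a document containing one oversized term is rejected as a whole, so none of its other terms become searchable either.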
Mike
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]