It might be a bit harsh to drop the document if it has a very long token in it. I can imagine documents with embedded binary data, where the text around the binary data is still useful for search.

My feeling is that long tokens (longer than 128 or 256 bytes) are not useful for search, and should be truncated or dropped.
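
For illustration, a minimal sketch of such a filter, assuming the 2.x-era TokenStream API where next() returns the next Token or null (the DropLongTokensFilter name is made up; Lucene's built-in LengthFilter already offers similar min/max length filtering by character count):

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical filter that silently drops tokens longer than maxLength
// characters instead of failing or bloating the whole document.
public final class DropLongTokensFilter extends TokenFilter {
  private final int maxLength;

  public DropLongTokensFilter(TokenStream in, int maxLength) {
    super(in);
    this.maxLength = maxLength;
  }

  public Token next() throws IOException {
    Token t;
    // Skip over-long tokens; return the next acceptable one, or null at end.
    while ((t = input.next()) != null) {
      if (t.termText().length() <= maxLength) {
        return t;
      }
    }
    return null;
  }
}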

Gabi.

Yonik Seeley wrote:
On Dec 20, 2007 11:15 AM, Michael McCandless <[EMAIL PROTECTED]> wrote:
Though ... we could simply delete the document immediately when any
exception occurs during its processing.  So if we decide that whenever a
doc hits an exception it should be deleted, it's not so hard to
implement that policy...

It does seem like you only want documents in the index that didn't
generate exceptions... otherwise it doesn't seem like you would know
exactly what got indexed.
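
A minimal sketch of that policy applied on the application side, assuming each document carries a unique "id" field (SafeIndexer and addOrDrop are hypothetical names, not Lucene API; IndexWriter.addDocument and deleteDocuments(Term) are the real calls):

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Hypothetical helper: only documents that indexed cleanly stay in the index.
public class SafeIndexer {
  private final IndexWriter writer;

  public SafeIndexer(IndexWriter writer) {
    this.writer = writer;
  }

  // Returns true if the document was indexed, false if it hit an
  // exception and was deleted again.
  public boolean addOrDrop(Document doc, String id) throws IOException {
    try {
      writer.addDocument(doc);
      return true;
    } catch (Exception e) {
      // The doc may have been partially indexed before the failure;
      // deleting by its unique id keeps the index consistent.
      writer.deleteDocuments(new Term("id", id));
      return false;
    }
  }
}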

-Yonik
