On Dec 31, 2007, at 12:11 PM, Yonik Seeley wrote:

On Dec 31, 2007 11:59 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

On Dec 31, 2007, at 11:44 AM, Yonik Seeley wrote:
I meant (1)... it leaves the core smaller.
I don't see any reason to have logic to truncate or discard tokens in
the core indexing code (except to handle tokens >16k as an error
condition).

I would agree here, with the exception that I want the option for it
to be treated as an error.

That should also be possible via an analyzer component throwing an exception.


Sure, but I mean in the >16K case (in other words, the case where DocsWriter fails, which presumably only DocsWriter knows about). I want the option to ignore tokens larger than that instead of failing/throwing an exception. Imagine I am charged with indexing some data that I don't know anything about (e.g. computer forensics); my goal would be to index as much as possible in a first raw pass, so that I can then begin to explore the dataset. Completely discarding the document is not a good thing, but throwing away some large binary tokens would be acceptable (especially if I get warnings about said tokens) and robust.
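
For illustration, here is a minimal sketch of what such an analyzer-level guard might look like, assuming the classic Lucene 2.x TokenFilter/Token API; the class name and length cutoff are placeholders, not an existing Lucene class:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Drops any token whose text exceeds maxLength so it never reaches the
// indexer; a real version might also log a warning per discarded token.
public class DiscardOversizedTokenFilter extends TokenFilter {
  private final int maxLength;

  public DiscardOversizedTokenFilter(TokenStream input, int maxLength) {
    super(input);
    this.maxLength = maxLength;
  }

  public Token next() throws IOException {
    Token t;
    while ((t = input.next()) != null) {
      if (t.termText().length() <= maxLength) {
        return t;  // normal-sized token: pass through unchanged
      }
      // oversized token: skip it and keep pulling from the input stream
    }
    return null;   // end of stream
  }
}

I believe the existing LengthFilter in org.apache.lucene.analysis already does roughly this for a min/max range; the exception-throwing variant Yonik mentions would just throw from next() instead of skipping.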

-Grant

