On Dec 31, 2007, at 12:11 PM, Yonik Seeley wrote:
On Dec 31, 2007 11:59 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
On Dec 31, 2007, at 11:44 AM, Yonik Seeley wrote:
I meant (1)... it leaves the core smaller.
I don't see any reason to have logic to truncate or discard tokens in
the core indexing code (except to handle tokens >16k as an error
condition).
I would agree here, with the exception that I want the option for it
to be treated as an error.
That should also be possible via an analyzer component throwing an
exception.
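
Presumably something along these lines would do it (a rough sketch only,
written against the current TokenStream attribute API; the class name is
made up for illustration):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** Rejects tokens longer than a given limit by failing the analysis chain. */
public final class MaxTokenLengthFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final int maxLength;

  public MaxTokenLengthFilter(TokenStream input, int maxLength) {
    super(input);
    this.maxLength = maxLength;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    if (termAtt.length() > maxLength) {
      // Surface the problem in the analysis chain instead of deep in the indexer.
      throw new IllegalArgumentException(
          "token length " + termAtt.length() + " exceeds limit " + maxLength);
    }
    return true;
  }
}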
Sure, but I mean the >16K case (in other words, the case where
DocsWriter fails, which presumably only DocsWriter knows about).
I want the option to ignore tokens larger than that instead of failing/
throwing an exception. Imagine I am charged with indexing some data
that I don't know anything about (e.g. computer forensics); my goal
would be to index as much as possible in my first raw pass, so that I
can then begin to explore the dataset. Having it completely discard
the document is not a good thing, but throwing away some large binary
tokens would be acceptable and robust (especially if I get warnings
about said tokens).
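
Something like this is what I have in mind (purely illustrative; the class
name and the System.err warning are made up, and it's written against the
current TokenStream attribute API):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** Drops oversized tokens (e.g. embedded binary blobs) with a warning. */
public final class DropOversizedTokensFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final int maxLength;

  public DropOversizedTokensFilter(TokenStream input, int maxLength) {
    super(input);
    this.maxLength = maxLength;
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      if (termAtt.length() <= maxLength) {
        return true;  // normal token, pass it through
      }
      // Warn and keep going rather than aborting the whole document.
      // (Real code would use proper logging; position increments of the
      // dropped token are not preserved in this sketch.)
      System.err.println("skipping token of length " + termAtt.length());
    }
    return false;
  }
}

The rest of the document still gets indexed, and the warnings tell me where
to look more closely on a second pass.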
-Grant