On Dec 31, 2007, at 12:11 PM, Yonik Seeley wrote:
On Dec 31, 2007 11:59 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
On Dec 31, 2007, at 11:44 AM, Yonik Seeley wrote:
I meant (1)... it leaves the core smaller.
I don't see any reason to have logic to truncate or discard tokens in
the core indexing code (except to handle tokens >16k as an error
condition).
I would agree here, with the exception that I want the option for it
to be treated as an error.
That should also be possible via an analyzer component throwing an
exception.
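
Presumably something along these lines would do it (a rough sketch only,
written against the current TokenStream attribute API; the class name is
made up for illustration):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** Rejects tokens longer than a given limit by failing the analysis chain. */
public final class MaxTokenLengthFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final int maxLength;

  public MaxTokenLengthFilter(TokenStream input, int maxLength) {
    super(input);
    this.maxLength = maxLength;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    if (termAtt.length() > maxLength) {
      // Surface the problem in the analysis chain instead of deep in the indexer.
      throw new IllegalArgumentException(
          "token length " + termAtt.length() + " exceeds limit " + maxLength);
    }
    return true;
  }
}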
Sure, but I mean the >16K case (in other words, the case where
DocsWriter fails, which presumably only DocsWriter knows about).
I want the option to ignore tokens larger than that instead of failing/
throwing an exception. Imagine I am charged with indexing some data
that I don't know anything about (e.g. computer forensics); my goal
would be to index as much as possible in my first raw pass, so that I
can then begin to explore the dataset. Having it completely discard
the document is not a good thing, but throwing away some large binary
tokens would be acceptable and robust (especially if I get warnings
about said tokens).
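
Something like this is what I have in mind (purely illustrative; the class
name and the System.err warning are made up, and it's written against the
current TokenStream attribute API):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** Drops oversized tokens (e.g. embedded binary blobs) with a warning. */
public final class DropOversizedTokensFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final int maxLength;

  public DropOversizedTokensFilter(TokenStream input, int maxLength) {
    super(input);
    this.maxLength = maxLength;
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      if (termAtt.length() <= maxLength) {
        return true;  // normal token, pass it through
      }
      // Warn and keep going rather than aborting the whole document.
      // (Real code would use proper logging; position increments of the
      // dropped token are not preserved in this sketch.)
      System.err.println("skipping token of length " + termAtt.length());
    }
    return false;
  }
}

The rest of the document still gets indexed, and the warnings tell me where
to look more closely on a second pass.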
-Grant