Doron Cohen <[EMAIL PROTECTED]> wrote:

> I like the approach of configuration of this behavior in Analysis
> (and so IndexWriter can throw an exception on such errors).
>
> It seems that this should be a property of Analyzer vs.
> just StandardAnalyzer, right?
>
> It can probably be a "policy" property, with two parameters:
> 1) maxLength, 2) action: chop/split/ignore/raiseException when
> generating too long tokens.
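
For concreteness, a rough sketch of what such a per-stream "policy" could
look like as a TokenFilter.  The class name, the Action values and the
wiring are purely illustrative -- nothing like this exists in Lucene today
-- and it's written against the older Token-returning next() API:

    import java.io.IOException;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    /** Hypothetical sketch only: applies a configurable action to any
     *  token longer than maxLength, before it reaches IndexWriter. */
    public class TokenLengthPolicyFilter extends TokenFilter {

      /** What to do with a token that exceeds maxLength. */
      public enum Action { CHOP, IGNORE, RAISE_EXCEPTION }

      private final int maxLength;
      private final Action action;

      public TokenLengthPolicyFilter(TokenStream input, int maxLength,
                                     Action action) {
        super(input);
        this.maxLength = maxLength;
        this.action = action;
      }

      public Token next() throws IOException {
        Token t;
        while ((t = input.next()) != null) {
          String text = t.termText();
          if (text.length() <= maxLength) {
            return t;                       // within the limit: pass through
          }
          switch (action) {
            case CHOP:
              // keep only the maxLength-char prefix; offsets cover that prefix
              return new Token(text.substring(0, maxLength),
                               t.startOffset(), t.startOffset() + maxLength);
            case IGNORE:
              continue;                     // silently drop, fetch the next token
            case RAISE_EXCEPTION:
              throw new IOException("token longer than " + maxLength + " chars");
          }
        }
        return null;                        // end of stream
      }
    }

A real version would also carry over the token type and position
increment; the "split" action (emitting the long token in maxLength-sized
chunks) is left out of the sketch for brevity.
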
Agreed, this should be generic/shared across all analyzers.  But maybe
for 2.3 we just truncate any too-long term to the max allowed size, and
then after 2.3 we make this a settable "policy"?

> Doron
>
> On Dec 21, 2007 10:46 PM, Michael McCandless <[EMAIL PROTECTED]>
> wrote:
> >
> > I think this is a good approach -- any objections?
> >
> > This way, IndexWriter is in-your-face (throws TermTooLongException
> > on seeing a massive term), but StandardAnalyzer is robust (silently
> > skips or prefixes the too-long terms).
> >
> > Mike
> >
> > Gabi Steinberg wrote:
> >
> > > How about defaulting to a max token size of 16K in
> > > StandardTokenizer, so that it never causes an IndexWriter
> > > exception, with an option to reduce that size?
> > >
> > > The backward incompatibility is limited then - tokens exceeding
> > > 16K will NOT cause an IndexWriter exception.  In 3.0 we can
> > > reduce that default to a useful size.
> > >
> > > The option to truncate the token can be useful, I think.  It will
> > > index the max-size prefix of the long tokens.  You can still find
> > > them, pretty accurately - this becomes a prefix search, but is
> > > unlikely to return multiple values because it's a long prefix.  It
> > > allows you to choose a relatively small max, such as 32 or 64,
> > > reducing the overhead caused by junk in the documents while
> > > minimizing the chance of not finding something.
> > >
> > > Gabi.
> > >
> > > Michael McCandless wrote:
> > >> Gabi Steinberg wrote:
> > >>> On balance, I think that dropping the document makes sense.  I
> > >>> think Yonik is right in that ensuring that keys are useful - and
> > >>> indexable - is the tokenizer's job.
> > >>>
> > >>> StandardTokenizer, in my opinion, should behave similarly to a
> > >>> person looking at a document and deciding which tokens should be
> > >>> indexed.  Few people would argue that a 16K block of binary data
> > >>> is useful for searching, but it's reasonable to suggest that the
> > >>> text around it is useful.
> > >>>
> > >>> I know that one can add the LengthFilter to avoid this problem,
> > >>> but this is not really intuitive; one does not expect the
> > >>> standard tokenizer to generate tokens that IndexWriter chokes on.
> > >>>
> > >>> My vote is to:
> > >>> - drop documents with tokens longer than 16K, as Mike and Yonik
> > >>>   suggested
> > >>> - because an uninformed user would start with StandardTokenizer,
> > >>>   I think it should limit token size to 128 bytes, and add
> > >>>   options to change that size, choose between truncating or
> > >>>   dropping longer tokens, and in no case produce tokens longer
> > >>>   than what IndexWriter can digest.
> > >> I like this idea, though we probably can't do that until 3.0 so
> > >> we don't break backwards compatibility?
> > > ...
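
To make the LengthFilter point above concrete, this is roughly what a
user has to wire up by hand today to keep over-long tokens away from
IndexWriter.  The analyzer class name and the exact filter chain are
illustrative only; note that LengthFilter drops offending tokens outright
rather than truncating them to a prefix:

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LengthFilter;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    /** Illustrative analyzer: StandardTokenizer plus a length cap, so
     *  that no term longer than maxTokenLength reaches IndexWriter. */
    public class LengthLimitedAnalyzer extends Analyzer {

      private final int maxTokenLength;

      public LengthLimitedAnalyzer(int maxTokenLength) {
        this.maxTokenLength = maxTokenLength;
      }

      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        // drop any token shorter than 1 char or longer than maxTokenLength
        result = new LengthFilter(result, 1, maxTokenLength);
        return result;
      }
    }

An IndexWriter opened with such an analyzer never sees a term longer than
maxTokenLength for its analyzed fields.

As for Gabi's point that truncated terms are still findable: if the
indexing chain chopped a long token down to, say, its first 64 chars, the
full original string can still be matched with a prefix query on that
same 64-char prefix.  A sketch, with hypothetical field and variable
names:

    // org.apache.lucene.index.Term, org.apache.lucene.search.PrefixQuery
    // 'longToken' is the original, over-long search string (hypothetical)
    String prefix = longToken.substring(0, 64);
    Query q = new PrefixQuery(new Term("contents", prefix));

Since a 64-char prefix is extremely selective, such a query is very
unlikely to match anything other than the chopped form of the original
token.
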