Doron Cohen <[EMAIL PROTECTED]> wrote:

> I like the approach of configuration of this behavior in Analysis
> (and so IndexWriter can throw an exception on such errors).
>
> It seems that this should be a property of Analyzer vs.
> just StandardAnalyzer, right?
>
> It can probably be a "policy" property, with two parameters:
> 1) maxLength, 2) action: chop/split/ignore/raiseException when
> generating too long tokens.
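
For concreteness, a rough sketch of what such a per-stream "policy" could
look like as a TokenFilter.  The class name, the Action values and the
wiring are purely illustrative -- nothing like this exists in Lucene today
-- and it's written against the older Token-returning next() API:

    import java.io.IOException;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    /** Hypothetical sketch only: applies a configurable action to any
     *  token longer than maxLength, before it reaches IndexWriter. */
    public class TokenLengthPolicyFilter extends TokenFilter {

      /** What to do with a token that exceeds maxLength. */
      public enum Action { CHOP, IGNORE, RAISE_EXCEPTION }

      private final int maxLength;
      private final Action action;

      public TokenLengthPolicyFilter(TokenStream input, int maxLength,
                                     Action action) {
        super(input);
        this.maxLength = maxLength;
        this.action = action;
      }

      public Token next() throws IOException {
        Token t;
        while ((t = input.next()) != null) {
          String text = t.termText();
          if (text.length() <= maxLength) {
            return t;                       // within the limit: pass through
          }
          switch (action) {
            case CHOP:
              // keep only the maxLength-char prefix; offsets cover that prefix
              return new Token(text.substring(0, maxLength),
                               t.startOffset(), t.startOffset() + maxLength);
            case IGNORE:
              continue;                     // silently drop, fetch the next token
            case RAISE_EXCEPTION:
              throw new IOException("token longer than " + maxLength + " chars");
          }
        }
        return null;                        // end of stream
      }
    }

A real version would also carry over the token type and position
increment; the "split" action (emitting the long token in maxLength-sized
chunks) is left out of the sketch for brevity.
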
Agreed, this should be generic/shared across all analyzers.  But maybe
for 2.3 we just truncate any too-long term to the max allowed size, and
then after 2.3 we make this a settable "policy"?

> Doron
>
> On Dec 21, 2007 10:46 PM, Michael McCandless <[EMAIL PROTECTED]>
> wrote:
> >
> > I think this is a good approach -- any objections?
> >
> > This way, IndexWriter is in-your-face (throws TermTooLongException
> > on seeing a massive term), but StandardAnalyzer is robust (silently
> > skips or prefixes the too-long terms).
> >
> > Mike
> >
> > Gabi Steinberg wrote:
> >
> > > How about defaulting to a max token size of 16K in
> > > StandardTokenizer, so that it never causes an IndexWriter
> > > exception, with an option to reduce that size?
> > >
> > > The backward incompatibility is limited then - tokens exceeding
> > > 16K will NOT cause an IndexWriter exception.  In 3.0 we can
> > > reduce that default to a useful size.
> > >
> > > The option to truncate the token can be useful, I think.  It will
> > > index the max-size prefix of the long tokens.  You can still find
> > > them, pretty accurately - this becomes a prefix search, but is
> > > unlikely to return multiple values because it's a long prefix.  It
> > > allows you to choose a relatively small max, such as 32 or 64,
> > > reducing the overhead caused by junk in the documents while
> > > minimizing the chance of not finding something.
> > >
> > > Gabi.
> > >
> > > Michael McCandless wrote:
> > >> Gabi Steinberg wrote:
> > >>> On balance, I think that dropping the document makes sense.  I
> > >>> think Yonik is right in that ensuring that keys are useful - and
> > >>> indexable - is the tokenizer's job.
> > >>>
> > >>> StandardTokenizer, in my opinion, should behave similarly to a
> > >>> person looking at a document and deciding which tokens should be
> > >>> indexed.  Few people would argue that a 16K block of binary data
> > >>> is useful for searching, but it's reasonable to suggest that the
> > >>> text around it is useful.
> > >>>
> > >>> I know that one can add the LengthFilter to avoid this problem,
> > >>> but this is not really intuitive; one does not expect the
> > >>> standard tokenizer to generate tokens that IndexWriter chokes on.
> > >>>
> > >>> My vote is to:
> > >>> - drop documents with tokens longer than 16K, as Mike and Yonik
> > >>>   suggested
> > >>> - because an uninformed user would start with StandardTokenizer,
> > >>>   I think it should limit token size to 128 bytes, and add
> > >>>   options to change that size, choose between truncating or
> > >>>   dropping longer tokens, and in no case produce tokens longer
> > >>>   than what IndexWriter can digest.
> > >> I like this idea, though we probably can't do that until 3.0 so
> > >> we don't break backwards compatibility?
> > > ...
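
To make the LengthFilter point above concrete, this is roughly what a
user has to wire up by hand today to keep over-long tokens away from
IndexWriter.  The analyzer class name and the exact filter chain are
illustrative only; note that LengthFilter drops offending tokens outright
rather than truncating them to a prefix:

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LengthFilter;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    /** Illustrative analyzer: StandardTokenizer plus a length cap, so
     *  that no term longer than maxTokenLength reaches IndexWriter. */
    public class LengthLimitedAnalyzer extends Analyzer {

      private final int maxTokenLength;

      public LengthLimitedAnalyzer(int maxTokenLength) {
        this.maxTokenLength = maxTokenLength;
      }

      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        // drop any token shorter than 1 char or longer than maxTokenLength
        result = new LengthFilter(result, 1, maxTokenLength);
        return result;
      }
    }

An IndexWriter opened with such an analyzer never sees a term longer than
maxTokenLength for its analyzed fields.

As for Gabi's point that truncated terms are still findable: if the
indexing chain chopped a long token down to, say, its first 64 chars, the
full original string can still be matched with a prefix query on that
same 64-char prefix.  A sketch, with hypothetical field and variable
names:

    // org.apache.lucene.index.Term, org.apache.lucene.search.PrefixQuery
    // 'longToken' is the original, over-long search string (hypothetical)
    String prefix = longToken.substring(0, 64);
    Query q = new PrefixQuery(new Term("contents", prefix));

Since a 64-char prefix is extremely selective, such a query is very
unlikely to match anything other than the chopped form of the original
token.
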