On Dec 31, 2007 11:37 AM, Doron Cohen <[EMAIL PROTECTED]> wrote:
> On Dec 31, 2007 6:10 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>
> > On Dec 31, 2007 5:53 AM, Michael McCandless <[EMAIL PROTECTED]> wrote:
> > > Doron Cohen <[EMAIL PROTECTED]> wrote:
> > > > I like the approach of configuration of this behavior in Analysis
> > > > (and so IndexWriter can throw an exception on such errors).
> > > >
> > > > It seems that this should be a property of Analyzer vs.
> > > > just StandardAnalyzer, right?
> > > >
> > > > It can probably be a "policy" property, with two parameters:
> > > > 1) maxLength, 2) action: chop/split/ignore/raiseException when
> > > > generating too long tokens.
> > >
> > > Agreed, this should be generic/shared to all analyzers.
> > >
> > > But maybe for 2.3, we just truncate any too-long term to the max
> > > allowed size, and then after 2.3 we make this a settable "policy"?
> >
> > But we already have a nice component model for analyzers...
> > why not just encapsulate truncation/discarding in a TokenFilter?
>
> Makes sense, especially for the implementation aspect.
> I'm not sure what API you have in mind:
>
> (1) leave that for applications, to append such a
> TokenFilter to their Analyzer (== no change),
>
> (2) DocumentsWriter to create such a TokenFilter
> under the cover, to force behavior that is defined (where?), or
>
> (3) have an IndexingTokenFilter assigned to IndexWriter,
> make the default such filter trim/ignore/whatever as discussed
> and then applications can set a different IndexingTokenFilter for
> changing the default behavior?
>
> I think I like the 3rd option - is this what you meant?

I meant (1)... it leaves the core smaller. I don't see any reason to
have logic to truncate or discard tokens in the core indexing code
(except to handle tokens >16k as an error condition).

Most of the time you want to catch those large tokens early on in the
chain anyway (put the filter right after the tokenizer). Doing it
later could cause exceptions or issues with other token filters that
might not be expecting huge tokens.

-Yonik
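
To make option (1) concrete, here is a rough sketch of what such an application-side filter might look like. It deliberately does not use the Lucene classes: SimpleTokenStream, ListTokenStream, LongTokenAction and LengthLimitFilter are all hypothetical stand-ins invented for this illustration, and tokens are plain strings. It only shows the shape of the idea, a length policy (maxLength plus an action) wrapped around whatever stage comes before it, so it can sit right after the tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for one stage of an analysis chain.
// This is NOT the Lucene TokenStream class; tokens are plain strings here.
interface SimpleTokenStream {
    /** Returns the next token text, or null when the stream is exhausted. */
    String next();
}

// Hypothetical tokenizer stand-in: replays a fixed list of tokens.
final class ListTokenStream implements SimpleTokenStream {
    private final Iterator<String> it;

    ListTokenStream(List<String> tokens) {
        this.it = tokens.iterator();
    }

    @Override
    public String next() {
        return it.hasNext() ? it.next() : null;
    }
}

// The two simplest actions from the thread: chop the token or drop it.
enum LongTokenAction { TRUNCATE, DISCARD }

// Hypothetical filter: wraps an upstream stage and applies the length policy.
final class LengthLimitFilter implements SimpleTokenStream {
    private final SimpleTokenStream input;
    private final int maxLength;
    private final LongTokenAction action;

    LengthLimitFilter(SimpleTokenStream input, int maxLength, LongTokenAction action) {
        if (maxLength < 1) {
            throw new IllegalArgumentException("maxLength must be >= 1");
        }
        this.input = input;
        this.maxLength = maxLength;
        this.action = action;
    }

    @Override
    public String next() {
        String token;
        while ((token = input.next()) != null) {
            if (token.length() <= maxLength) {
                return token; // short enough, pass through unchanged
            }
            if (action == LongTokenAction.TRUNCATE) {
                // Cuts by UTF-16 code unit; a real filter may want to
                // avoid splitting a surrogate pair.
                return token.substring(0, maxLength);
            }
            // DISCARD: skip this token and keep reading upstream.
        }
        return null; // upstream exhausted
    }

    /** Convenience helper for the demo: collects every remaining token. */
    static List<String> drain(SimpleTokenStream stream) {
        List<String> out = new ArrayList<>();
        String token;
        while ((token = stream.next()) != null) {
            out.add(token);
        }
        return out;
    }
}

final class LengthLimitDemo {
    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("ok", "waytoolongtoken", "fine");
        SimpleTokenStream chain =
            new LengthLimitFilter(new ListTokenStream(tokens), 5, LongTokenAction.TRUNCATE);
        System.out.println(LengthLimitFilter.drain(chain));
    }
}
```

Because the filter wraps its input, an application can place it directly after the tokenizer, so later filters never see an oversized token. The core indexing code then only needs to treat a token over its hard limit as an error.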
