I had forgotten about this but I agree it could also be used to handle challenging tokenizations.
In general I think our Tokenizers should throw away as little information as possible (at least have options to do so). Subsequent TokenFilters can always remove things ... I agree there's a risk of junk getting into indices ... but setting appropriate defaults should address this.

Mike McCandless

http://blog.mikemccandless.com

On Tue, Aug 14, 2012 at 12:26 PM, Steven A Rowe <[email protected]> wrote:

> Another possibility that would increase customizability via exposing
> information we currently throw away, proposed by Mike McCandless on
> LUCENE-3940[1] (though controversially[2]): in addition to tokenizing
> alpha/numeric char sequences, StandardTokenizer could also tokenize
> everything else.
>
> Then a NonAlphaNumericStopFilter could remove tokens with types other than
> <NUM> or <ALPHANUM>.
>
> As an alternative to NonAlphaNumericStopFilter, a separate
> WordDelimiterFilter-like filter could instead generate synonyms like "wi-fi"
> and "wifi" when it sees the token sequence ("wi"<ALPHANUM>, "-"<PUNCT>,
> "fi"<ALPHANUM>).
>
> Positions would need to be addressed. I assume the default behavior would be
> to remove position holes when non-alphanumeric tokens are stopped. (In fact,
> I can't think of any use case that would benefit from position holes for
> stopped non-alphanumeric tokens.)
>
> AFAICT, Robert Muir's objection to enabling this kind of thing[2] is that
> people would use such a tokenizer in default (don't-throw-anything-away)
> mode, and as a result, unwittingly put tons of junk tokens in their indexes.
> Maybe this concern could be addressed by making the default behavior the same
> as it is today, and providing the don't-throw-anything-away behavior as a
> non-default option? Standard*Analyzer* would then remain exactly as it is
> today, and wouldn't need to include a NonAlphaNumericStopFilter.
>
> Steve
>
> [1] Mike McCandless's post on LUCENE-3940
>     <https://issues.apache.org/jira/browse/LUCENE-3940?focusedCommentId=13243299&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13243299>
>
> [2] Robert Muir's subsequent post on LUCENE-3940
>     <https://issues.apache.org/jira/browse/LUCENE-3940?focusedCommentId=13244124&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13244124>
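(For concreteness, here is a rough sketch of what a NonAlphaNumericStopFilter like the one Steve describes above could look like. It is not an existing Lucene class, just an illustration of the idea: it assumes the <ALPHANUM>/<NUM> type strings StandardTokenizer emits, keeps those tokens, and silently drops everything else without leaving position holes.)

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

    /** Sketch only: keep <ALPHANUM> and <NUM> tokens, drop everything else. */
    public final class NonAlphaNumericStopFilter extends TokenFilter {
      private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

      public NonAlphaNumericStopFilter(TokenStream in) {
        super(in);
      }

      @Override
      public boolean incrementToken() throws IOException {
        while (input.incrementToken()) {
          String type = typeAtt.type();
          if ("<ALPHANUM>".equals(type) || "<NUM>".equals(type)) {
            return true;
          }
          // Dropped tokens are simply skipped; position increments are not
          // accumulated, so no position holes are left behind (the default
          // behavior Steve assumes above).
        }
        return false;
      }
    }

(The synonym-generating alternative would be a similar filter that buffers an <ALPHANUM>, <PUNCT>, <ALPHANUM> run and emits both "wi-fi" and "wifi" at the same position instead of dropping the punctuation.)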
> -----Original Message-----
> From: Robert Muir [mailto:[email protected]]
> Sent: Tuesday, August 14, 2012 1:27 AM
> To: [email protected]
> Subject: Re: Improve OOTB behavior: English word-splitting should default to
> autoGeneratePhraseQueries=true
>
> On Mon, Aug 13, 2012 at 1:58 PM, Chris Hostetter
> <[email protected]> wrote:
>>
>> : > http://unicode.org/reports/tr29/#Word_Boundaries
>> : >
>> : > ...I think it would be a good idea to add some new customization options
>> : > to StandardTokenizer (and StandardTokenizerFactory) to "tailor" the
>> : > behavior based on the various "tailored improvement" notes...
>>
>> : Use a CharFilter.
>>
>> Can you elaborate on how you would suggest implementing these "tailored
>> improvements" using a CharFilter?
>
> Generally the easiest way is to replace your ambiguous character (such
> as your hyphen-minus) with what your domain-specific knowledge tells
> you it should be.
> If you are indexing a dictionary where this ambiguous hyphen-minus is
> being used to separate syllables, then replace it with \u2027
> (hyphenation point), and it won't trigger word boundaries.
>
> But it really depends on how you want your whole analysis process to
> work. E.g. in the above example, if you want to treat "foo-bar" as
> really equivalent to "foobar", or you want to treat "U.S.A" as equivalent
> to "USA", because that's how you want your search to work, then I would
> just replace it with U+2060 (word joiner) and follow through with the
> NFKC_CF Unicode normalization filter in the ICU package, which will
> remove it, since it's a Format character.
>
> So I think you can handle all of your cases there with a simple regex
> charfilter, substituting the correct 'semantics' depending on how you
> ultimately want it to work, and then just applying NFKC_CF at the end.
>
> As for the last example, there's no need for the tokenizer to be involved.
> We already have ElisionFilter for this, and the Italian and French
> analyzers use it to remove a default (but configurable) set of
> contractions. The Solr example for these languages is set up with
> these, too.
>
> If you really don't like these dead-simple approaches, then just use
> the tokenizer in the ICU package, which is more flexible than the
> JFlex implementation: it lets you supply custom grammars at runtime,
> can split by script, etc.
>
> --
> lucidworks.com
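Concretely, the char-filter chain Robert describes might look something like the sketch below. This is illustrative only: it uses MappingCharFilter rather than a regex char filter (PatternReplaceCharFilter would work the same way), the WordJoinerAnalyzer class is just one way to wire it up, exact constructor signatures vary across Lucene versions, and ICUNormalizer2Filter is the ICU-module filter whose default mode is nfkc_cf.

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.charfilter.MappingCharFilter;
    import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
    import org.apache.lucene.analysis.icu.ICUNormalizer2Filter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    /** Sketch: map hyphen-minus to U+2060 before tokenizing; NFKC_CF drops it afterwards. */
    public final class WordJoinerAnalyzer extends Analyzer {
      @Override
      protected Reader initReader(String fieldName, Reader reader) {
        // Domain-specific choice: "foo-bar" should behave like "foobar", so map
        // the ambiguous hyphen-minus to U+2060 WORD JOINER, which does not
        // trigger a UAX#29 word boundary.
        NormalizeCharMap.Builder map = new NormalizeCharMap.Builder();
        map.add("-", "\u2060");
        return new MappingCharFilter(map.build(), reader);
      }

      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tok = new StandardTokenizer();
        // ICUNormalizer2Filter defaults to nfkc_cf; the word joiner is a Format
        // character, so it is removed here and the indexed term becomes "foobar".
        TokenStream result = new ICUNormalizer2Filter(tok);
        return new TokenStreamComponents(tok, result);
      }
    }

In Solr this would just be a <charFilter> (e.g. solr.MappingCharFilterFactory or solr.PatternReplaceCharFilterFactory) declared ahead of the tokenizer in the fieldType.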

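And for the elision case, a minimal sketch of the chain Robert points to (using "l'avion" as a stand-in example, and assuming ElisionFilter plus FrenchAnalyzer.DEFAULT_ARTICLES from analysis-common; the exact package for ElisionFilter has moved around between releases):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.fr.FrenchAnalyzer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.util.ElisionFilter;

    /** Sketch: StandardTokenizer keeps "l'avion" as one token; ElisionFilter strips the article. */
    public final class FrenchElisionAnalyzer extends Analyzer {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tok = new StandardTokenizer();
        // Removes leading contractions such as l', d', qu', so "l'avion" indexes as "avion".
        TokenStream result = new ElisionFilter(tok, FrenchAnalyzer.DEFAULT_ARTICLES);
        return new TokenStreamComponents(tok, result);
      }
    }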