I had forgotten about this but I agree it could also be used to handle challenging tokenizations.
In general I think our Tokenizers should throw away as little information as possible (at least have options to do so). Subsequent TokenFilters can always remove things ... I agree there's a risk of junk getting into indices ... but setting appropriate defaults should address this.

Mike McCandless

http://blog.mikemccandless.com

On Tue, Aug 14, 2012 at 12:26 PM, Steven A Rowe <[email protected]> wrote:

> Another possibility that would increase customizability via exposing
> information we currently throw away, proposed by Mike McCandless on
> LUCENE-3940[1] (though controversially[2]): in addition to tokenizing
> alpha/numeric char sequences, StandardTokenizer could also tokenize
> everything else.
>
> Then a NonAlphaNumericStopFilter could remove tokens with types other than
> <NUM> or <ALPHANUM>.
>
> As an alternative to NonAlphaNumericStopFilter, a separate
> WordDelimiterFilter-like filter could instead generate synonyms like "wi-fi"
> and "wifi" when it sees the token sequence ("wi"<ALPHANUM>, "-"<PUNCT>,
> "fi"<ALPHANUM>).
>
> Positions would need to be addressed. I assume the default behavior would be
> to remove position holes when non-alphanumeric tokens are stopped. (In fact,
> I can't think of any use case that would benefit from position holes for
> stopped non-alphanumeric tokens.)
>
> AFAICT, Robert Muir's objection to enabling this kind of thing[2] is that
> people would use such a tokenizer in default (don't-throw-anything-away)
> mode, and as a result, unwittingly put tons of junk tokens in their indexes.
> Maybe this concern could be addressed by making the default behavior the same
> as it is today, and providing the don't-throw-anything-away behavior as a
> non-default option? Standard*Analyzer* would then remain exactly as it is
> today, and wouldn't need to include a NonAlphaNumericStopFilter.
>
> Steve
>
> [1] Mike McCandless's post on LUCENE-3940
>     <https://issues.apache.org/jira/browse/LUCENE-3940?focusedCommentId=13243299&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13243299>
>
> [2] Robert Muir's subsequent post on LUCENE-3940
>     <https://issues.apache.org/jira/browse/LUCENE-3940?focusedCommentId=13244124&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13244124>
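(For concreteness, here is a rough sketch of what a NonAlphaNumericStopFilter like the one Steve describes above could look like. It is not an existing Lucene class, just an illustration of the idea: it assumes the <ALPHANUM>/<NUM> type strings StandardTokenizer emits, keeps those tokens, and silently drops everything else without leaving position holes.)

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

    /** Sketch only: keep <ALPHANUM> and <NUM> tokens, drop everything else. */
    public final class NonAlphaNumericStopFilter extends TokenFilter {
      private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

      public NonAlphaNumericStopFilter(TokenStream in) {
        super(in);
      }

      @Override
      public boolean incrementToken() throws IOException {
        while (input.incrementToken()) {
          String type = typeAtt.type();
          if ("<ALPHANUM>".equals(type) || "<NUM>".equals(type)) {
            return true;
          }
          // Dropped tokens are simply skipped; position increments are not
          // accumulated, so no position holes are left behind (the default
          // behavior Steve assumes above).
        }
        return false;
      }
    }

(The synonym-generating alternative would be a similar filter that buffers an <ALPHANUM>, <PUNCT>, <ALPHANUM> run and emits both "wi-fi" and "wifi" at the same position instead of dropping the punctuation.)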
> -----Original Message-----
> From: Robert Muir [mailto:[email protected]]
> Sent: Tuesday, August 14, 2012 1:27 AM
> To: [email protected]
> Subject: Re: Improve OOTB behavior: English word-splitting should default to
> autoGeneratePhraseQueries=true
>
> On Mon, Aug 13, 2012 at 1:58 PM, Chris Hostetter
> <[email protected]> wrote:
>>
>> : > http://unicode.org/reports/tr29/#Word_Boundaries
>> : >
>> : > ...I think it would be a good idea to add some new customization options
>> : > to StandardTokenizer (and StandardTokenizerFactory) to "tailor" the
>> : > behavior based on the various "tailored improvement" notes...
>>
>> : Use a CharFilter.
>>
>> Can you elaborate on how you would suggest implementing these "tailored
>> improvements" using a CharFilter?
>
> Generally the easiest way is to replace your ambiguous character (such
> as your hyphen-minus) with what your domain-specific knowledge tells
> you it should be.
> If you are indexing a dictionary where this ambiguous hyphen-minus is
> being used to separate syllables, then replace it with \u2027
> (hyphenation point), and it won't trigger word boundaries.
>
> But it really depends on how you want your whole analysis process to
> work. E.g. in the above example, if you want to treat "foo-bar" as
> really equivalent to "foobar", or you want to treat "U.S.A" as equivalent
> to "USA", because that's how you want your search to work, then I would
> just replace it with U+2060 (word joiner) and follow through with the
> NFKC_CF Unicode normalization filter in the ICU package, which will
> remove it, since it's a Format character.
>
> So I think you can handle all of your cases there with a simple regex
> charfilter, substituting the correct 'semantics' depending on how you
> ultimately want it to work, and then just applying NFKC_CF at the end.
>
> As for the last example, there's no need for the tokenizer to be involved.
> We already have ElisionFilter for this, and the Italian and French
> analyzers use it to remove a default (but configurable) set of
> contractions. The Solr example for these languages is set up with
> these, too.
>
> If you really don't like these dead-simple approaches, then just use
> the tokenizer in the ICU package, which is more flexible than the
> JFlex implementation: it lets you supply custom grammars at runtime,
> can split by script, etc.
>
> --
> lucidworks.com
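Concretely, the char-filter chain Robert describes might look something like the sketch below. This is illustrative only: it uses MappingCharFilter rather than a regex char filter (PatternReplaceCharFilter would work the same way), the WordJoinerAnalyzer class is just one way to wire it up, exact constructor signatures vary across Lucene versions, and ICUNormalizer2Filter is the ICU-module filter whose default mode is nfkc_cf.

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.charfilter.MappingCharFilter;
    import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
    import org.apache.lucene.analysis.icu.ICUNormalizer2Filter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    /** Sketch: map hyphen-minus to U+2060 before tokenizing; NFKC_CF drops it afterwards. */
    public final class WordJoinerAnalyzer extends Analyzer {
      @Override
      protected Reader initReader(String fieldName, Reader reader) {
        // Domain-specific choice: "foo-bar" should behave like "foobar", so map
        // the ambiguous hyphen-minus to U+2060 WORD JOINER, which does not
        // trigger a UAX#29 word boundary.
        NormalizeCharMap.Builder map = new NormalizeCharMap.Builder();
        map.add("-", "\u2060");
        return new MappingCharFilter(map.build(), reader);
      }

      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tok = new StandardTokenizer();
        // ICUNormalizer2Filter defaults to nfkc_cf; the word joiner is a Format
        // character, so it is removed here and the indexed term becomes "foobar".
        TokenStream result = new ICUNormalizer2Filter(tok);
        return new TokenStreamComponents(tok, result);
      }
    }

In Solr this would just be a <charFilter> (e.g. solr.MappingCharFilterFactory or solr.PatternReplaceCharFilterFactory) declared ahead of the tokenizer in the fieldType.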

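And for the elision case, a minimal sketch of the chain Robert points to (using "l'avion" as a stand-in example, and assuming ElisionFilter plus FrenchAnalyzer.DEFAULT_ARTICLES from analysis-common; the exact package for ElisionFilter has moved around between releases):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.fr.FrenchAnalyzer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.util.ElisionFilter;

    /** Sketch: StandardTokenizer keeps "l'avion" as one token; ElisionFilter strips the article. */
    public final class FrenchElisionAnalyzer extends Analyzer {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tok = new StandardTokenizer();
        // Removes leading contractions such as l', d', qu', so "l'avion" indexes as "avion".
        TokenStream result = new ElisionFilter(tok, FrenchAnalyzer.DEFAULT_ARTICLES);
        return new TokenStreamComponents(tok, result);
      }
    }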