Re: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true

Chris Hostetter Mon, 20 Aug 2012 12:44:46 -0700

: Because people impl the default algorithm for general purposes. Those
: tailorings are not 'mandatory'.


I didn't say they were mandatory, I said it seems like it would be a good 
idea to add options for them.

The spec says: "... implementations may override (tailor) the results to
meet the requirements of different environments or particular languages.
For some languages, it may also be necessary to have different tailored
word break rules for selection versus Whole Word Search" -- and I am
suggesting that our implementation (StandardTokenizer) should have options
for these suggested tailorings, to make it easier to meet the requirements
of the various environments/languages our users will care about.  That way
they can "turn on" these tailorings w/o being required to completely
re-implement the entire Tokenizer.

Or at the very least, provide recipes for people who want to achieve
those tailorings using other means -- ie: a doc somewhere that suggests
the "breaking between different scripts" tailoring can be achieved with a
simple PatternCharFilter seems fine, since the whole point is to break
more often than the default algorithm.  But for people who want to take
advantage of tailorings that break *less* often, I don't see any easy
way to do that on their own, so it seems like we should have
an option for them on the StandardTokenizer itself.
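
To illustrate the "break more often" recipe, here is a rough sketch in
plain Python (not Lucene code). It assumes a pre-tokenization pass that
inserts a space wherever the script changes, and for brevity it only
knows about ASCII Latin letters and one block of CJK ideographs. The
function name and character ranges are made up for this example.

```python
import re

# Assumption: only two scripts are modeled, ASCII Latin letters and the
# CJK Unified Ideographs block. A real recipe would need to cover more.
LATIN = "A-Za-z"
HAN = "\u4e00-\u9fff"

# Zero-width match at every position where one script is immediately
# followed by the other.
SCRIPT_BOUNDARY = re.compile(
    f"(?<=[{LATIN}])(?=[{HAN}])|(?<=[{HAN}])(?=[{LATIN}])"
)


def break_between_scripts(text: str) -> str:
    """Insert a space at each Latin/Han boundary so that a later
    tokenizer sees a word break there."""
    return SCRIPT_BOUNDARY.sub(" ", text)
```

This kind of filter can only add breaks. It has no way to remove a break
the tokenizer decides to make afterwards, which is why the "break less
often" tailorings need a different mechanism.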

(either that, or go with McCandless's idea to leave *EVERYTHING* in the
token stream, and offer TokenFilters that can re-constitute tokens in cases
where the user thinks StandardTokenizer applied breaks too often)
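
A toy model of that alternative, again in plain Python rather than the
Lucene TokenFilter API: it assumes the tokenizer emits punctuation such
as "-" as its own token, and a later (hypothetical) filter glues
word, "-", word sequences back together. The function name is invented
for the example.

```python
from typing import Iterable, List


def rejoin_hyphenated(tokens: Iterable[str]) -> List[str]:
    """Re-join 'word', '-', 'word' runs into one hyphenated token.

    Assumes the upstream tokenizer kept the hyphen as a separate token.
    """
    toks = list(tokens)
    out: List[str] = []
    i = 0
    while i < len(toks):
        tok = toks[i]
        # Keep absorbing "-", "word" pairs so chains like a-b-c collapse
        # into a single token.
        while (
            i + 2 < len(toks)
            and tok[-1:].isalnum()
            and toks[i + 1] == "-"
            and toks[i + 2].isalnum()
        ):
            tok = tok + "-" + toks[i + 2]
            i += 2
        out.append(tok)
        i += 1
    return out
```

A hyphen that is not sitting between two alphanumeric tokens is passed
through untouched, so a later filter could still drop or keep it.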


The hyphen situation is a prime example: if people want to index terms
that contain literal hyphen characters in the middle of them, w/o changing
those characters into something else, that seems like something that should
be possible using StandardTokenizer.  Circling back to the start of this
thread, it would also make it easier to address the crux of the concern
about using StandardTokenizer with English and if/when
autoGeneratePhraseQueries should be used...

 1) if you want the input "fly-swatter" to be treated as a single
    token, leave the default settings alone.
 2) if you want the input "fly-swatter" to be broken into two tokens,
    set this "wordBreakOnHyphens" option on the StandardTokenizer to true
    2a) if this is in a query analyzer, the "fly" and "swatter"
        tokens will be used to make a BooleanQuery by default
    2b) if you want a phrase query to be built instead, use
        autoGeneratePhraseQueries=true, but this will affect all
        cases where a word break was found.
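
The choices above can be modeled with a small Python sketch. This is not
Lucene or Solr code: "wordBreakOnHyphens" is only the option proposed in
this mail, the regex tokenizer is a stand-in for StandardTokenizer, and
the query "kinds" are just labels.

```python
import re
from typing import List, Tuple


def tokenize(text: str, word_break_on_hyphens: bool = False) -> List[str]:
    """Toy stand-in for the tokenizer: ASCII alphanumerics only."""
    if word_break_on_hyphens:
        pattern = r"[A-Za-z0-9]+"
    else:
        # Keep hyphens that sit between two alphanumeric runs.
        pattern = r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*"
    return re.findall(pattern, text)


def build_query(
    chunk: str,
    word_break_on_hyphens: bool = False,
    auto_generate_phrase_queries: bool = False,
) -> Tuple[str, List[str]]:
    """Return a (kind, tokens) pair describing the query that would be
    built for one whitespace-delimited chunk of user input."""
    tokens = tokenize(chunk, word_break_on_hyphens)
    if len(tokens) <= 1:
        return ("term", tokens)      # case 1: no word break found
    if auto_generate_phrase_queries:
        return ("phrase", tokens)    # case 2b
    return ("boolean", tokens)       # case 2a


print(build_query("fly-swatter"))
print(build_query("fly-swatter", word_break_on_hyphens=True))
print(build_query("fly-swatter", word_break_on_hyphens=True,
                  auto_generate_phrase_queries=True))
```

Note that in this model autoGeneratePhraseQueries applies to any chunk
that yields more than one token, not just hyphenated ones, which is the
caveat in 2b.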

..ie: stop forcing users to choose between phrase queries for
hyphenated words in English vs "sane" queries for all of the languages on
the planet that don't use whitespace between words, and instead let the
user make a choice about the hyphens directly - and then they can still
make a choice about the phrase queries if they want.




-Hoss
