: Because people implement the default algorithm for general purposes. Those
: tailorings are not 'mandatory'.

I didn't say they were mandatory; I said it seems like it would be a good
idea to add options for them.

The spec says: "... implementations may override (tailor) the results to
meet the requirements of different environments or particular languages.
For some languages, it may also be necessary to have different tailored
word break rules for selection versus Whole Word Search" -- and I am
suggesting that our implementation (StandardTokenizer) should have options
for these suggested tailorings, to make it easier to meet the requirements
of the various environments/languages our users will care about, so that
they can "turn on" these tailorings w/o being required to completely
re-implement the entire Tokenizer.

Or at the very least, provide recipes for people who want to achieve
those tailorings using other means -- ie: a doc somewhere that suggests
the "breaking between different scripts" tailoring can be achieved with a
simple PatternCharFilter seems fine, since the whole point is to break
more often than the default algorithm. But for people who want to take
advantage of tailorings that break *less* often, I don't see any easy
way for them to do that on their own, so it seems like we should have
an option to do them on the StandardTokenizer itself.
(either that, or go with McCandless's idea to leave *EVERYTHING* in the
TokenStream, and offer TokenFilters that can re-constitute tokens in cases
where the user thinks StandardTokenizer applied breaks too often)

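As a rough illustration of the "break more often" recipe, here is a
plain-Java sketch of the kind of regex such a char filter could apply. It
does not use any Lucene classes: the class name, the method name, and the
choice of Latin/Cyrillic as the two scripts are all assumptions made up for
this example, and a real setup would hand an equivalent pattern to a
regex-based CharFilter in front of StandardTokenizer.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: inserts a space wherever a Latin
// letter is directly adjacent to a Cyrillic letter, so that a tokenizer
// running afterwards sees a word boundary there.
class ScriptBoundarySketch {

    // Zero-width match at a Latin->Cyrillic or Cyrillic->Latin transition.
    private static final Pattern LATIN_CYRILLIC_BOUNDARY = Pattern.compile(
        "(?<=\\p{IsLatin})(?=\\p{IsCyrillic})|(?<=\\p{IsCyrillic})(?=\\p{IsLatin})");

    static String insertBreaks(String text) {
        // Replacing a zero-width match inserts the space without consuming
        // any of the surrounding characters.
        return LATIN_CYRILLIC_BOUNDARY.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // "lucene" immediately followed by a Cyrillic word (written with
        // unicode escapes to stay independent of the source file encoding).
        String mixed = "lucene\u043F\u043E\u0438\u0441\u043A";
        System.out.println(insertBreaks(mixed));
    }
}
```

This only works in the "more breaks" direction: once the text has been
split there is nothing comparable a char filter can do to stop the
tokenizer from breaking, which is the asymmetry described above.
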
The hyphen situation is a prime example: if people want to index terms
that contain literal hyphen characters in the middle of them, w/o changing
those characters into something else, that seems like something that should
be possible using StandardTokenizer. Circling back to the start of this
thread, it would also make it easier to address the crux of the concern
about using StandardTokenizer with English and if/when
autoGeneratePhraseQueries should be used...

1) if you want the input "fly-swatter" to be treated as a single
   token, leave the default settings alone.
2) if you want the input "fly-swatter" to be broken into two tokens,
   set this "wordBreakOnHyphens" option on the StandardTokenizer to true
   2a) if this is in a query analyzer, the "fly" and "swatter"
       tokens will be used to make a BooleanQuery by default
   2b) if you want a phrase query to be built instead, use
       autoGeneratePhraseQueries=true, but this will affect all
       cases where a word break was found.

..ie: stop forcing users to choose between phrase queries for
hyphenated words in English vs "sane" queries for all of the languages on
the planet that don't use whitespace between words, and instead let the
user make a choice about the hyphens directly - and then they can still
make a choice about the phrase queries if they want.

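To make the two independent choices concrete, here is a toy sketch in plain
Java. It is not StandardTokenizer and not the query parser: the
"wordBreakOnHyphens" flag is only the option proposed above (it does not
exist today), the whitespace/hyphen splitting is a stand-in for the real
word break rules, and the class and method names are invented for the
example.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the proposal: one switch for hyphen handling at tokenization
// time, and a separate switch for how multi-token input becomes a query.
class HyphenOptionSketch {

    // Stand-in for the tokenizer: always splits on whitespace, and on
    // hyphens only when the (hypothetical) wordBreakOnHyphens flag is set.
    static List<String> tokenize(String input, boolean wordBreakOnHyphens) {
        String delimiters = wordBreakOnHyphens ? "[\\s-]+" : "\\s+";
        List<String> tokens = new ArrayList<>();
        for (String token : input.trim().split(delimiters)) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Stand-in for the query parser: renders the tokens either as separate
    // boolean clauses or as one quoted phrase.
    static String describeQuery(List<String> tokens, boolean autoGeneratePhraseQueries) {
        if (tokens.size() == 1) {
            return tokens.get(0);
        }
        String joined = String.join(" ", tokens);
        return autoGeneratePhraseQueries ? "\"" + joined + "\"" : joined;
    }

    public static void main(String[] args) {
        // Case 1: flag left off, the hyphenated term stays one token.
        System.out.println(describeQuery(tokenize("fly-swatter", false), false));
        // Case 2a: flag on, two tokens become two boolean clauses.
        System.out.println(describeQuery(tokenize("fly-swatter", true), false));
        // Case 2b: flag on plus phrase generation, one phrase query.
        System.out.println(describeQuery(tokenize("fly-swatter", true), true));
    }
}
```

The point of keeping the two flags separate is that the hyphen decision no
longer has to be made indirectly through autoGeneratePhraseQueries.
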
-Hoss