: But it really depends on how you want your whole analysis process to
: work. e.g. in the above example if you want to treat "foo-bar" as
: really equivalent to foobar, or you want to treat U.S.A as equivalent

Unless I'm misreading the Word Boundary doc, the point of these types of 
tailorings is to treat "foo-bar" as a single token "foo-bar" including the 
hyphen -- ie: do not treat the hyphen as a "word" character.

If I understand correctly, you are arguing that instead of giving users 
an option to tell StandardTokenizer to treat characters like hyphen as a 
word character, they can achieve a tailoring like this by using a 
CharFilter to translate these to less-ambiguous characters that are 
already "word" characters according to the existing rules (ie: \u2027).
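
To make sure we're talking about the same thing, here is a rough sketch 
of that kind of translation in plain Java. It is NOT the Lucene CharFilter 
API; the class and method names are made up for illustration, and the set 
of hyphen characters is just an assumption about what someone might 
choose to map.

```java
import java.util.Set;

// Hypothetical helper, not part of Lucene: imitates what a mapping-style
// CharFilter would do to the text before the tokenizer sees it.
class HyphenMappingSketch {
    // Assumed set of hyphen-like characters to translate.
    static final Set<Character> HYPHENS = Set.of(
        '\u002D', '\u058A', '\u2010', '\u2011', '\uFE63', '\uFF0D');

    // U+2027 HYPHENATION POINT, the replacement character.
    static final char HYPHENATION_POINT = '\u2027';

    static String mapHyphens(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            // Replace any hyphen-like character, copy everything else.
            out.append(HYPHENS.contains(c) ? HYPHENATION_POINT : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String mapped = mapHyphens("foo-bar");
        // The original hyphen is gone from the mapped text.
        System.out.println(mapped.indexOf('-') == -1);
        System.out.println(mapped.indexOf(HYPHENATION_POINT) == 3);
    }
}
```

The mapped text has U+2027 where the hyphen was, so whatever token is 
built from it no longer carries the original character.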

I understand how that might be a good idea in general (to normalize the 
intra-word punctuation to improve matching if one query uses one type of 
hyphen and another query uses a different type of hyphen) but it still 
seems to violate the point of the tailoring according to the doc -- 
allowing people to preserve the actual character in identifiers...

>>> Treatment of hyphens, in particular, may be different in the case of
>>> processing identifiers than when using word break analysis for a Whole 
>>> Word Search or query, because when handling identifiers the goal will 
>>> be to parse maximal units corresponding to natural language “words,” 
>>> rather than to find smaller word units within longer lexical units 
>>> connected by hyphens.

The doc even points out specifically...

>>> Some or all of the following characters may be tailored to be in 
>>> MidLetter, depending on the environment:  
    ...
>>> U+002D ( - ) HYPHEN-MINUS
>>> U+058A ( ֊ ) ARMENIAN HYPHEN
>>> U+2010 ( ‐ ) HYPHEN
>>> U+2011 ( ‑ ) NON-BREAKING HYPHEN
>>> U+FE63 ( ﹣ ) SMALL HYPHEN-MINUS
>>> U+FF0D ( - ) FULLWIDTH HYPHEN-MINUS

...so seemingly, according to the word boundary docs, there should be an 
option to treat those individual characters as "MidLetter" characters w/o 
requiring the user to change them to \u2027 in a CharFilter.
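
For illustration only, here is a toy sketch of what such an option could 
look like. It is a drastic simplification of the word boundary rules (only 
"letter, MidLetter, letter does not break", in the spirit of WB6/WB7), it 
is not how StandardTokenizer is implemented, and the class name and the 
midLetter parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical toy tokenizer, not Lucene code: tokens are runs of letters,
// optionally joined by a single caller-supplied "MidLetter" character.
class MidLetterTailoringSketch {
    static List<String> tokenize(String text, Set<Character> midLetter) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // A MidLetter character stays in the token only when it sits
            // directly between two letters.
            boolean joins = midLetter.contains(c)
                && current.length() > 0
                && Character.isLetter(text.charAt(i - 1))
                && i + 1 < text.length()
                && Character.isLetter(text.charAt(i + 1));
            if (Character.isLetter(c) || joins) {
                current.append(c);
            } else if (current.length() > 0) {
                // Anything else ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hyphen tailored as MidLetter: the hyphen is kept in the token.
        System.out.println(tokenize("foo-bar baz", Set.of('-')));
        // No tailoring: the hyphen splits the token.
        System.out.println(tokenize("foo-bar baz", Set.of()));
    }
}
```

With the hyphen in the set, "foo-bar" comes out as one token with the 
actual hyphen preserved; with an empty set it splits into "foo" and "bar".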



-Hoss
