: But it really depends on how you want your whole analysis process to
: work. e.g. in the above example if you want to treat "foo-bar" as
: really equivalent to foobar, or you want to treat U.S.A as equivalent
Unless I'm misreading the Word Boundary doc, the point of these types of
tailorings is to treat "foo-bar" as a single token "foo-bar" including the
hyphen -- ie: treat the hyphen as a "word" character, not as a break.
If I understand correctly, you are arguing that instead of giving users
an option to tell StandardTokenizer to treat characters like hyphen as a
word character, they can achieve a tailoring like this by using a
CharFilter to translate these to less-ambiguous characters that are
already "word" characters according to the existing rules (ie: \u2027).
I understand how that might be a good idea in general (to normalize the
intra-word punctuation to improve matching if one query uses one type of
hyphen and another query uses a different type of hyphen) but it still
seems to violate the point of the tailoring according to the doc --
allowing people to preserve the actual character in identifiers...
>>> Treatment of hyphens, in particular, may be different in the case of
>>> processing identifiers than when using word break analysis for a Whole
>>> Word Search or query, because when handling identifiers the goal will
>>> be to parse maximal units corresponding to natural language “words,”
>>> rather than to find smaller word units within longer lexical units
>>> connected by hyphens.
The doc even points out specifically...
>>> Some or all of the following characters may be tailored to be in
>>> MidLetter, depending on the environment:
...
>>> U+002D ( - ) HYPHEN-MINUS
>>> U+058A ( ֊ ) ARMENIAN HYPHEN
>>> U+2010 ( ‐ ) HYPHEN
>>> U+2011 ( ‑ ) NON-BREAKING HYPHEN
>>> U+FE63 ( ﹣ ) SMALL HYPHEN-MINUS
>>> U+FF0D ( － ) FULLWIDTH HYPHEN-MINUS
...so seemingly, according to the word boundary docs, there should be an
option to treat those individual characters as "MidLetter" characters w/o
requiring the user to change them to \u2027 in a CharFilter
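
To make the difference concrete, here is a rough sketch in plain Java. It
is NOT StandardTokenizer and does not use any Lucene API: the class name
MidLetterSketch and its tokenize() method are made up for illustration,
and the logic is only a simplified approximation of the WB6/WB7 idea (no
break around a MidLetter character that sits between two letters). It
compares the default behaviour, a tailored MidLetter set, and the "map
to \u2027 first" approach.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified tokenizer for illustration only. It knows about
// letters/digits plus a caller-supplied set of "MidLetter" characters.
class MidLetterSketch {

    // Splits text into tokens. A character from midLetters is kept inside
    // a token only when it has a letter/digit on both sides.
    static List<String> tokenize(String text, String midLetters) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (midLetters.indexOf(c) >= 0
                    && current.length() > 0
                    && i + 1 < text.length()
                    && Character.isLetterOrDigit(text.charAt(i + 1))) {
                // MidLetter between two letters: do not break here.
                current.append(c);
            } else if (current.length() > 0) {
                // Anything else ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "foo-bar baz";

        // Default: hyphen is not a MidLetter, so it splits the word.
        System.out.println(tokenize(input, ""));

        // Tailored: hyphen added to MidLetter, original character preserved.
        System.out.println(tokenize(input, "-"));

        // Mapping approach: one token, but it now contains U+2027
        // instead of the hyphen the user actually typed.
        System.out.println(tokenize(input.replace('-', '\u2027'), "\u2027"));
    }
}
```

With the tailored set the token keeps the literal "-"; with the mapping
approach the text is still one token but the hyphen itself is gone, which
is the distinction I'm trying to get at.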
-Hoss