I am writing a tool that uses Lucene, and I immediately ran into a problem searching for words that contain internal hyphens (dashes). Looking at the StandardTokenizer, I saw that this is because there is no rule that will match <ALPHA> <P> <ALPHA> or <ALPHANUM> <P> <ALPHANUM>. From what I can tell from the source, every other term in a word containing any of the characters .,/-_ must contain at least one digit.
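To make the asymmetry concrete, here is a small Python sketch of the splitting behavior as I've described it. This is a crude approximation under my reading of the rule, not the actual JFlex grammar, and the function name and regexes are my own invention:

```python
import re

# Hypothetical approximation of the StandardTokenizer rule discussed above
# (NOT the real JFlex grammar): a compound joined by . , / - _ survives as a
# single token only when "every other" segment contains at least one digit;
# otherwise the punctuation acts as a delimiter and the pieces come out as
# separate <ALPHANUM> tokens.
def tokenize(text):
    tokens = []
    for word in re.findall(r"[A-Za-z0-9]+(?:[.,/_-][A-Za-z0-9]+)*", text):
        parts = re.split(r"[.,/_-]", word)
        # "every other term must contain at least one digit" -- read here as:
        # each segment at an odd position carries a digit (one interpretation).
        if len(parts) == 1 or all(re.search(r"\d", p) for p in parts[1::2]):
            tokens.append(word)      # e.g. "a-1" stays whole
        else:
            tokens.extend(parts)     # e.g. "a-b" -> ["a", "b"]
    return tokens

print(tokenize("a-1 a-b word-with-hyphen"))
```

Under this reading, "a-1" comes through intact while both "a-b" and "word-with-hyphen" are broken apart at the hyphens.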
I was wondering if someone could shed some light on why it was deemed necessary to prevent indexing a word like 'word-with-hyphen' without first splitting it into its constituent parts. The only reason I can think of (and the only one I've found) is handling words hyphenated at line breaks, though my first thought is that this would be undesired behavior: a word that was broken by a line break should be reconstructed, not split. In my case, the words are keywords that must remain as-is, searchable with the hyphen in place.

It was easy enough to modify the tokenizer to do what I need, so I'm not really asking for help there. I'm just curious why "a-1" is considered a single token while "a-b" is split. Anyone care to elaborate?

Thanks,
-Mike
