I am writing a tool that uses Lucene, and I immediately ran into a problem searching for words that contain internal hyphens (dashes). Looking at the StandardTokenizer, I saw that this is because there is no rule that will match <ALPHA> <P> <ALPHA> or <ALPHANUM> <P> <ALPHANUM>. From what I can tell from the source, every other term in a word containing any of the characters .,/-_ must contain at least one digit.
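To make the asymmetry concrete, here is a small Python sketch of the splitting behavior as I've described it. This is a crude approximation under my reading of the rule, not the actual JFlex grammar, and the function name and regexes are my own invention:

```python
import re

# Hypothetical approximation of the StandardTokenizer rule discussed above
# (NOT the real JFlex grammar): a compound joined by . , / - _ survives as a
# single token only when "every other" segment contains at least one digit;
# otherwise the punctuation acts as a delimiter and the pieces come out as
# separate <ALPHANUM> tokens.
def tokenize(text):
    tokens = []
    for word in re.findall(r"[A-Za-z0-9]+(?:[.,/_-][A-Za-z0-9]+)*", text):
        parts = re.split(r"[.,/_-]", word)
        # "every other term must contain at least one digit" -- read here as:
        # each segment at an odd position carries a digit (one interpretation).
        if len(parts) == 1 or all(re.search(r"\d", p) for p in parts[1::2]):
            tokens.append(word)      # e.g. "a-1" stays whole
        else:
            tokens.extend(parts)     # e.g. "a-b" -> ["a", "b"]
    return tokens

print(tokenize("a-1 a-b word-with-hyphen"))
```

Under this reading, "a-1" comes through intact while both "a-b" and "word-with-hyphen" are broken apart at the hyphens.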
I was wondering if someone could shed some light on why it was deemed necessary to prevent indexing a word like 'word-with-hyphen' without first splitting it into its constituent parts. The only reason I can think of (and the only one I've found) is handling words hyphenated at line breaks, though my first thought is that this would be undesired behavior: a word that was broken by a line break should be reconstructed, not split. In my case, the words are keywords that must remain as-is, searchable with the hyphen in place.

It was easy enough to modify the tokenizer to do what I need, so I'm not really asking for help there. I'm just curious why "a-1" is considered a single token while "a-b" is split. Anyone care to elaborate?

Thanks,
-Mike
