> I guess that the obvious question is "Which characters are
> considered 'punctuation characters'?".
Punctuation = ("_"|"-"|"/"|"."|",")
> In particular, does the analyzer consider "=" (equal) and
> ":" (colon) to be punctuation characters?
":" is a special character in QueryParser (if you are using it). If you want to
search for it, you need to escape it first. At index time this character is
ignored, like the punctuation characters: the string ahmet:arslan produces two
tokens, ahmet and arslan. The analyzer also breaks words at the "=" character
at both query and index time.
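As a rough illustration of the two points above, here is a small plain-Java
sketch. It does not use Lucene at all: escapeColon and roughTokens are
hypothetical helpers written only for this example, and the split rule is a
crude approximation that holds for simple input such as ahmet:arslan, not the
real StandardTokenizer grammar. In real code, QueryParser's own static
escape(String) method is the usual way to escape special characters.

```java
import java.util.ArrayList;
import java.util.List;

class ColonHandlingSketch {

    // Hypothetical helper: backslash-escape ':' so that a query parser which
    // treats it as the field separator reads it as a literal character.
    static String escapeColon(String term) {
        return term.replace(":", "\\:");
    }

    // Crude approximation for simple input only: split on every run of
    // characters that are neither letters nor digits. The real tokenizer
    // grammar is richer (it keeps e-mail addresses, host names, etc. whole).
    static List<String> roughTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(escapeColon("ahmet:arslan")); // ahmet\:arslan
        System.out.println(roughTokens("ahmet:arslan")); // [ahmet, arslan]
        System.out.println(roughTokens("key=value"));    // [key, value]
    }
}
```

The point is only that ":" and "=" end up as token boundaries, so they never
reach the index, while on the query side ":" has to be escaped before the
parser sees it.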
If you want to understand the behavior of StandardTokenizer, you need to look
at the file StandardTokenizerImpl.jflex. It recognizes the following as single
tokens: {ALPHANUM}, {APOSTROPHE}, {ACRONYM}, {COMPANY}, {EMAIL}, {HOST}, {NUM},
{CJ}, and {ACRONYM_DEP}, and ignores the rest. The file defines each of these
token types in a notation similar to regular expressions. You can change the
behavior of StandardTokenizer by editing this file and regenerating
StandardTokenizerImpl.java from it. There is also another jflex file named
WikipediaTokenizerImpl.jflex. By looking at it you can understand how new token
types can be added.
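To give a feel for what "token types defined like regular expressions" means,
here is a small plain-Java sketch. The three patterns are simplified stand-ins
that I made up for this example; they are NOT the definitions from
StandardTokenizerImpl.jflex, and the class and method names are hypothetical.
The sketch simply tries each pattern in a fixed order and reports the first
type that matches the whole string.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class TokenTypeSketch {

    // Simplified, made-up patterns for illustration only; the real grammar
    // in the .jflex file is more detailed and has more token types.
    static final Map<String, Pattern> TYPES = new LinkedHashMap<>();
    static {
        TYPES.put("EMAIL",
                Pattern.compile("[A-Za-z0-9._-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+"));
        TYPES.put("NUM", Pattern.compile("[0-9]+([.,][0-9]+)*"));
        TYPES.put("ALPHANUM", Pattern.compile("[A-Za-z0-9]+"));
    }

    // Return the name of the first type whose pattern matches the whole
    // candidate string, or "IGNORED" if none does.
    static String classify(String candidate) {
        for (Map.Entry<String, Pattern> entry : TYPES.entrySet()) {
            if (entry.getValue().matcher(candidate).matches()) {
                return entry.getKey();
            }
        }
        return "IGNORED";
    }

    public static void main(String[] args) {
        System.out.println(classify("ahmet"));             // ALPHANUM
        System.out.println(classify("1,5"));               // NUM
        System.out.println(classify("user@example.com"));  // EMAIL
        System.out.println(classify("="));                 // IGNORED
    }
}
```

Adding a new token type in this toy version means adding one more named
pattern, which is roughly the kind of change you would make in the jflex file
before regenerating the Java source.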
Ahmet