Hi Ahmet, Thanks for the clarification and information! That was exactly what I was looking for.
Jim

---- AHMET ARSLAN <[email protected]> wrote:
>
> > I guess that the obvious question is "Which characters are
> > considered 'punctuation characters'?".
>
> Punctuation = ("_"|"-"|"/"|"."|",")
>
> > In particular, does the analyzer consider "=" (equal) and
> > ":" (colon) to be punctuation characters?
>
> ":" is a special character in QueryParser (if you are using it). If you
> want to search for it, you need to escape it first. At index time this
> character is ignored, like the punctuation characters. The string
> ahmet:arslan will produce two tokens, ahmet and arslan. It also breaks
> words at the "=" character at both query and index time.
>
> If you want to understand the behavior of StandardTokenizer, you need to
> look at the file StandardTokenizerImpl.jflex. It recognizes the following
> as one token: {ALPHANUM}, {APOSTROPHE}, {ACRONYM}, {COMPANY}, {EMAIL},
> {HOST}, {NUM}, {CJ}, {ACRONYM_DEP} and ignores the rest. The file defines
> each of these token types in a form similar to regular expressions. You
> can change the behavior of StandardTokenizer by editing this file and
> generating StandardTokenizerImpl.java from it. There is also another
> jflex file named WikipediaTokenizerImpl.jflex. By looking at it you can
> understand how new token types can be added.
>
> Ahmet
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
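
For anyone finding this thread later, here is a rough sketch of the escaping step Ahmet mentions. It is written in plain Python only to illustrate the idea. The helper name escape_query_text and the character set are my own assumptions, based on the characters the classic Lucene query syntax treats as special. In Java, QueryParser.escape() should be used instead of anything hand-rolled.

```python
# Hypothetical helper, not part of Lucene: backslash-escape the characters
# that the classic query syntax treats as operators, so that a string such
# as ahmet:arslan is not parsed as field "ahmet", term "arslan".
# The character set below is an assumption taken from the query syntax docs.
SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&')


def escape_query_text(text):
    """Return text with every special query character prefixed by a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


if __name__ == "__main__":
    # ":" is escaped; "=" is not special to the query parser, so it is left alone.
    print(escape_query_text("ahmet:arslan"))
    print(escape_query_text("a=b"))
```

Escaping only stops the query parser from reading ":" as a field separator. As Ahmet explains, the analyzer still drops ":" and "=" when it tokenizes, so the escaped query still ends up as the two tokens ahmet and arslan.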
