Hi Nutch Experts, I understand that the default analyzer used during indexing is NutchDocumentAnalyzer. I would like more control over the term() parser specification, e.g., to remove non-ASCII characters. Could you please shed some light on how to achieve this?
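To make the goal concrete, here is a rough sketch in plain Java of the effect I am after. It is only an illustration of the desired behavior: the class and method names are made up, and it does not use any Nutch or Lucene API or touch the grammar itself.

```java
public class AsciiFilterSketch {
    // Hypothetical helper, not part of Nutch: keeps only chars in the
    // 7-bit ASCII range and silently drops everything else.
    static String stripNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c <= 0x7F) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // An accented Latin letter is dropped.
        System.out.println(stripNonAscii("caf\u00e9 2008")); // prints "caf 2008"
        // The two Gurmukhi digits from the TOKEN definition are dropped.
        System.out.println(stripNonAscii("x\u0a66-\u0a6f-y")); // prints "x--y"
    }
}
```

Ideally the analyzer itself would behave this way during tokenization, rather than my having to pre-process the text like this.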
I looked at the nonTerm() function, but I think it is used only for QueryParser, so I believe the term() function is what I need to change. My idea is to have the analyzer swallow non-ASCII characters, but I don't know how to do that. There are some Unicode entries in the TOKEN definition, including "\u0a66"-"\u0a6f" in the digit section. I wonder what will happen if I remove these non-ASCII characters from the definition. Thanks in advance for your help!
student_t
