Hi Nutch Experts,

I understand that the default analyzer used during indexing is
NutchDocumentAnalyzer. I would like more control over the term()
production in the parser specification, e.g., removing non-ASCII
characters. Could you please shed some light on how to achieve this?

I looked at the nonTerm() function, but I believe it is used only by the
QueryParser, so I think the term() function is the one I need to change.
My idea is to have the analyzer consume (and discard) those non-ASCII
characters, but I don't know how to do that.

There are some Unicode entries in the TOKEN definition, including
"\u0a66"-"\u0a6f" in the digit section. I wonder what would happen if I
removed these non-ASCII characters from the definition.
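
To make the goal concrete, here is a minimal plain-Java sketch of the
behaviour I am after. It does not use any Nutch or Lucene API; the class
name AsciiOnly and the method stripNonAscii are made up purely for
illustration, and I am assuming "non-ASCII" means any char value of 128
or above.

```java
class AsciiOnly {
    // Hypothetical helper (not part of Nutch): drops every char outside
    // the 7-bit ASCII range and keeps everything else unchanged.
    static String stripNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u0a66 and \u0a67 fall inside the "\u0a66"-"\u0a6f" digit range
        // mentioned above; \u00e9 is an accented Latin letter.
        System.out.println(stripNonAscii("caf\u00e9 \u0a66\u0a67 2008"));
    }
}
```

What I would like is for the analyzer to produce the same effect on the
text it tokenizes, ideally through the grammar itself rather than a
separate pre-processing step like this one.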

Thanks in advance for your help!

student_t

-- 
View this message in context: 
http://www.nabble.com/Quick-Questions-about-NutchAnalysis.jj-tp20423722p20423722.html
Sent from the Nutch - User mailing list archive at Nabble.com.
