Hi Enis,

Thanks a lot for the reply. I wasn't too sure about the .jj files; I'll try it out next week.
Regards,
-v.

-----Original Message-----
From: Enis Soztutar [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 27, 2006 7:39 PM
To: [email protected]
Subject: Re: Problem in URL tokenization

Vishal Shah wrote:
> Hi,
>
> If I understand correctly, there is a common tokenizer for all fields
> (URL, content, meta etc.). This tokenizer does not use the underscore
> character as a separator. Since a lot of URLs use underscore to separate
> different words, it would be better if the URLs are tokenized slightly
> differently from the other fields. I tried looking at the
> NutchDocumentAnalyzer and related files, but can't figure out a clear
> way to implement a new tokenizer for URLs only. Any ideas as to how to
> go about doing this?
>
> Thanks,
>
> -vishal.

Hi,

It is not straightforward to implement this without modifying the default tokenizing behavior.

First, copy NutchAnalysis.jj to URLAnalysis.jj (or any name you like), change

    | <#WORD_PUNCT: ("_"|"&")>

to

    | <#WORD_PUNCT: ("&")>

and recompile with JavaCC.

Then copy NutchDocumentTokenizer to URLTokenizer and change its NutchAnalysisTokenManager references to URLAnalysisTokenManager.

Next, write an Analyzer like this:

    private static class URLAnalyzer extends Analyzer {

      public URLAnalyzer() {
      }

      public TokenStream tokenStream(String field, Reader reader) {
        return new URLTokenizer(reader);
      }
    }

Finally, in NutchDocumentAnalyzer, change

    if ("anchor".equals(fieldName))
      analyzer = ANCHOR_ANALYZER;
    else
      analyzer = CONTENT_ANALYZER;

to

    if ("anchor".equals(fieldName))
      analyzer = ANCHOR_ANALYZER;
    else if ("url".equals(fieldName))
      analyzer = URL_ANALYZER;
    else
      analyzer = CONTENT_ANALYZER;

assuming URL_ANALYZER is an instance of URLAnalyzer.

I have not tested this, but it should work as expected.
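To make the effect of the WORD_PUNCT change concrete, here is a minimal standalone sketch in plain Java. It does not use the Nutch or Lucene classes, and it is not the JavaCC-generated tokenizer: the class name UrlTokenSketch, the tokenize helper, and the example URL are all hypothetical. As a simplification, it treats any character in the wordPunct string as part of a word, and every other non-alphanumeric character as a separator. It only illustrates why dropping "_" from the word-punctuation set splits URL words apart.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Nutch tokenizer.
public class UrlTokenSketch {

    // Splits text into lowercase tokens. Letters, digits, and any character
    // listed in wordPunct are kept inside a token; everything else separates.
    static List<String> tokenize(String text, String wordPunct) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c) || wordPunct.indexOf(c) >= 0) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String url = "http://example.com/open_source_search";

        // "_" counts as word punctuation, as in the default grammar.
        System.out.println(tokenize(url, "_&"));
        // prints [http, example, com, open_source_search]

        // "_" removed from word punctuation, as proposed for URLs.
        System.out.println(tokenize(url, "&"));
        // prints [http, example, com, open, source, search]
    }
}
```

With the underscore removed from the word-punctuation set, a query for "source" can match the URL, which is the behavior asked for in the original question.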
