Hi all,

[Follow-up post] I found the method myself.

1. Write a plugin for your own language. You can use the analysis-de and analysis-fr plugins as models for wrapping the Lucene analyzer in your plugin.
2. Add your plugin to the plugin include list (the plugin.includes property) in nutch-site.xml or nutch-sites.xml. You also need to add the language-identifier plugin to that list.

3. [For those whose language is not supported by the language identifier, or who think the language identifier is too slow]
   With the steps above alone, there is about a 50% chance you will fail if you are writing for a European language, and a 100% chance if you are writing for an East Asian language. The reason is that language-identifier fails: your language is not supported, and you will see the default indexer do the indexing task for you. There are 2 methods:

   A. Hack the language-identifier plugin.

      i. Hack all the classes except LanguageIdentifier.java. I will not go into the details here, because there are too many steps and I am writing in a rush, but the 2 principles are:
         a. Remove all references to a LanguageIdentifier object, including the declarations and the method calls made through those references. This is much easier if you have an IDE like NetBeans or Eclipse.
         b. Remember to change the language variable in the inner class of HTMLLanguageParser, or change the default language that is returned when all the cases fail.

      ii. Change langmappings.properties to list the actual encodings of your language. Include all possible spellings, in lower case, e.g.

             za = za, zah, utf, utf8

          For the full list, you can refer to the list of encodings supported by iconv. Most systems support everything, and the list shows the spelling variants of your encoding (utf-8 can be utf-8 or utf_8 or utf8!). Also, if the target encoding name contains - or _, you may need to include the first part on its own, like utf-8 written as utf and utf8 in the example. Then build language-identifier again.

      * For XML, you need to create your own parser based on HTMLLanguageParser. But you will fall into the default case quite soon if the XML is written badly enough that it uses UTF-8 as its encoding but has no lang element.

   B. Hack Indexer.java, as mentioned in this post:
      http://www.mail-archive.com/[EMAIL PROTECTED]/msg05952.html

* For CJK, the default CJKAnalyzer can handle most cases (especially if you convert the documents to Unicode), so just let zh/ja/kr go through as the default case.

Vinci wrote:
>
> Hi all,
>
> How can I change the analyzer which is used by the indexer for specific
> language? Also, can I use all the analyzer that I see in luke?
>
> Thank you.
>

--
View this message in context: http://www.nabble.com/Change-of-analyzer-for-specific-language-tp16065385p16067807.html
Sent from the Nutch - User mailing list archive at Nabble.com.
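
P.S. For the langmappings.properties change under A.ii, here is a rough sketch of how the spelling variants of an encoding name could be generated instead of typed by hand. It is written in Java only because Nutch is. The class and method names (LangMappingLine, variants, mappingLine) are made up for illustration and are not part of Nutch, and which spellings are really needed is an assumption: the sketch simply emits the first part, the form without a separator, and the hyphen and underscore forms, all in lower case.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: builds one langmappings.properties line.
public class LangMappingLine {

    // Lower-case spelling variants of one encoding name such as "UTF-8".
    public static Set<String> variants(String encoding) {
        String lower = encoding.toLowerCase(Locale.ROOT);
        String[] parts = lower.split("[-_]");
        Set<String> out = new LinkedHashSet<>();   // keeps insertion order, drops duplicates
        out.add(parts[0]);                         // first part on its own: "utf"
        out.add(String.join("", parts));           // no separator: "utf8"
        out.add(String.join("-", parts));          // hyphen: "utf-8"
        out.add(String.join("_", parts));          // underscore: "utf_8"
        return out;
    }

    // One "code = value, value, ..." line: the code itself, its aliases, then the encoding variants.
    public static String mappingLine(String langCode, List<String> aliases, List<String> encodings) {
        Set<String> values = new LinkedHashSet<>();
        values.add(langCode);
        values.addAll(aliases);
        for (String enc : encodings) {
            values.addAll(variants(enc));
        }
        return langCode + " = " + String.join(", ", values);
    }

    public static void main(String[] args) {
        System.out.println(mappingLine("za", Arrays.asList("zah"), Arrays.asList("UTF-8")));
    }
}
```

Running main prints "za = za, zah, utf, utf8, utf-8, utf_8", which is the example line from A.ii plus the hyphen and underscore spellings. Drop whichever variants turn out not to be needed for your documents.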
