Re: Change of analyzer for specific language

Vinci Sat, 15 Mar 2008 06:41:46 -0700

Hi all,
[Follow-up post]
I found a way to do this myself.
1. Write a plugin for your own language. You can use analysis-de and
analysis-fr as references for wrapping a Lucene analyzer in your plugin.

2. Add your plugin to the plugin.includes list in nutch-site.xml (or
nutch-default.xml). You also need to add the language-identifier plugin to
that list.

3. [For those whose language is not supported by the language identifier, or
who think the language identifier is too slow]
With only the two steps above, there is roughly a 50% chance of failure if
you are writing for a European language, and it will always fail if you are
writing for an East Asian language.

The reason is that language-identifier fails: your language is not
supported, so the default analyzer ends up doing the indexing for you.
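
To make that failure mode concrete, here is a minimal Java sketch of how an
analyzer might be picked by language code, with a fallback. This is not Nutch
code: AnalyzerLookupSketch, register and select are hypothetical names, and
plain strings stand in for real analyzer objects. The only assumption is that
the analyzer is looked up by whatever language code the identifier produced.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a language -> analyzer registry (not a Nutch class). */
final class AnalyzerLookupSketch {
    /** Name used here for the fallback analyzer; an assumption for illustration. */
    static final String DEFAULT_ANALYZER = "default";

    private final Map<String, String> byLanguage = new HashMap<>();

    /** Registers the analyzer name to use for one language code. */
    void register(String lang, String analyzerName) {
        byLanguage.put(lang, analyzerName);
    }

    /** Returns the analyzer for lang, or the default when lang is null or unregistered. */
    String select(String lang) {
        if (lang == null) {
            return DEFAULT_ANALYZER;
        }
        return byLanguage.getOrDefault(lang, DEFAULT_ANALYZER);
    }

    public static void main(String[] args) {
        AnalyzerLookupSketch lookup = new AnalyzerLookupSketch();
        lookup.register("de", "analysis-de");
        System.out.println(lookup.select("de"));   // analysis-de
        System.out.println(lookup.select("zh"));   // default (no plugin registered)
        System.out.println(lookup.select(null));   // default (identification failed)
    }
}
```

If the identifier never produces your language code, the lookup never reaches
your plugin, however correct the plugin itself is.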

There are two ways around this:

A. Hack the language-identifier plugin.
i. Modify every class except LanguageIdentifier.java. I won't go into detail
here because there are too many steps and I am writing in a rush, but the two
principles are:
a. Remove every reference to a LanguageIdentifier object, including the
declarations and the method calls made through those references. This is much
easier if you have an IDE like NetBeans or Eclipse.
b. Remember to change the language variable in the inner class of
HTMLLanguageParser, or change the default language returned when all the
cases fail.
ii. Change langmappings.properties to list the actual encodings of your
language, with every possible spelling, in lower case, e.g.
za = za, zah, utf, utf8
For the full list, refer to the encodings supported by iconv. Most systems
support everything, and you will see the variants for your language (utf-8,
for example, can appear as utf-8, utf_8 or utf8). If the target encoding
contains - or _, you may also need to include its first part, which is why
the example lists utf as well as utf8 for utf-8.

Then build language-identifier again.
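
The alias lookup from step ii can be sketched in Java like this. It is only a
sketch under assumptions, not the plugin's actual code: LangMappingSketch,
load and resolve are hypothetical names, and I am assuming the lookup tries
the whole lower-cased value first and then the part before the first - or _.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;

/** Hypothetical helper (not part of Nutch): maps a raw value to a language code. */
final class LangMappingSketch {

    /** Builds an alias -> code map from langmappings-style text ("code = alias1, alias2"). */
    static Map<String, String> load(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> aliasToCode = new HashMap<>();
        for (String code : props.stringPropertyNames()) {
            aliasToCode.put(code, code);
            for (String alias : props.getProperty(code).split(",")) {
                String normalized = alias.trim().toLowerCase(Locale.ROOT);
                if (!normalized.isEmpty()) {
                    aliasToCode.put(normalized, code);
                }
            }
        }
        return aliasToCode;
    }

    /** Tries the whole value, then the part before the first '-' or '_'; null if unmapped. */
    static String resolve(Map<String, String> aliasToCode, String raw) {
        if (raw == null) {
            return null;
        }
        String value = raw.trim().toLowerCase(Locale.ROOT);
        String code = aliasToCode.get(value);
        if (code != null) {
            return code;
        }
        String firstPart = value.split("[-_]", 2)[0];
        return aliasToCode.get(firstPart);
    }

    public static void main(String[] args) {
        Map<String, String> map = load("za = za, zah, utf, utf8\n");
        System.out.println(resolve(map, "UTF-8"));      // za (matched through the first part "utf")
        System.out.println(resolve(map, "utf8"));       // za (matched directly)
        System.out.println(resolve(map, "iso-8859-1")); // null (no alias listed)
    }
}
```

Under that assumption, leaving utf out of the alias list would make utf-8 and
utf_8 fall through to the default case even though utf8 is listed.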

*For XML, you need to create your own parser based on HTMLLanguageParser.
You will still fall into the default case quickly if the XML is written
badly enough that it declares UTF-8 as its encoding but carries no lang
element.
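
A rough Java sketch of that XML fallback, using only the JDK's DOM parser. It
is not based on HTMLLanguageParser: XmlLangSketch and languageOf are
hypothetical names, and I am assuming the language would be carried in an
xml:lang attribute on the root element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

/** Hypothetical helper (not part of Nutch): reads xml:lang from the root element. */
final class XmlLangSketch {

    /** Returns the root element's xml:lang in lower case, or defaultLang if absent or unparsable. */
    static String languageOf(String xml, String defaultLang) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            // getAttribute returns an empty string when the attribute is missing.
            String lang = root.getAttribute("xml:lang");
            return lang.isEmpty() ? defaultLang : lang.toLowerCase(Locale.ROOT);
        } catch (Exception e) {
            // Malformed XML: fall back instead of failing.
            return defaultLang;
        }
    }

    public static void main(String[] args) {
        String tagged = "<doc xml:lang=\"zh\"><p>text</p></doc>";
        String untagged = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><doc><p>text</p></doc>";
        System.out.println(languageOf(tagged, "en"));   // zh
        System.out.println(languageOf(untagged, "en")); // en (default case)
    }
}
```

The second call is the bad case described above: the encoding is declared,
but nothing in the document says which language it is.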

B. Hack Indexer.java, as described in this post:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg05952.html
*For CJK, the default CJKAnalyzer can handle most cases (especially if you
convert the documents to Unicode), so just let zh/ja/ko go through the
default case.

Vinci wrote:
> 
> Hi all,
> 
> How can I change the analyzer that the indexer uses for a specific
> language? Also, can I use all the analyzers that I see in Luke?
> 
> Thank you.
> 

-- 
View this message in context: 
http://www.nabble.com/Change-of-analyzer-for-specific-language-tp16065385p16067807.html
Sent from the Nutch - User mailing list archive at Nabble.com.
