The analyzer is selected by the "lang" property of the doc object. That
property is set by the LanguageIdentifier plug-in. Unfortunately,
LanguageIdentifier does not support Chinese at the moment. If you only need to
deal with Chinese and English documents, you can hardcode the lang property to
"zh". In "Indexer.java", modify the code as below:

// NutchAnalyzer analyzer = factory.get(doc.get("lang"));
NutchAnalyzer analyzer = factory.get("zh");
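
If hardcoding "zh" for every document is too blunt, a softer variant is to use the detected language when there is one and fall back to "zh" only when it is missing. The sketch below is self-contained and does not use the real Nutch classes: AnalyzerSelectionSketch, select and selectWithFallback are hypothetical names, the map stands in for the "lang" parameter registered in plugin.xml, and it assumes the factory returns a default analyzer when no plug-in matches the language.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the analyzer lookup; NOT the real Nutch AnalyzerFactory.
public class AnalyzerSelectionSketch {

    // Language code -> analyzer name, mimicking <parameter name="lang" value="zh" />.
    private static final Map<String, String> ANALYZERS = new HashMap<>();
    static {
        ANALYZERS.put("zh", "ChineseAnalyzer");
    }

    // Assumed default used when no plug-in is registered for the language.
    static final String DEFAULT_ANALYZER = "NutchDocumentAnalyzer";

    // Returns the analyzer registered for lang, or the default when lang is null or unknown.
    static String select(String lang) {
        if (lang == null) {
            return DEFAULT_ANALYZER;
        }
        return ANALYZERS.getOrDefault(lang, DEFAULT_ANALYZER);
    }

    // Uses the detected language when present, otherwise the given fallback language.
    static String selectWithFallback(String detectedLang, String fallbackLang) {
        boolean missing = detectedLang == null || detectedLang.isEmpty();
        return select(missing ? fallbackLang : detectedLang);
    }

    public static void main(String[] args) {
        System.out.println(select(null));                    // no lang set on the doc
        System.out.println(selectWithFallback(null, "zh"));  // missing lang, assume Chinese
        System.out.println(selectWithFallback("en", "zh"));  // detected lang is kept
    }
}
```

In Indexer.java the same idea would amount to checking doc.get("lang") for null before calling factory.get, which keeps English documents on the default analyzer once language detection works for them.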

----- Original Message ----- 
From: "zhao xiuwen" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Tuesday, April 03, 2007 12:33 AM
Subject: Replace CJK language analyzer in nutch


> NutchAnalysis segments CJK terms word-by-word. In order to make Nutch
> support Chinese well, I developed a plug-in for Chinese.
> I placed ChineseAnalyzer.class (extends NutchAnalyzer) and
> ChineseTokenizer.class in a jar-nutch-0.8.1/plugins/analysis-zh.jar, and
> configured plugin.xml and nutch-site.xml. I think nutch should
> replace NutchDocumentAnalyzer with ChineseAnalyzer, but it didn't. What's
> wrong?
> plugin.xml configuration:
> <?xml version="1.0" encoding="UTF-8"?>
> 
> <plugin
>   id="analysis-zh"
>   name="Chinese Analysis Plug-in"
>   version="1.0.0"
>   provider-name="org.apache.nutch">
> 
>   <runtime>
>      <library name="analysis-zh.jar">
>         <export name="*"/>
>      </library>
>   </runtime>
> 
>   <requires>
>      <import plugin="nutch-extensionpoints" />
>   </requires>
> 
>   <extension id="org.apache.nutch.analysis.zh"
>              name="ChineseAnalyzer"
>              point="org.apache.nutch.analysis.NutchAnalyzer">
> 
>      <implementation id="ChineseAnalyzer"
>                      class="org.apache.nutch.analysis.zh.ChineseAnalyzer">
>        <parameter name="lang" value="zh" />
>      </implementation>
> 
>   </extension>
> 
> </plugin>
> 
> Here are some excerpts from nutch-site.xml:
> 
> <property>
>  <name>plugin.includes</name>
> 
> <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|analysis-zh</value>
>  <description> indexing and search plugins.
>  </description>
> </property>
> 
> Here are some excerpts from the hadoop log:
> 
> 2007-04-02 21:35:32,687 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
> 2007-04-02 21:35:32,687 INFO  plugin.PluginRepository - Registered Plugins:
> 2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  CyberNeko HTML Parser (lib-nekohtml)
> 2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Site Query Filter (query-site)
> 2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Html Parse Plug-in (parse-html)
> 2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Regex URL Filter Framework (lib-regex-filter)
> 2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Basic Indexing Filter (index-basic)
> 2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Basic Summarizer Plug-in (summary-basic)
> 2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Text Parse Plug-in (parse-text)
> 2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  JavaScript Parser (parse-js)
> 2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Regex URL Filter (urlfilter-regex)
> 2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Basic Query Filter (query-basic)
> 2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  HTTP Framework (lib-http)
> 2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  URL Query Filter (query-url)
> 2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Chinese Analysis Plug-in (analysis-zh)
> 
> ......
> 
> 2007-04-02 21:36:26,234 INFO  indexer.Indexer -  Indexing [http://2008.163.com/] with analyzer [EMAIL PROTECTED] (null)
> 2007-04-02 21:36:26,359 INFO  indexer.Indexer -  Indexing [http://auto.163.com/] with analyzer [EMAIL PROTECTED] (null)
>
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers