Thanks, qi wu. After the modification, Nutch invoked my plug-in. I will try to
modify the LanguageIdentifier plug-in.
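One direction for that change, as a rough sketch only: report "zh" when a large share of the letters in the page text are Han ideographs. The class name `HanHeuristic`, the method `guessLang`, and the 0.3 threshold are all hypothetical and are not part of LanguageIdentifier or Nutch. Han characters are also used in Japanese text, so this cannot tell the two apart.

```java
// Hypothetical heuristic, not the LanguageIdentifier API: report "zh" when a
// sufficient share of the letters in the text are Han ideographs.
final class HanHeuristic {
    // Assumed threshold; it would need tuning on real pages.
    static final double HAN_RATIO = 0.3;

    // True if the code point lies in one of the main Han ideograph blocks.
    static boolean isHan(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
            || block == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS;
    }

    // Returns "zh" for Han-heavy text, otherwise null (meaning: undecided).
    static String guessLang(String text) {
        int letters = 0;
        int han = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isLetter(cp)) {
                letters++;
                if (isHan(cp)) {
                    han++;
                }
            }
        }
        if (letters == 0) {
            return null;
        }
        return (double) han / letters >= HAN_RATIO ? "zh" : null;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file ASCII-only.
        System.out.println(guessLang("\u4f60\u597d\u4e16\u754c"));
        System.out.println(guessLang("hello world"));
    }
}
```

A null result would leave the lang property unset, so that whatever default the indexer applies still takes effect.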
2007/4/3, qi wu <[EMAIL PROTECTED]>:
The selection of analyzers is triggered by the "lang" property in the
doc object. The lang property of doc is set by the LanguageIdentifier
plug-in. Unfortunately, LanguageIdentifier does not support Chinese
yet. If you only need to deal with Chinese and English documents,
you can hardcode the lang property of doc to "zh". In
"Indexer.java", modify the code as below:
// NutchAnalyzer analyzer = factory.get(doc.get("lang"));
NutchAnalyzer analyzer = factory.get("zh");
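Hardcoding "zh" sends every document, English ones included, through the Chinese analyzer. A slightly softer variant, sketched here under assumptions, falls back to "zh" only when the document carries no lang value. The class `LangFallback` and the method `resolveLang` are hypothetical helpers and are not part of Nutch.

```java
// Hypothetical helper, not part of Nutch: choose the language key used to
// look up an analyzer, falling back to a default when none was detected.
final class LangFallback {
    // Assumption: the crawl is mostly Chinese, so "zh" is the default.
    static final String DEFAULT_LANG = "zh";

    // Returns the detected language if present, otherwise the default.
    static String resolveLang(String detected) {
        if (detected == null || detected.trim().isEmpty()) {
            return DEFAULT_LANG;
        }
        return detected.trim();
    }

    public static void main(String[] args) {
        System.out.println(resolveLang(null));
        System.out.println(resolveLang("en"));
    }
}
```

In Indexer.java the lookup would then read factory.get(LangFallback.resolveLang(doc.get("lang"))) instead of passing the literal "zh".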
----- Original Message -----
From: "zhao xiuwen" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Tuesday, April 03, 2007 12:33 AM
Subject: Replace CJK language analyzer in nutch
> NutchAnalysis segments CJK terms word-by-word. In order to make Nutch
> support Chinese well, I developed a plug-in for Chinese.
> I placed ChineseAnalyzer.class (extends NutchAnalyzer) and
> ChineseTokenizer.class in a jar (nutch-0.8.1/plugins/analysis-zh.jar), and
> configured plugin.xml and nutch-site.xml. I expected Nutch to
> replace NutchDocumentAnalyzer with ChineseAnalyzer, but it didn't.
> What's wrong?
> plugin.xml configuration:
> <?xml version="1.0" encoding="UTF-8"?>
>
> <plugin
>    id="analysis-zh"
>    name="Chinese Analysis Plug-in"
>    version="1.0.0"
>    provider-name="org.apache.nutch">
>
>    <runtime>
>       <library name="analysis-zh.jar">
>          <export name="*"/>
>       </library>
>    </runtime>
>
>    <requires>
>       <import plugin="nutch-extensionpoints" />
>    </requires>
>
>    <extension id="org.apache.nutch.analysis.zh"
>               name="ChineseAnalyzer"
>               point="org.apache.nutch.analysis.NutchAnalyzer">
>
>       <implementation id="ChineseAnalyzer"
>                       class="org.apache.nutch.analysis.zh.ChineseAnalyzer">
>          <parameter name="lang" value="zh" />
>       </implementation>
>
>    </extension>
>
> </plugin>
>
> Here are some excerpts from nutch-site.xml:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|analysis-zh</value>
>   <description> indexing and search plugins.
>   </description>
> </property>
>
> Here are some excerpts from the hadoop log:
>
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Registered Plugins:
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Site Query Filter (query-site)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - HTTP Framework (lib-http)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - URL Query Filter (query-url)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Chinese Analysis Plug-in (analysis-zh)
>
> ......
>
> 2007-04-02 21:36:26,234 INFO indexer.Indexer - Indexing [http://2008.163.com/] with analyzer [EMAIL PROTECTED] (null)
> 2007-04-02 21:36:26,359 INFO indexer.Indexer - Indexing [http://auto.163.com/] with analyzer [EMAIL PROTECTED] (null)
>
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers