Hello, all.

I'm having a bit of trouble with Nutch 1.0 and multi-language support:

I have a fresh install of Nutch and two analysis plugins I'd like to turn on:
analysis-de (German) and analysis-ge (Georgian).
Here are the contents of my seed file:
-----------------------
http://212.72.133.54/l/test.html
http://212.72.133.54/l/de.html
-----------------------
The first page is Georgian, the other is German. When I run

bin/nutch crawl seed -dir crawl -threads 10 -depth 2

there is not the slightest sign of anything calling the analysis
plug-ins, even though hadoop.log clearly states that they are
registered and active:
-----------------------
2009-09-29 16:39:13,328 INFO  crawl.Crawl - crawl started in: crawl
2009-09-29 16:39:13,328 INFO  crawl.Crawl - rootUrlDir = seed
2009-09-29 16:39:13,328 INFO  crawl.Crawl - threads = 10
2009-09-29 16:39:13,328 INFO  crawl.Crawl - depth = 2
2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: starting
2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: urlDir: seed
2009-09-29 16:39:13,390 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2009-09-29 16:39:13,421 WARN  mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-09-29 16:39:15,546 INFO  plugin.PluginRepository - Plugins: looking in: C:\cygwin\opt\nutch\plugins
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Registered Plugins:
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Lucene Analysers (lib-lucene-analyzers)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Language Identification Parser/Filter (language-identifier)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)

!!!!!!!!!!!!!!!!!!!!!!!!!
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Georgian Analysis Plug-in (analysis-ge)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         German Analysis Plug-in (analysis-de)
!!!!!!!!!!!!!!!!!!!!!!!!!

2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic Summarizer Plug-in (summary-basic)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Site Query Filter (query-site)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Text Parse Plug-in (parse-text)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         More Query Filter (query-more)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         JavaScript Parser (parse-js)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         URL Query Filter (query-url)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
-----------------------
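
In case anyone wants to repeat the "are they registered?" check without reading the log by eye, here is a minimal Python sketch. It assumes each PluginRepository entry sits on one unwrapped line, as in the real hadoop.log, and that the plugin id is the parenthesised token at the end of the line. The sample text is copied from the excerpt above; the name registered_plugins is my own and not part of Nutch.

```python
import re

# Sample lines copied from the hadoop.log excerpt above (unwrapped).
LOG = """\
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Registered Plugins:
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Language Identification Parser/Filter (language-identifier)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Georgian Analysis Plug-in (analysis-ge)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         German Analysis Plug-in (analysis-de)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
"""

# Assumption: a plugin id is the parenthesised token that ends a
# PluginRepository line, e.g. "(analysis-de)".
PLUGIN_ID = re.compile(r"plugin\.PluginRepository - .*\(([\w-]+)\)\s*$")


def registered_plugins(log_text):
    """Return the set of plugin ids listed by PluginRepository."""
    ids = set()
    for line in log_text.splitlines():
        match = PLUGIN_ID.search(line)
        if match:
            ids.add(match.group(1))
    return ids


if __name__ == "__main__":
    found = registered_plugins(LOG)
    for wanted in ("analysis-de", "analysis-ge"):
        print(wanted, "registered" if wanted in found else "MISSING")
```

This only confirms registration. It says nothing about whether the analyzers are ever invoked, which is the actual problem.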

At the same time, the language identifier reports that it supports both languages:

-----------------------
2009-09-29 16:39:54,406 INFO  lang.LanguageIdentifier - Language identifier configuration [1-4/2048]
2009-09-29 16:39:54,609 INFO  lang.LanguageIdentifier - Language identifier plugin supports: it(1000) is(1000) hu(1000) th(1000) sv(1000) ge(1000) fr(1000) ru(1000) fi(1000) es(1000) en(1000) el(1000) ee(1000) pt(1000) de(1000) da(1000) pl(1000) no(1000) nl(1000)
-----------------------
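
The same kind of check can be scripted for the "supports" line. This is only a sketch: the line is copied from the excerpt above, the helper name supported_languages is hypothetical, and I make no claim about what the number in parentheses means beyond what the log prints.

```python
import re

# The "supports" line from the excerpt above, joined onto one line.
LINE = (
    "2009-09-29 16:39:54,609 INFO  lang.LanguageIdentifier - Language "
    "identifier plugin supports: it(1000) is(1000) hu(1000) th(1000) "
    "sv(1000) ge(1000) fr(1000) ru(1000) fi(1000) es(1000) en(1000) "
    "el(1000) ee(1000) pt(1000) de(1000) da(1000) pl(1000) no(1000) "
    "nl(1000)"
)


def supported_languages(line):
    """Map each language code to the number printed after it in parentheses."""
    # Only look at the text after "supports:" so the timestamp is ignored.
    _, _, tail = line.partition("supports:")
    return {code: int(n) for code, n in re.findall(r"(\w+)\((\d+)\)", tail)}


if __name__ == "__main__":
    langs = supported_languages(LINE)
    for code in ("ge", "de"):
        print(code, "supported" if code in langs else "NOT supported")
```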

The language identifier itself works like a charm when run on its own:
-----------------------
$ bin/nutch plugin language-identifier org.apache.nutch.analysis.lang.LanguageIdentifier -identifyurl http://212.72.133.54/l/test.html
text was identified as ge
-----------------------
$ bin/nutch plugin language-identifier org.apache.nutch.analysis.lang.LanguageIdentifier -identifyurl http://212.72.133.54/l/de.html
text was identified as de
-----------------------

What could have possibly gone wrong?

Respectfully,
დავით ჯაში
