Hello, all. I've got a bit of trouble with Nutch 1.0 and multi-language support:
I have a fresh install of Nutch and two analysis plugins I'd like to turn on: analysis-de (German) and analysis-ge (Georgian). Here are the contents of my seed file:
-----------------------
http://212.72.133.54/l/test.html
http://212.72.133.54/l/de.html
-----------------------
The first is Georgian, the other German. When I run

bin/nutch crawl seed -dir crawl -threads 10 -depth 2

there is not the slightest sign of anything calling the analysis plugins, even though hadoop.log clearly states that they are on and active:
-----------------------
2009-09-29 16:39:13,328 INFO crawl.Crawl - crawl started in: crawl
2009-09-29 16:39:13,328 INFO crawl.Crawl - rootUrlDir = seed
2009-09-29 16:39:13,328 INFO crawl.Crawl - threads = 10
2009-09-29 16:39:13,328 INFO crawl.Crawl - depth = 2
2009-09-29 16:39:13,375 INFO crawl.Injector - Injector: starting
2009-09-29 16:39:13,375 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
2009-09-29 16:39:13,375 INFO crawl.Injector - Injector: urlDir: seed
2009-09-29 16:39:13,390 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2009-09-29 16:39:13,421 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-09-29 16:39:15,546 INFO plugin.PluginRepository - Plugins: looking in: C:\cygwin\opt\nutch\plugins
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Registered Plugins:
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Lucene Analysers (lib-lucene-analyzers)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Language Identification Parser/Filter (language-identifier)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
!!!!!!!!!!!!!!!!!!!!!!!!!
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Georgian Analysis Plug-in (analysis-ge)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - German Analysis Plug-in (analysis-de)
!!!!!!!!!!!!!!!!!!!!!!!!!
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Site Query Filter (query-site)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - More Query Filter (query-more)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - URL Query Filter (query-url)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
-----------------------
At the same time:
-----------------------
2009-09-29 16:39:54,406 INFO lang.LanguageIdentifier - Language identifier configuration [1-4/2048]
2009-09-29 16:39:54,609 INFO lang.LanguageIdentifier - Language identifier plugin supports: it(1000) is(1000) hu(1000) th(1000) sv(1000) ge(1000) fr(1000) ru(1000) fi(1000) es(1000) en(1000) el(1000) ee(1000) pt(1000) de(1000) da(1000) pl(1000) no(1000) nl(1000)
-----------------------
The language identifier itself works like a charm:
-----------------------
$ bin/nutch plugin language-identifier org.apache.nutch.analysis.lang.LanguageIdentifier -identifyurl http://212.72.133.54/l/test.html
text was identified as ge
-----------------------
$ bin/nutch plugin language-identifier org.apache.nutch.analysis.lang.LanguageIdentifier -identifyurl http://212.72.133.54/l/de.html
text was identified as de
-----------------------
What could possibly have gone wrong?

Respectfully,
დავით ჯაში