OK, I'm trying to use the SolrIndexer with Nutch 1.0 and nothing seems to be sent to Solr.
I've put some more debug logging into the SolrIndexer and SolrWriter classes. It seems that although the SolrWriter class is told to open() and close(), it is never told to write() anything in between. Why would that be? Surely Nutch should be sending everything to Solr? Is there some other kind of filtering going on? How could I find out?

Hadoop is taking ages to do the "map", and then the reduce finishes quite quickly and results in nothing...

Here is the previous email on the subject in case your emailer hasn't tied the two together.

Alex

2009/8/11 Alex McLintock <alex.mclint...@gmail.com>:
> Further information on this....
>
> I'm running on a single machine in fake clustering mode.
>
> A tmp directory gets created, with nothing but another empty directory
> inside of it.
>
> The hadoop log file just says the same thing over and over every 30
> seconds....
>
> 2009-08-11 20:20:57,803 INFO plugin.PluginRepository - Plugins: looking in: /local/apps/software/nutch/plugins
> 2009-08-11 20:20:58,158 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
> 2009-08-11 20:20:58,159 INFO plugin.PluginRepository - Registered Plugins:
> 2009-08-11 20:20:58,159 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
> 2009-08-11 20:20:58,159 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
> 2009-08-11 20:20:58,159 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
> 2009-08-11 20:20:58,159 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
> 2009-08-11 20:20:58,159 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
> 2009-08-11 20:20:58,160 INFO plugin.PluginRepository - Site Query Filter (query-site)
> 2009-08-11 20:20:58,160 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
> 2009-08-11 20:20:58,160 INFO plugin.PluginRepository - HTTP Framework (lib-http)
> 2009-08-11 20:20:58,160 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
> 2009-08-11 20:20:58,160 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
> 2009-08-11 20:20:58,160 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
> 2009-08-11 20:20:58,160 INFO plugin.PluginRepository - XML Response Writer Plug-in (response-xml)
> 2009-08-11 20:20:58,160 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
> 2009-08-11 20:20:58,160 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
> 2009-08-11 20:20:58,160 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
> 2009-08-11 20:20:58,161 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
> 2009-08-11 20:20:58,161 INFO plugin.PluginRepository - URL Query Filter (query-url)
> 2009-08-11 20:20:58,161 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
> 2009-08-11 20:20:58,161 INFO plugin.PluginRepository - JSON Response Writer Plug-in (response-json)
> 2009-08-11 20:20:58,161 INFO plugin.PluginRepository - Registered Extension-Points:
> 2009-08-11 20:20:58,161 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
> 2009-08-11 20:20:58,161 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
> 2009-08-11 20:20:58,161 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
> 2009-08-11 20:20:58,162 INFO plugin.PluginRepository - Nutch Field Filter (org.apache.nutch.indexer.field.FieldFilter)
> 2009-08-11 20:20:58,162 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> 2009-08-11 20:20:58,162 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
> 2009-08-11 20:20:58,162 INFO plugin.PluginRepository - Nutch Search Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
> 2009-08-11 20:20:58,162 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> 2009-08-11 20:20:58,162 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
> 2009-08-11 20:20:58,162 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
> 2009-08-11 20:20:58,162 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> 2009-08-11 20:20:58,162 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
> 2009-08-11 20:20:58,163 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> 2009-08-11 20:20:58,163 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
> 2009-08-11 20:20:58,171 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2009-08-11 20:20:58,202 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
>
> Is Solr output a plugin, and is it not set up above?
>
> 2009/8/11 Alex McLintock <alex.mclint...@gmail.com>:
>> I'm trying to send my Nutch crawl to Solr. I've "generated, fetched,
>> updated" several times. I've done an invertlinks.
>> But when I try to do the solrindex it just sits there for ages and
>> doesn't seem to stress the Solr server at all.
>>
>> I'm using Nutch 1.0, Sun Java 1.6, Ubuntu Linux 9.04.
>>
>> /local/apps/software/nutch$ bin/nutch solrindex http://rio23:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
>>
>> Is there some kind of "verbose" option so that I can better see what
>> it is doing? I could maybe insert some extra debugging, or do I need to
>> run this in Eclipse?
>>
>> The Java process seems to be using up most of a core's CPU time so it
>> seems to be doing *something*.
>>
>> This is my first Solr project so I have proved that it is up and
>> running, but haven't actually added any data to it yet...
>>
>> Alex
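
One way to confirm from the Solr side whether anything arrived is to ask the server how many documents it holds. The following is only a minimal sketch under assumptions, not part of Nutch or of the thread: the class name SolrDocCount and its helper methods are hypothetical, it assumes the standard /select handler is enabled at the Solr URL used in the solrindex command above, and it uses java.net.http.HttpClient, which needs Java 11 or later (newer than the Java 1.6 mentioned above).

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: asks Solr for its total document count.
class SolrDocCount {
    // Solr's XML response carries the hit count as an attribute,
    // e.g. <result name="response" numFound="0" start="0"/>
    private static final Pattern NUM_FOUND = Pattern.compile("numFound=\"(\\d+)\"");

    /** Builds a match-all query that returns zero rows, so only the count comes back. */
    static String countUrl(String solrBase) {
        String base = solrBase.endsWith("/") ? solrBase : solrBase + "/";
        return base + "select?q=*:*&rows=0&wt=xml";
    }

    /** Extracts numFound from the response body, or -1 if the attribute is absent. */
    static long parseNumFound(String body) {
        Matcher m = NUM_FOUND.matcher(body);
        return m.find() ? Long.parseLong(m.group(1)) : -1L;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Defaults to the Solr URL from the solrindex command; pass another as args[0].
        String solrBase = args.length > 0 ? args[0] : "http://rio23:8983/solr/";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(countUrl(solrBase))).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP " + response.statusCode()
                + ", numFound=" + parseNumFound(response.body()));
    }
}
```

If numFound is still 0 after the solrindex job has finished, that would be consistent with write() never being called; a non-zero count would point to the documents arriving after all.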