I'm using Nutch 0.8 on 6 servers with mostly the default configuration, and I noticed that the generate step is losing a lot of my URLs!
I have a list of ~500,000 domains in a text file. After injection, "readdb -stats" shows that all of the URLs are in the crawldb. But after running "nutch generate crawldb segments", there are only ~400,000 URLs in the freshly created segment (I can see that with "readseg -list"). I tried regenerating many times, and only once did I get all of the URLs into the segment, without any modifications to the conf files.

There is nothing in the logs, except in hadoop-nutch-namenode-node.log, where I see lots of:

  WARN org.apache.hadoop.fs.FSNamesystem: Zero targets found, forbidden1.size=6 forbidden2.size()=0

My Hadoop conf is:

  mapred.map.tasks    12
  mapred.reduce.tasks 12
  dfs.replication     2

(spread across the 6 servers). Any ideas?

--
View this message in context: http://www.nabble.com/generate-process%3A-20--missing-urls-%21-tf4241854.html#a12070072
Sent from the Nutch - User mailing list archive at Nabble.com.
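For reference, the three Hadoop properties mentioned above would typically be set in hadoop-site.xml using the standard property format of that era; this is just a sketch of what such a file might look like, not a copy of my actual configuration:

```xml
<?xml version="1.0"?>
<!-- hadoop-site.xml: site-specific overrides of hadoop-default.xml -->
<configuration>
  <property>
    <name>mapred.map.tasks</name>
    <value>12</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>12</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```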
