I'm using Nutch 0.8 on 6 servers, with mostly the default conf, and I've noticed
that the generate step is losing lots of my URLs!

I have a list of ~500,000 domains in a text file. After injecting it,
"readdb -stats" shows that all of the URLs are in the crawldb.
But after running "nutch generate crawldb segments", there are only ~400,000
URLs in the freshly generated segment! (I can see that with "readseg -list".)

I tried regenerating many times, and only once did I get all the URLs in the
segment (without any changes to the conf files).
There is nothing in the logs, except in
hadoop-nutch-namenode-node.log, where I see lots of: WARN
org.apache.hadoop.fs.FSNamesystem: Zero targets found, forbidden1.size=6
forbidden2.size()=0.
My Hadoop conf is:
mapred.map.tasks 12
mapred.reduce.tasks 12
dfs.replication 2
(on the 6 servers)
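For reference, here is roughly how those settings sit in my hadoop-site.xml (this is just a sketch of the standard Hadoop property format; the rest of the file is stock):

```xml
<!-- Excerpt from hadoop-site.xml: only the three properties changed from defaults -->
<configuration>
  <property>
    <name>mapred.map.tasks</name>
    <value>12</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>12</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```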

Any ideas?
-- 
View this message in context: 
http://www.nabble.com/generate-process%3A-20--missing-urls-%21-tf4241854.html#a12070072
Sent from the Nutch - User mailing list archive at Nabble.com.
