On 8/9/07, cybercouf <[EMAIL PROTECTED]> wrote:
> I'm using nutch 0.8 on 6 servers, with mostly the default conf, and I noticed
> that the generation process is losing lots of my urls!
>
> I have a list of ~500,000 domains in a text file; after the injection, using
> "readdb -stats" I can see that all the urls are in the crawldb.
> But after doing "nutch generate crawldb segments", there are only ~400,000
> urls in the fresh new segment! (I can see that with "readseg -list".)
>
> I tried re-generating many times, and only once did I get all the urls into
> the segment (without any modifications to the conf files).
> And I don't see anything in the logs, except in
> hadoop-nutch-namenode-node.log, where I get lots of:
> WARN org.apache.hadoop.fs.FSNamesystem: Zero targets found,
> forbidden1.size=6 forbidden2.size()=0
>
> My hadoop conf is:
> mapred.map.tasks 12
> mapred.reduce.tasks 12
> dfs.replication 2
> (and so using the 6 servers)
>
> any ideas?
Can you send a sample of the urls you are losing? A couple will be enough.

--
Doğacan Güney
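
For reference, a minimal shell sketch of how the two counts can be compared, assuming a Nutch 0.8 install run from its home directory; the crawldb/segments paths and the way the newest segment is picked are placeholders, not values from the thread.

  # Assumes the current directory is the Nutch 0.8 install dir;
  # adjust the paths below to your own crawl layout.
  CRAWLDB=crawl/crawldb
  SEGMENTS=crawl/segments

  # Total number of urls known to the crawldb (the total url count in the -stats output).
  bin/nutch readdb $CRAWLDB -stats

  # Generate a fresh fetchlist, then pick the newest (timestamp-named) segment dir.
  bin/nutch generate $CRAWLDB $SEGMENTS
  SEGMENT=$SEGMENTS/$(ls $SEGMENTS | sort | tail -1)

  # Show what actually went into the new segment and compare it against the crawldb total.
  bin/nutch readseg -list $SEGMENT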
