Sample of missing URLs:
http://000ps.mobi
http://zzyzx.mobi
http://0000000000.mobi
http://1worldtv.mobi
http://251escort.mobi

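A list of missing URLs like the one above can be produced by diffing the injected list against the URLs that ended up in the segment. The following is only a minimal sketch: it assumes both sets of URLs are available as plain text, one URL per line, and the function names `normalize` and `missing_urls` are hypothetical helpers, not part of Nutch. The trailing-slash and lowercase handling is also an assumption, made so that `http://000ps.mobi` and `http://000ps.mobi/` compare as equal.

```python
from typing import Iterable, List, Tuple


def normalize(url: str) -> str:
    """Assumed normalization: trim whitespace, drop a trailing slash, lowercase."""
    return url.strip().rstrip("/").lower()


def missing_urls(
    injected: Iterable[str], generated: Iterable[str]
) -> Tuple[List[str], float]:
    """Return the injected URLs absent from the generated set, plus the
    percentage of injected URLs that are missing."""
    injected_set = {normalize(u) for u in injected if u.strip()}
    generated_set = {normalize(u) for u in generated if u.strip()}
    missing = sorted(injected_set - generated_set)
    # Guard against an empty input list to avoid dividing by zero.
    pct = 100.0 * len(missing) / len(injected_set) if injected_set else 0.0
    return missing, pct
```

Both arguments can be open file objects, since iterating a text file yields its lines. With the counts reported in this thread (574769 injected, 476723 generated), the percentage would come out at about 17%.
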
All my URLs match 'http://[a-z0-9\-]+\.mobi'.

./bin/nutch inject test0/crawldb listurls
[crawldb = 574769 urls]
./bin/nutch generate test0/crawldb test0/segments
[segment = 476723 urls]
17% missing!

I tried with a slightly bigger input list:

./bin/nutch inject test1/crawldb listurls2
[crawldb = 575532 urls]
./bin/nutch generate test1/crawldb test1/segments
[segment = 480436 urls]
16.5% missing!

In the conf files, all the generator properties are at their defaults:

generate.max.per.host -1
generate.max.per.host.by.ip false

Thanks for your help!


Doğacan Güney-3 wrote:
>
> On 8/9/07, cybercouf <[EMAIL PROTECTED]> wrote:
>>
>> I'm using nutch 0.8, on 6 servers, and quite the default conf. And i
>> noticed that the generation process is losing lots of my urls!
>>
>> I've a list of ~500,000 domains, in a text file, after the injection,
>> using "readdb -stats" i can see that all the urls are in the crawldb.
>> But after doing "nutch generate crawldb segments", there is only ~400,000
>> urls in the fresh new segment! (I can see that with "readseg -list")
>>
>> I tried many times to re-generate, and only once I had all the urls in
>> the segment (without any modifications in the conf files).
>> And I don't have any stuff in the logs, except in the
>> hadoop-nutch-namenode-node.log where i have lots of WARN
>> org.apache.hadoop.fs.FSNamesystem: Zero targets found, forbidden1.size=6
>> forbidden2.size()=0.
>> My hadoop conf is:
>> mapred.map.tasks 12
>> mapred.reduce.tasks 12
>> dfs.replication 2
>> (and so using 6 servers)
>>
>> any ideas?
>
> Can you send a sample of the urls you are losing? A couple will be enough.
>
>> --
>> View this message in context:
>> http://www.nabble.com/generate-process%3A-20--missing-urls-%21-tf4241854.html#a12070072
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>
> --
> Doğacan Güney
>

--
View this message in context: http://www.nabble.com/generate-process%3A-20--missing-urls-%21-tf4241854.html#a12089604
Sent from the Nutch - User mailing list archive at Nabble.com.
