On 8/10/07, cybercouf <[EMAIL PROTECTED]> wrote:
>
> sample of missing urls:
>
> http://000ps.mobi
> http://zzyzx.mobi
> http://0000000000.mobi
> http://1worldtv.mobi
> http://251escort.mobi

What happens when you only inject and then generate these URLs? Do they
get lost? Also, make sure that a normalized version of the same URL does
not appear somewhere else in your list. For example, Nutch probably
normalizes these to "http://1worldtv.mobi/"; if you have that URL
somewhere else, Nutch will naturally keep only one copy.

>
> all my urls are like 'http://[a-z0-9\-]+\.mobi'
>
> ./bin/nutch inject test0/crawldb listurls [crawldb = 574769 urls]
> ./bin/nutch generate test0/crawldb test0/segments [segment = 476723 urls]
> 17% missing !
>
> I tried with a input list a bit bigger:
> ./bin/nutch inject test1/crawldb listurls2 [crawldb = 575532 urls]
> ./bin/nutch generate test1/crawldb test1/segments [segment = 480436 urls]
> 16.5% missing !
>
> and in the confiles, all properties for generator are the defaults one:
> generate.max.per.host -1
> generate.max.per.host.by.ip false
>
> thanks for your help!
>

--
Doğacan Güney
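
To check the normalization point above outside of Nutch, a rough Python sketch along these lines could list input URLs that collapse to the same normalized form. The `normalize` function is only an assumed approximation (lowercase scheme and host, "/" for an empty path), not Nutch's actual normalizer, and the names `normalize` and `find_collisions` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Approximate URL normalization (an assumption, NOT Nutch's own code):
    lowercase the scheme and host, use "/" for an empty path, drop fragments."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def find_collisions(urls):
    """Return {normalized_url: [raw urls]} for every normalized form
    that is reached by more than one distinct raw URL."""
    groups = defaultdict(set)
    for url in urls:
        url = url.strip()
        if url:  # skip blank lines
            groups[normalize(url)].add(url)
    return {norm: sorted(raw) for norm, raw in groups.items() if len(raw) > 1}


if __name__ == "__main__":
    # Illustrative input only; in practice, read the lines of the seed list.
    sample = [
        "http://1worldtv.mobi",
        "http://1worldtv.mobi/",
        "http://zzyzx.mobi",
    ]
    for norm, raw in find_collisions(sample).items():
        print(norm, "<-", raw)
```

If the seed list produces many such collisions, that would account for part of the gap between the injected list and the crawldb, though it would not by itself explain a difference between the crawldb count and the generated segment.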
