Here is a sample of the missing URLs:

http://000ps.mobi
http://zzyzx.mobi
http://0000000000.mobi
http://1worldtv.mobi
http://251escort.mobi
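
For anyone who wants to pull out such a sample themselves, here is a rough Python sketch of the comparison. It assumes you already have the injected URLs and the generated URLs as two plain lists of strings; the helper names (normalize, missing_urls) are made up for illustration, and the trailing-slash handling is only a guess at how the dumped URLs might differ from the input.

```python
def normalize(url):
    # Assumption: a dumped URL differs from the input line only by
    # surrounding whitespace or a trailing slash.
    return url.strip().rstrip("/")


def missing_urls(injected, generated):
    """Return the injected URLs that do not appear among the generated ones."""
    generated_set = {normalize(u) for u in generated}
    return [
        u.strip()
        for u in injected
        if u.strip() and normalize(u) not in generated_set
    ]


if __name__ == "__main__":
    # Tiny made-up example; the real lists would be read from files.
    injected = ["http://000ps.mobi\n", "http://kept.mobi\n"]
    generated = ["http://kept.mobi/"]
    for url in missing_urls(injected, generated):
        print(url)
```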

All my URLs match the pattern 'http://[a-z0-9\-]+\.mobi'.
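
A quick way to double-check that claim on the input file is sketched below in Python. It is only a sanity check under my own assumptions: the function name audit is hypothetical, and it reports lines that do not fully match the pattern as well as exact duplicates, either of which could make a raw line count misleading.

```python
import re
from collections import Counter

# The pattern quoted above, applied to the whole line.
URL_PATTERN = re.compile(r"http://[a-z0-9\-]+\.mobi")


def audit(lines):
    """Return (non_matching, duplicated) URLs from an iterable of lines."""
    urls = [line.strip() for line in lines if line.strip()]
    non_matching = [u for u in urls if not URL_PATTERN.fullmatch(u)]
    duplicated = [u for u, n in Counter(urls).items() if n > 1]
    return non_matching, duplicated


if __name__ == "__main__":
    sample = [
        "http://000ps.mobi\n",
        "http://zzyzx.mobi\n",
        "http://Bad_Host.mobi\n",  # upper case and underscore: no match
        "http://000ps.mobi\n",     # exact duplicate
    ]
    bad, dupes = audit(sample)
    print("non-matching:", bad)
    print("duplicated:", dupes)
```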

./bin/nutch inject test0/crawldb listurls              [crawldb = 574769 urls]
./bin/nutch generate test0/crawldb test0/segments      [segment = 476723 urls]
17% missing!

I also tried with a slightly bigger input list:
./bin/nutch inject test1/crawldb listurls2             [crawldb = 575532 urls]
./bin/nutch generate test1/crawldb test1/segments      [segment = 480436 urls]
16.5% missing!
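
For reference, the percentages can be reproduced from the counts above with a few lines of Python. This is just the arithmetic; the function name missing_percent is my own.

```python
def missing_percent(crawldb_urls, segment_urls):
    """Percentage of crawldb URLs that did not make it into the segment."""
    if crawldb_urls <= 0:
        raise ValueError("crawldb_urls must be positive")
    return 100.0 * (crawldb_urls - segment_urls) / crawldb_urls


if __name__ == "__main__":
    # Counts taken from the two runs reported in this message.
    runs = [("test0", 574769, 476723), ("test1", 575532, 480436)]
    for name, crawldb, segment in runs:
        lost = crawldb - segment
        print(f"{name}: {lost} urls lost, "
              f"{missing_percent(crawldb, segment):.1f}% missing")
```

So roughly 98,000 and 95,000 URLs are dropped in the two runs respectively.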

In the conf files, all the generator properties are at their defaults:
generate.max.per.host   -1
generate.max.per.host.by.ip   false

Thanks for your help!


Doğacan Güney-3 wrote:
> 
> On 8/9/07, cybercouf <[EMAIL PROTECTED]> wrote:
>>
>> I'm using Nutch 0.8 on 6 servers, with a mostly default configuration, and I
>> noticed that the generation process is losing lots of my URLs!
>>
>> I have a list of ~500,000 domains in a text file. After the injection, using
>> "readdb -stats", I can see that all the URLs are in the crawldb.
>> But after running "nutch generate crawldb segments", there are only ~400,000
>> URLs in the fresh new segment! (I can see that with "readseg -list".)
>>
>> I tried re-generating many times, and only once did I get all the URLs in
>> the segment (without any modification to the conf files).
>> There is nothing in the logs, except in
>> hadoop-nutch-namenode-node.log, where I have lots of WARN
>> org.apache.hadoop.fs.FSNamesystem: Zero targets found, forbidden1.size=6
>> forbidden2.size()=0.
>> My Hadoop conf is:
>> mapred.map.tasks 12
>> mapred.reduce.tasks 12
>> dfs.replication 2
>> (and so using 6 servers)
>>
>> Any ideas?
> 
> Can you send a sample of the urls you are losing? A couple will be enough.
> 
>> --
>> View this message in context:
>> http://www.nabble.com/generate-process%3A-20--missing-urls-%21-tf4241854.html#a12070072
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
> 
> 
> -- 
> Doğacan Güney
> 
> 

