On 8/9/07, cybercouf <[EMAIL PROTECTED]> wrote:
>
> I'm using Nutch 0.8 on 6 servers, with mostly the default configuration, and I
> noticed that the generate step is losing a lot of my URLs!
>
> I have a list of ~500,000 domains in a text file. After injection, "readdb
> -stats" shows that all of the URLs are in the crawldb.
> But after running "nutch generate crawldb segments", there are only ~400,000
> URLs in the freshly created segment (I can see that with "readseg -list").
>
> I tried regenerating many times, and only once did I get all the URLs in the
> segment (without any changes to the conf files).
> There is nothing useful in the logs, except in
> hadoop-nutch-namenode-node.log, where I get lots of WARN
> org.apache.hadoop.fs.FSNamesystem: Zero targets found, forbidden1.size=6
> forbidden2.size()=0.
> My Hadoop conf is:
> mapred.map.tasks 12
> mapred.reduce.tasks 12
> dfs.replication 2
> (and we are running on 6 servers)
>
> Any ideas?

Can you send a sample of the URLs you are losing? A couple will be enough.
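
If it helps, a rough way to pull out such a sample is to dump both the crawldb
and the generated segment to text and compare the two URL lists. This is only a
sketch: the dump output file names and the grep pattern are guesses about how
"readdb -dump" and "readseg -dump" lay out their output in your version, and
the segment path is a placeholder, so adjust them to what you actually have.

  # Dump the crawldb and the freshly generated segment to plain text
  bin/nutch readdb crawldb -dump crawldb_dump
  bin/nutch readseg -dump segments/20070809123456 segment_dump

  # Pull the URLs out of each dump and sort them; the grep pattern and the
  # dump file names are assumptions about the dump layout
  grep -o 'http://[^ ]*' crawldb_dump/part-00000 | sort -u > crawldb_urls.txt
  grep -o 'http://[^ ]*' segment_dump/dump | sort -u > segment_urls.txt

  # URLs present in the crawldb but missing from the generated segment
  comm -23 crawldb_urls.txt segment_urls.txt | head

comm -23 prints the lines that appear only in the first file, so its output is
exactly the set of URLs that never made it into the segment.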

> --
> View this message in context: 
> http://www.nabble.com/generate-process%3A-20--missing-urls-%21-tf4241854.html#a12070072
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>


-- 
Doğacan Güney
