> What happens when you only inject then generate these urls? Do they get
> lost?

Good idea. I injected and then generated only the 98048 missing urls:
injection  : 98048 urls
generation : 73930 urls
So 24118 urls (24.6%) are still missing after generation.
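
For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The two counts are the figures quoted above; the variable names are only illustrative.

```python
# Counts reported in this message.
injected = 98048   # urls passed to inject
generated = 73930  # urls that came out of generate

# urls that were injected but never showed up after generation.
missing = injected - generated
missing_pct = 100.0 * missing / injected

print(f"{missing} of {injected} urls missing ({missing_pct:.1f}%)")
```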

I also tried with a list of 1,000,000 clean .com domains, and about 17%
were missing after generation.

> Also, make sure that a normalized version of the same url does not
> appear somewhere else. For example, nutch probably normalizes these to
> "http://1worldtv.mobi/", if you have this url somewhere else nutch
> will naturally only keep one copy.

I double-checked that: all my urls are unique and clean (no trailing
slash or anything like that).
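
A check along these lines could look like the sketch below. Note that `rough_normalize` is a hypothetical stand-in and is NOT Nutch's actual normalizer: it only lowercases the scheme and host, turns an empty path into "/", and drops the fragment, so it can only approximate what Nutch does. The sample list is made up for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def rough_normalize(url):
    """Approximate normalization (an assumption, not Nutch's real rules)."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"  # "http://host" and "http://host/" become equal
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def find_collisions(urls):
    """Return normalized urls that more than one distinct input maps to."""
    groups = defaultdict(set)
    for url in urls:
        groups[rough_normalize(url)].add(url.strip())
    return {norm: sorted(raw) for norm, raw in groups.items() if len(raw) > 1}


# Made-up sample; in practice the urls would be read from the seed file.
sample = ["http://1worldtv.mobi", "http://1worldtv.mobi/", "http://Example.com"]
collisions = find_collisions(sample)
print(collisions)
```

An empty result on the real seed list would suggest that collapsed duplicates do not explain the missing urls, at least under this approximate normalization.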

-- 
View this message in context: 
http://www.nabble.com/generate-process%3A-20--missing-urls-%21-tf4241854.html#a12091070
Sent from the Nutch - User mailing list archive at Nabble.com.