Doğacan Güney-3 wrote:
> 
> On 8/10/07, cybercouf <[EMAIL PROTECTED]> wrote:
>>
>>
>> > What happens when you only inject then generate these urls? Do they get
>> > lost?
>>
>> Good idea, so I injected and generated the 98048 missing urls:
>> injection    :  98048 urls
>> generation :  73930 urls
>> so 24.6% missing urls
>>
>> I also tried with a list of 1,000,000 clean .com domains, and about 17%
>> were missing after generation.
>>
>> > Also, make sure that a normalized version of the same url does not
>> > appear somewhere else. For example, nutch probably normalizes these to
>> > "http://1worldtv.mobi/", if you have this url somewhere else nutch
>> > will naturally only keep one copy.
>>
>> I double-checked that: all my urls are really unique and clean (no trailing
>> slash or anything like that).
> 
> Do you have a distributed setup? If so, nutch 0.8 had a bug (IIRC)
> that some urls may be lost due to clock skews between different
> machines. After inject, try to wait for, say, 1 hour before
> generating.
> 
> -- 
> Doğacan Güney
> 
> 

Yes, I have. I think I already tried to generate from an old crawldb, but
anyway, I'll double check that to be sure!
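
For the uniqueness check discussed above, here is a minimal sketch of how the seed list could be tested for urls that collapse to the same form once normalized. It is only an approximation: rough_normalize() is a hypothetical helper that lowercases the scheme and host, turns an empty path into "/", and drops default ports and fragments. It is not Nutch's actual normalizer, whose rules depend on the configured plugins. missing_fraction() simply redoes the arithmetic on the counts quoted above.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit

# Hypothetical helper: a rough stand-in for url normalization,
# NOT the rules Nutch itself applies.
def rough_normalize(url):
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # hostname excludes userinfo and port
    default_ports = {"http": 80, "https": 443}
    port = parts.port
    if port is None or default_ports.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"  # "http://host" and "http://host/" become equal
    return urlunsplit((scheme, netloc, path, parts.query, ""))

def find_collisions(urls):
    """Return {normalized_url: [original urls]} for groups with more than one member."""
    groups = defaultdict(set)
    for u in urls:
        groups[rough_normalize(u)].add(u.strip())
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}

def missing_fraction(injected, generated):
    """Fraction of injected urls that did not come out of generation."""
    return (injected - generated) / injected

if __name__ == "__main__":
    seeds = ["http://1worldtv.mobi", "http://1worldtv.mobi/", "http://example.com"]
    for norm, originals in find_collisions(seeds).items():
        print(norm, "<-", originals)
    print(f"{missing_fraction(98048, 73930):.1%} missing")
```

If find_collisions() returns nothing for the full seed list, duplicates under this simple normalization are ruled out, though a stricter normalizer could still merge urls that this sketch keeps apart.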
-- 
View this message in context: 
http://www.nabble.com/generate-process%3A-20--missing-urls-%21-tf4241854.html#a12091899
Sent from the Nutch - User mailing list archive at Nabble.com.
