On 8/10/07, cybercouf <[EMAIL PROTECTED]> wrote:
>
>
> > What happens when you only inject then generate these urls? Do they get
> > lost?
>
> Good idea, so I injected and generated the 98048 missing urls:
> injection    :  98048 urls
> generation :  73930 urls
> so 24.6% missing urls
>
> I also tried with a list of 1,000,000 clean .com domains, and about 17%
> were missing after generation.
>
> > Also, make sure that a normalized version of the same url does not
> > appear somewhere else. For example, nutch probably normalizes these to
> > "http://1worldtv.mobi/", if you have this url somewhere else nutch
> > will naturally only keep one copy.
>
> I double-checked that: all my urls are really unique and clean (no trailing
> slash or anything like that).
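
One way to double check this outside of nutch is to group the seed urls by
a normalized form and look for groups with more than one member. A minimal
sketch, assuming a simplified normalization (lowercase scheme and host,
default port dropped, empty path turned into "/", fragment dropped). This
is NOT nutch's actual normalizer, and the names normalize and
find_collisions are made up for illustration:

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit

# Default ports that this simplified normalization drops.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize(url):
    """Rough stand-in for url normalization (an assumption, not nutch's rules)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is explicit and not the scheme default.
    if port is not None and DEFAULT_PORTS.get(scheme) != port:
        netloc = f"{host}:{port}"
    else:
        netloc = host
    # An empty path and "/" refer to the same resource.
    path = parts.path or "/"
    # Drop the fragment; keep the query string untouched.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def find_collisions(urls):
    """Return {normalized_url: [original urls]} for groups of two or more."""
    groups = defaultdict(set)
    for url in urls:
        if url.strip():
            groups[normalize(url)].add(url.strip())
    return {key: sorted(members) for key, members in groups.items()
            if len(members) > 1}


if __name__ == "__main__":
    seeds = ["http://1worldtv.mobi", "http://1worldtv.mobi/", "http://example.com/a"]
    for normalized, originals in find_collisions(seeds).items():
        print(normalized, "<-", originals)
```

If this reports no collisions on your seed list, duplicates of this simple
kind are probably not the cause, although nutch's own normalizers and url
filters may still behave differently.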

Do you have a distributed setup? If so, nutch 0.8 had a bug (IIRC)
where some urls could be lost because of clock skew between
machines. After injecting, try waiting, say, 1 hour before
generating.
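
To see whether waiting changes anything, it helps to know which urls go
missing and not only how many. A rough sketch, assuming you can get both
the injected urls and the generated urls as plain lists of strings (how you
extract the generated list from the segment is up to you and is not shown
here). The function name missing_after_generate is hypothetical:

```python
def missing_after_generate(injected, generated):
    """Return (sorted missing urls, missing fraction of the injected urls).

    Both arguments are iterables of url strings. Blank lines are ignored
    and duplicates are collapsed before comparing.
    """
    injected_set = {u.strip() for u in injected if u.strip()}
    generated_set = {u.strip() for u in generated if u.strip()}
    missing = injected_set - generated_set
    rate = len(missing) / len(injected_set) if injected_set else 0.0
    return sorted(missing), rate


if __name__ == "__main__":
    injected = ["http://a.com/", "http://b.com/", "http://c.com/", "http://d.com/"]
    generated = ["http://a.com/", "http://c.com/", "http://d.com/"]
    missing, rate = missing_after_generate(injected, generated)
    print(f"{len(missing)} missing ({rate:.1%}):", missing)
```

Note that this compares the strings exactly, so if the generated urls come
back in normalized form (for example with a trailing slash added), the two
lists need to be brought to the same form first.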

>
> --
> View this message in context: 
> http://www.nabble.com/generate-process%3A-20--missing-urls-%21-tf4241854.html#a12091070
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>


-- 
Doğacan Güney
