On 8/10/07, cybercouf <[EMAIL PROTECTED]> wrote:
>
> Doğacan Güney-3 wrote:
> >
> > On 8/10/07, cybercouf <[EMAIL PROTECTED]> wrote:
> >>
> >> > What happens when you only inject then generate these urls? Do they get
> >> > lost?
> >>
> >> Good idea, so I injected and generated the 98048 missing urls:
> >> injection : 98048 urls
> >> generation : 73930 urls
> >> so 24.6% missing urls
> >>
> >> I also tried with a clean list of 1,000,000 clean .com domains, and I had
> >> like 17% missing after generation.
> >>
> >> > Also, make sure that a normalized version of the same url does not
> >> > appear somewhere else. For example, nutch probably normalizes these to
> >> > "http://1worldtv.mobi/", if you have this url somewhere else nutch
> >> > will naturally only keep one copy.
> >>
> >> I double checked that, all my urls are really unique and clean (no ending
> >> slash or stuff like this).
> >
> > Do you have a distributed setup? If so, nutch 0.8 had a bug (IIRC)
> > that some urls may be lost due to clock skews between different
> > machines. After inject, try to wait for, say, 1 hour before
> > generating.
> >
> > --
> > Doğacan Güney
> >
>
> Yes I have, but I think I already tried to generate from a old crawldb, but
> anyway, i'll double check that to be sure!

The "1-hour" part depends on how large the clock skew (the time difference)
between your machines is. You could run something like ntpdate on each
machine to synchronize their clocks and keep the skew to a minimum.

> --
> View this message in context:
> http://www.nabble.com/generate-process%3A-20--missing-urls-%21-tf4241854.html#a12091899
> Sent from the Nutch - User mailing list archive at Nabble.com.
>

--
Doğacan Güney
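
To make the clock-skew point concrete, here is a minimal Python sketch of how one might size the wait between inject and generate. It is only an illustration, not anything Nutch provides: the host names, the timestamp readings (imagined as `date +%s` output collected from every node at about the same moment), the helper names, and the 60-second safety margin are all assumptions.

```python
def max_clock_skew(epoch_by_host):
    """Largest difference, in seconds, between any two hosts' clocks."""
    times = list(epoch_by_host.values())
    return max(times) - min(times)


def suggested_wait(skew_seconds, margin_seconds=60):
    """Wait at least as long as the worst skew, plus an assumed margin."""
    return skew_seconds + margin_seconds


# Hypothetical readings: one Unix timestamp per node, taken simultaneously.
readings = {
    "node1": 1186750000,
    "node2": 1186750042,
    "node3": 1186749130,
}

skew = max_clock_skew(readings)
wait = suggested_wait(skew)
print(f"max skew: {skew} s, wait about {wait} s after inject")
```

If the skew comes out at only a few seconds after synchronizing the clocks, a long wait is probably unnecessary.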
