On 8/10/07, cybercouf <[EMAIL PROTECTED]> wrote:
>
> > What happens when you only inject then generate these urls? Do they get
> > lost?
>
> Good idea, so I injected and generated the 98048 missing urls:
> injection : 98048 urls
> generation : 73930 urls
> so 24.6% missing urls
>
> I also tried with a clean list of 1,000,000 clean .com domains, and I had
> like 17% missing after generation.
>
> > Also, make sure that a normalized version of the same url does not
> > appear somewhere else. For example, nutch probably normalizes these to
> > "http://1worldtv.mobi/", if you have this url somewhere else nutch
> > will naturally only keep one copy.
>
> I double checked that, all my urls are really unique and clean (no ending
> slash or stuff like this).
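
If you want to re-check uniqueness after normalization rather than on the raw strings, something like the sketch below could help. It is only a rough approximation: `rough_normalize`, `find_collisions` and `missing_fraction` are hypothetical helper names, and the rules (lowercase scheme and host, drop the default port, add a trailing "/" to an empty path, drop the fragment) are my assumption about what a basic normalizer does, not nutch's actual normalizer code.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def rough_normalize(url):
    """Approximate a basic URL normalizer (assumed rules, NOT nutch's own)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port is not None and parts.port != default_port:
        host = f"{host}:{parts.port}"
    # An empty path becomes "/", so "http://a.b" and "http://a.b/" collide.
    path = parts.path or "/"
    # The fragment is dropped; the query string is kept as-is.
    return urlunsplit((scheme, host, path, parts.query, ""))


def find_collisions(urls):
    """Map each normalized url to the distinct raw urls that collapse into it."""
    groups = defaultdict(set)
    for url in urls:
        groups[rough_normalize(url)].add(url.strip())
    return {norm: sorted(raw) for norm, raw in groups.items() if len(raw) > 1}


def missing_fraction(injected, generated):
    """Fraction of injected urls that did not come out of generate."""
    return (injected - generated) / injected
```

Feeding it the lines of your seed list (for example `find_collisions(open("urls.txt"))`, where `urls.txt` is a placeholder name) returns an empty dict if nothing collapses under these assumed rules. `missing_fraction(98048, 73930)` just reproduces the 24.6% figure above.
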

Do you have a distributed setup? If so, nutch 0.8 had a bug (IIRC) where
some urls could be lost because of clock skew between different machines.
After inject, try waiting for, say, 1 hour before generating.

>
> --
> View this message in context:
> http://www.nabble.com/generate-process%3A-20--missing-urls-%21-tf4241854.html#a12091070
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>

--
Doğacan Güney
