On 8/10/07, cybercouf <[EMAIL PROTECTED]> wrote:
>
> sample of missing urls:
>
> http://000ps.mobi
> http://zzyzx.mobi
> http://0000000000.mobi
> http://1worldtv.mobi
> http://251escort.mobi

What happens when you only inject then generate these urls? Do they get lost?

Also, make sure that a normalized version of the same url does not
appear somewhere else in your list. For example, Nutch probably
normalizes these to "http://1worldtv.mobi/". If that form also shows up
elsewhere, Nutch will naturally keep only one copy.
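
One way to check for this before injecting is a small script along these
lines. It is only a rough sketch: normalize() is a simplified stand-in I
am assuming for illustration (lowercase the scheme and host, use "/" for
an empty path, drop the fragment) and is not Nutch's actual normalizer,
and find_collisions() and the sample list are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Rough stand-in for URL normalization (NOT Nutch's actual rules)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    path = parts.path or "/"  # an empty path becomes "/"
    # Keep the query string, drop any fragment.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def find_collisions(urls):
    """Map each normalized URL to the distinct raw URLs that collapse to it."""
    groups = defaultdict(set)
    for url in urls:
        url = url.strip()
        if url:  # skip blank lines
            groups[normalize(url)].add(url)
    # Only groups with more than one distinct raw form are collisions.
    return {norm: sorted(raw) for norm, raw in groups.items() if len(raw) > 1}


# Hypothetical sample; the real input would be the lines of the seed list.
sample = [
    "http://1worldtv.mobi",
    "http://1worldtv.mobi/",
    "http://zzyzx.mobi",
    "HTTP://ZZYZX.mobi/",
    "http://000ps.mobi",
]
collisions = find_collisions(sample)
for norm, raw in sorted(collisions.items()):
    print(norm, "<-", raw)
```

Any group it prints is a set of seed lines that would count as a single
url under this simplified normalization. Exact duplicate lines are not
reported because the raw forms are collected in a set, so those would
need a separate check (sort and uniq, for example).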

>
> all my urls are like 'http://[a-z0-9\-]+\.mobi'
>
> ./bin/nutch inject test0/crawldb listurls             [crawldb = 574769 urls]
> ./bin/nutch generate test0/crawldb test0/segments     [segment = 476723 urls]
> 17% missing !
>
> I tried with an input list a bit bigger:
> ./bin/nutch inject test1/crawldb listurls2            [crawldb = 575532 urls]
> ./bin/nutch generate test1/crawldb test1/segments     [segment = 480436 urls]
> 16.5% missing !
>
> and in the conf files, all generator properties are at their defaults:
> generate.max.per.host   -1
> generate.max.per.host.by.ip   false
>
> thanks for your help!
>
>

-- 
Doğacan Güney
