This is my Hadoop log (just in case):

2007-07-25 13:19:57,040 INFO  crawl.Generator - Generator: starting
2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: segment: /home/semantix/nutch-0.9/crawl/segments/20070725131957
2007-07-25 13:19:57,041 INFO  crawl.Generator - Generator: filtering: true
2007-07-25 13:19:57,041 INFO  crawl.Generator - Generator: topN: 50000
2007-07-25 13:19:57,095 INFO  crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
2007-07-25 14:42:03,509 INFO  crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
2007-07-25 14:42:13,909 INFO  crawl.Generator - Generator: done.
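
To see where the time goes, the gaps between the timestamps in the log above can be measured. This is only a rough sketch: the phase labels are my own assumption (I am guessing that everything between the "jobtracker is 'local'" line and the "Partitioning" line is the select job), and ENTRIES, parse and phase_durations are names made up for this example, not anything from Nutch.

```python
from datetime import datetime

# Timestamps copied verbatim from the Generator log above. The phase labels
# are an assumed reading of the log, not something Nutch prints.
ENTRIES = [
    ("2007-07-25 13:19:57,095", "select (assumed)"),
    ("2007-07-25 14:42:03,509", "partition"),
    ("2007-07-25 14:42:13,909", "done"),
]

def parse(stamp):
    # log4j-style timestamp: a comma separates seconds from milliseconds.
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")

def phase_durations(entries):
    """Return (label, seconds) per phase, measured up to the next log line."""
    result = []
    for (start, label), (end, _) in zip(entries, entries[1:]):
        result.append((label, (parse(end) - parse(start)).total_seconds()))
    return result

if __name__ == "__main__":
    for label, seconds in phase_durations(ENTRIES):
        print(f"{label}: {seconds:.1f} s")
```

By that reading, about 82 minutes pass before the partitioning line is logged and only about 10 seconds after it, so the time seems to be spent before partitioning starts.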




Luca Rondanini



Emmanuel wrote:
The partition is slow.

Any idea?

On 7/25/07, Emmanuel <[EMAIL PROTECTED]> wrote:

I've got the same problem. It takes ages to generate a list of URLs to
fetch with a DB of 2M URLs.

Do you mean that it will be faster if we configure the generation by IP?


No, it should be faster if you partition by host :)

Which job is slow, select or partition?


I've read nutch-default.xml. It says that we have to be careful because
it can generate a lot of DNS requests to the host.
Does it mean that I have to configure a local DNS cache on my server?

> On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
>> Hi all,
>> The generate step of my crawl process is taking more than 2 hours... is it
>> normal?
>
> Are you partitioning urls by ip or by host?
>
>>
>> this is my stat report:
>>
>> CrawlDb statistics start: crawl/crawldb
>> Statistics for CrawlDb: crawl/crawldb
>> TOTAL urls:     586860
>> retry 0:        578159
>> retry 1:        1983
>> retry 2:        2017
>> retry 3:        4701
>> min score:      0.0
>> avg score:      0.0
>> max score:      1.0
>> status 1 (db_unfetched):        164849
>> status 2 (db_fetched):  417306
>> status 3 (db_gone):     4701
>> status 5 (db_redir_perm):       4
>> CrawlDb statistics: done
>>
>>
>>
>>
>>
>> Luca Rondanini
>>
>>
>
>
> --
> Doğacan Güney
>



--
Doğacan Güney

