On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
this is my Hadoop log (just in case):

2007-07-25 13:19:57,040 INFO  crawl.Generator - Generator: starting
2007-07-25 13:19:57,041 INFO  crawl.Generator - Generator: segment:
/home/semantix/nutch-0.9/crawl/segments/20070725131957
2007-07-25 13:19:57,041 INFO  crawl.Generator - Generator: filtering: true
2007-07-25 13:19:57,041 INFO  crawl.Generator - Generator: topN: 50000
2007-07-25 13:19:57,095 INFO  crawl.Generator - Generator: jobtracker is
'local', generating exactly one partition.
2007-07-25 14:42:03,509 INFO  crawl.Generator - Generator: Partitioning
selected urls by host, for politeness.
2007-07-25 14:42:13,909 INFO  crawl.Generator - Generator: done.


Wow, partitioning 50000 URLs takes 1.5 hours! Looking at the code I
don't see any obvious mistakes, but perhaps I am missing something.

Could you lower Hadoop's log level to INFO (change it from WARN to
INFO in log4j.properties)? IIRC, at INFO level Hadoop shows how many
maps and reduces have completed, so we may see what is taking so long.
If you can attach a profiler and show us which method is taking all
the time, that would make things *a lot* easier.
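For reference, a minimal sketch of the suggested log4j change (this assumes the stock conf/log4j.properties shipped with Nutch 0.9; the exact logger line and appender name may differ in your copy):

```properties
# conf/log4j.properties
# Before (Hadoop internals are mostly silent):
#   log4j.logger.org.apache.hadoop=WARN
# After -- Hadoop logs map/reduce progress so you can see
# which phase of the generate job is eating the time:
log4j.logger.org.apache.hadoop=INFO
```

If attaching a full profiler is inconvenient, a few `jstack <pid>` samples (use `jps` to find the JVM's pid) taken while the job runs will usually point at the hot method as well.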





Luca Rondanini



Emmanuel wrote:
> The partition is slow.
>
> Any idea ?
>
>> On 7/25/07, Emmanuel <[EMAIL PROTECTED]> wrote:
>>
>>> I've got the same problem. It takes ages to generate a list of URLs to
>>> fetch with a DB of 2M URLs.
>>>
>>> Do you mean that it will be faster if we configure the generation by
>>> IP?
>>
>>
>> No, it should be faster if you partition by host :)
>>
>> Which job is slow, select or partition?
>>
>>>
>>> I've read nutch-default.xml. It said that we have to be careful
>>> because it can generate a lot of DNS requests to the host.
>>> Does it mean that i have to configure a local cache DNS on my server ?
>>>
>>> > On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
>>> >> Hi all,
>>> >> The generate step of my crawl process is taking more than 2 hours.... is
>>> >> it normal?
>>> >
>>> > Are you partitioning urls by ip or by host?
>>> >
>>> >>
>>> >> this is my stat report:
>>> >>
>>> >> CrawlDb statistics start: crawl/crawldb
>>> >> Statistics for CrawlDb: crawl/crawldb
>>> >> TOTAL urls:     586860
>>> >> retry 0:        578159
>>> >> retry 1:        1983
>>> >> retry 2:        2017
>>> >> retry 3:        4701
>>> >> min score:      0.0
>>> >> avg score:      0.0
>>> >> max score:      1.0
>>> >> status 1 (db_unfetched):        164849
>>> >> status 2 (db_fetched):  417306
>>> >> status 3 (db_gone):     4701
>>> >> status 5 (db_redir_perm):       4
>>> >> CrawlDb statistics: done
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> Luca Rondanini
>>> >>
>>> >>
>>> >
>>> >
>>> > --
>>> > Doğacan Güney
>>> >
>>>
>>
>>
>> --
>> Doğacan Güney
>>
>



--
Doğacan Güney
