This is my Hadoop log (just in case). Note that the select phase runs from
13:19:57 to 14:42:03 (about 82 minutes), while the partition step finishes in
about 10 seconds, so the time seems to go into selecting rather than
partitioning:

2007-07-25 13:19:57,040 INFO  crawl.Generator - Generator: starting
2007-07-25 13:19:57,041 INFO  crawl.Generator - Generator: segment: /home/semantix/nutch-0.9/crawl/segments/20070725131957
2007-07-25 13:19:57,041 INFO  crawl.Generator - Generator: filtering: true
2007-07-25 13:19:57,041 INFO  crawl.Generator - Generator: topN: 50000
2007-07-25 13:19:57,095 INFO  crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
2007-07-25 14:42:03,509 INFO  crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
2007-07-25 14:42:13,909 INFO  crawl.Generator - Generator: done.
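
For context: a generate run with these settings would usually be started with
something like the command below, either directly or as the generate step
inside bin/nutch crawl. The paths are taken from the segment path in the log,
and the -topN value is my reading of the "topN: 50000" line.

  bin/nutch generate crawl/crawldb crawl/segments -topN 50000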




Luca Rondanini



Emmanuel wrote:
> The partition job is slow.
> 
> Any idea ?
> 
>> On 7/25/07, Emmanuel <[EMAIL PROTECTED]> wrote:
>>
>>> I've got the same problem. It takes ages to generate a list of URLs to
>>> fetch with a DB of 2M URLs.
>>>
>>> Do you mean that it will be faster if we configure the generation by IP?
>>
>>
>> No, it should be faster if you partition by host :)
>>
>> Which job is slow, select or partition?
>>
>>>
>>> I've read nutch-default.xml. It says we have to be careful because it
>>> can generate a lot of DNS requests to the hosts.
>>> Does that mean I have to configure a local caching DNS on my server?
>>>
>>> > On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
>>> >> Hi all,
>>> >> The generate step of my crawl process is taking more than 2 hours...
>>> >> is that normal?
>>> >
>>> > Are you partitioning URLs by IP or by host?
>>> >
>>> >>
>>> >> this is my stat report:
>>> >>
>>> >> CrawlDb statistics start: crawl/crawldb
>>> >> Statistics for CrawlDb: crawl/crawldb
>>> >> TOTAL urls:     586860
>>> >> retry 0:        578159
>>> >> retry 1:        1983
>>> >> retry 2:        2017
>>> >> retry 3:        4701
>>> >> min score:      0.0
>>> >> avg score:      0.0
>>> >> max score:      1.0
>>> >> status 1 (db_unfetched):        164849
>>> >> status 2 (db_fetched):  417306
>>> >> status 3 (db_gone):     4701
>>> >> status 5 (db_redir_perm):       4
>>> >> CrawlDb statistics: done
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> Luca Rondanini
>>> >>
>>> >>
>>> >
>>> >
>>> > --
>>> > Doğacan Güney
>>> >
>>>
>>
>>
>> -- 
>> Doğacan Güney
>>
> 
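
Regarding the nutch-default.xml warning about DNS requests quoted above: in the
0.9-era default configuration that warning appears to belong to the
generate.max.per.host.by.ip property (that mapping is my assumption). Leaving
it at false makes the generator apply its per-host limit by host name, so no
DNS lookups happen and no local caching DNS should be needed; the partition
step itself goes by host either way, as the "Partitioning selected urls by
host" log line shows. A minimal conf/nutch-site.xml sketch, assuming those
property names:

  <?xml version="1.0"?>
  <configuration>
    <property>
      <!-- Assumed property name from the 0.9-era nutch-default.xml:
           false = apply the per-host limit by host name (no DNS lookups);
           true  = resolve hosts to IP addresses first, which is what the
                   DNS warning is about. -->
      <name>generate.max.per.host.by.ip</name>
      <value>false</value>
    </property>
    <property>
      <!-- Optional cap on URLs selected per host in one segment; -1 = no limit. -->
      <name>generate.max.per.host</name>
      <value>-1</value>
    </property>
  </configuration>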
