This is my hadoop log (just in case):

2007-07-25 13:19:57,040 INFO crawl.Generator - Generator: starting
2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: segment: /home/semantix/nutch-0.9/crawl/segments/20070725131957
2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: filtering: true
2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: topN: 50000
2007-07-25 13:19:57,095 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
2007-07-25 14:42:03,509 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
2007-07-25 14:42:13,909 INFO crawl.Generator - Generator: done.
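To see which phase is eating the time, here is a minimal sketch that measures the gaps between the timestamps in the log above. It is not part of Nutch; the helper names (`parse_timestamp`, `phase_durations`) are made up for illustration. It also assumes the "Partitioning selected urls" line is written when the partition job starts, so the gap before it is roughly the select job.

```python
from datetime import datetime

# Three lines copied from the Generator log above.
LOG_LINES = [
    "2007-07-25 13:19:57,040 INFO crawl.Generator - Generator: starting",
    "2007-07-25 14:42:03,509 INFO crawl.Generator - Generator: "
    "Partitioning selected urls by host, for politeness.",
    "2007-07-25 14:42:13,909 INFO crawl.Generator - Generator: done.",
]


def parse_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' stamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


def phase_durations(lines):
    """Return the seconds elapsed between each pair of consecutive lines."""
    stamps = [parse_timestamp(line) for line in lines]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    # Assumption: everything before the "Partitioning" line is the select job.
    before_partition, partition = phase_durations(LOG_LINES)
    print(f"before partitioning: {before_partition / 60:.1f} min")
    print(f"partitioning:        {partition:.1f} s")
```

On this log it reports about 82 minutes before the partitioning message and about 10 seconds between that message and "done".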
Luca Rondanini

Emmanuel wrote:
> The partition is slow.
>
> Any idea?
>
>> On 7/25/07, Emmanuel <[EMAIL PROTECTED]> wrote:
>>
>>> I've got the same problem. It takes ages to generate a list of urls to
>>> fetch with a DB of 2M urls.
>>>
>>> Do you mean that it will be faster if we configure the generation by IP?
>>
>>
>> No, it should be faster if you partition by host :)
>>
>> Which job is slow, select or partition?
>>
>>>
>>> I've read the nutch-default.xml. It says that we have to be careful
>>> because it can generate a lot of DNS requests to the host.
>>> Does it mean that I have to configure a local DNS cache on my server?
>>>
>>> > On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
>>> >> Hi all,
>>> >> The generate step of my crawl process is taking more than 2 hours...
>>> >> is it normal?
>>> >
>>> > Are you partitioning urls by ip or by host?
>>> >
>>> >>
>>> >> this is my stat report:
>>> >>
>>> >> CrawlDb statistics start: crawl/crawldb
>>> >> Statistics for CrawlDb: crawl/crawldb
>>> >> TOTAL urls: 586860
>>> >> retry 0: 578159
>>> >> retry 1: 1983
>>> >> retry 2: 2017
>>> >> retry 3: 4701
>>> >> min score: 0.0
>>> >> avg score: 0.0
>>> >> max score: 1.0
>>> >> status 1 (db_unfetched): 164849
>>> >> status 2 (db_fetched): 417306
>>> >> status 3 (db_gone): 4701
>>> >> status 5 (db_redir_perm): 4
>>> >> CrawlDb statistics: done
>>> >>
>>> >> Luca Rondanini
>>> >
>>> > --
>>> > Doğacan Güney
>>> >
>>
>> --
>> Doğacan Güney
>
_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general