On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
> this is my hadoop log(just in case):
>
> 2007-07-25 13:19:57,040 INFO  crawl.Generator - Generator: starting
> 2007-07-25 13:19:57,041 INFO  crawl.Generator - Generator: segment:
> /home/semantix/nutch-0.9/crawl/segments/20070725131957
> 2007-07-25 13:19:57,041 INFO  crawl.Generator - Generator: filtering: true
> 2007-07-25 13:19:57,041 INFO  crawl.Generator - Generator: topN: 50000
> 2007-07-25 13:19:57,095 INFO  crawl.Generator - Generator: jobtracker is
> 'local', generating exactly one partition.
> 2007-07-25 14:42:03,509 INFO  crawl.Generator - Generator: Partitioning
> selected urls by host, for politeness.
> 2007-07-25 14:42:13,909 INFO  crawl.Generator - Generator: done.
>

Wow, partitioning 50000 URLs takes 1.5 hours! Looking at the code, I
don't see any obvious mistakes, but perhaps I am missing something.

Could you lower Hadoop's log level to INFO (change Hadoop's log level
in log4j.properties from WARN to INFO)? IIRC, at the INFO level, Hadoop
shows how much of the maps and reduces have completed, so we may see
what is taking so long. If you can attach a profiler and show us which
method is taking all the time, that would make things *a lot* easier.
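
For reference, a sketch of both steps; the logger name and the
NUTCH_OPTS variable are from memory, so please check them against your
conf/log4j.properties and bin/nutch before relying on them:

  # conf/log4j.properties -- raise Hadoop's verbosity from WARN to INFO
  log4j.logger.org.apache.hadoop=INFO

  # hypothetical profiling run with the JDK's built-in hprof sampler,
  # assuming bin/nutch passes NUTCH_OPTS through to the JVM
  export NUTCH_OPTS="-Xrunhprof:cpu=samples,depth=8"
  bin/nutch generate crawl/crawldb crawl/segments -topN 50000
  # CPU sample counts are written to java.hprof.txt in the working dir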


>
>
>
> Luca Rondanini
>
>
>
> Emmanuel wrote:
> > The partition is slow.
> >
> > Any idea ?
> >
> >> On 7/25/07, Emmanuel <[EMAIL PROTECTED]> wrote:
> >>
> >>> I've got the same problem. It takes ages to generate a list of
> >>> URLs to fetch with a DB of 2M URLs.
> >>>
> >>> Do you mean that it will be faster if we configure the generation
> >>> by IP?
> >>
> >>
> >> No, it should be faster if you partition by host :)
> >>
> >> Which job is slow, select or partition?
> >>
> >>>
> >>> I've read nutch-default.xml. It said that we have to be careful
> >>> because it can generate a lot of DNS requests to the host.
> >>> Does it mean that I have to configure a local caching DNS on my
> >>> server?
> >>>
> >>> > On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
> >>> >> Hi all,
> >>> >> The generate step of my crawl process is taking more than 2
> >>> >> hours... is it normal?
> >>> >
> >>> > Are you partitioning URLs by IP or by host?
> >>> >
> >>> >>
> >>> >> this is my stat report:
> >>> >>
> >>> >> CrawlDb statistics start: crawl/crawldb
> >>> >> Statistics for CrawlDb: crawl/crawldb
> >>> >> TOTAL urls:     586860
> >>> >> retry 0:        578159
> >>> >> retry 1:        1983
> >>> >> retry 2:        2017
> >>> >> retry 3:        4701
> >>> >> min score:      0.0
> >>> >> avg score:      0.0
> >>> >> max score:      1.0
> >>> >> status 1 (db_unfetched):        164849
> >>> >> status 2 (db_fetched):  417306
> >>> >> status 3 (db_gone):     4701
> >>> >> status 5 (db_redir_perm):       4
> >>> >> CrawlDb statistics: done
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >> Luca Rondanini
> >>> >>
> >>> >>
> >>> >
> >>> >
> >>> > --
> >>> > Doğacan Güney
> >>> >
> >>>
> >>
> >>
> >> --
> >> Doğacan Güney
> >>
> >
>
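
On the by-host vs. by-IP question quoted above, a minimal sketch of a
nutch-site.xml override, assuming the generate.max.per.host.by.ip
property documented in nutch-default.xml (false groups URLs by host
name, which avoids a DNS lookup per URL; true resolves each URL's host
to an IP, which is what the DNS warning is about):

  <!-- conf/nutch-site.xml: a sketch, verify the property name against
       your version's nutch-default.xml -->
  <property>
    <name>generate.max.per.host.by.ip</name>
    <value>false</value>
  </property>

If you really need by-IP grouping, a local caching DNS resolver should
take the edge off the lookup volume.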


-- 
Doğacan Güney