Hi,
these are the crawldb stats:
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 871597
retry 0: 816227
retry 1: 1
retry 3: 55369
min score: 0.0
avg score: 0.0
max score: 22.0
status 1 (db_unfetched): 2
status 2 (db_fetched): 816179
status 3 (db_gone): 55369
status 5 (db_redir_perm): 47
CrawlDb statistics: done
and this is the top output (while selecting):
top - 13:06:48 up 4 days, 23:42, 2 users, load average: 1.93, 1.86, 1.78
Tasks: 72 total, 2 running, 70 sleeping, 0 stopped, 0 zombie
Cpu(s): 99.0%us, 1.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 970596k total, 959372k used, 11224k free, 2592k buffers
Swap: 1951856k total, 15692k used, 1936164k free, 748784k cached
Doğacan Güney wrote:
Hi,
On 7/26/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
Hi Doğacan,
You can find the log at http://www.translated.net/hadoop.log.generate
It's just the output of the generate step...
...maybe you can help me summarize the log's key points for future
readers!
It's selecting (not partitioning) that takes most of the time, which
is actually expected since selecting has to process the entire
crawldb. Still, more than 1 hour spent in Selector's map is weird.
What is the size of your crawldb? Also, have you measured memory
consumption (perhaps selecting hits swap a lot)?
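For reference, here is roughly the per-entry work the select map phase
does: filter the URL, skip entries not yet due for fetching, and
compute a sort score for topN selection. This is a simplified,
self-contained sketch; the class, pattern and method names are
illustrative, not the actual Generator.Selector code:

    import java.util.regex.Pattern;

    // Simplified model of the select map phase's per-URL work.
    public class SelectSketch {
        // Stand-in for the configured URLFilters chain; a real regex
        // filter chain can be far more expensive per URL than this.
        private static final Pattern FILTER = Pattern.compile("^https?://");

        // Returns the sort score for a due, unfiltered URL, or -1 to skip it.
        static float select(String url, long fetchTime, float score, long curTime) {
            if (!FILTER.matcher(url).find()) return -1f; // filtered out
            if (fetchTime > curTime) return -1f;         // not due for fetch yet
            return score;                                // sort value for topN
        }

        public static void main(String[] args) {
            long now = System.currentTimeMillis();
            System.out.println(select("http://example.com/", 0L, 1.0f, now));
        }
    }

The point is that every crawldb entry goes through this path even
though only topN URLs come out, so a slow filter chain (or swapping)
multiplies across the entire db.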
Thanks,
Luca
Luca Rondanini
Doğacan Güney wrote:
On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
this is my hadoop log(just in case):
2007-07-25 13:19:57,040 INFO crawl.Generator - Generator: starting
2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: segment: /home/semantix/nutch-0.9/crawl/segments/20070725131957
2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: filtering: true
2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: topN: 50000
2007-07-25 13:19:57,095 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
2007-07-25 14:42:03,509 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
2007-07-25 14:42:13,909 INFO crawl.Generator - Generator: done.
Wow, partitioning 50000 URLs takes 1.5 hours! Looking at the code I
don't see any obvious mistakes, but perhaps I am missing something.
Could you lower Hadoop's log level to INFO (change it in
log4j.properties from WARN to INFO)? IIRC, at INFO level, Hadoop shows
how much of the maps and reduces have completed, so we may see what is
taking so long. If you can attach a profiler and show us which method
is taking all the time, that would make things *a lot* easier.
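For the log level, the relevant line in conf/log4j.properties should
look something like this (the exact logger name in your copy may
differ, so check the file):

    log4j.logger.org.apache.hadoop=INFO

For the profiler, since your jobtracker is 'local' the whole job runs
in one JVM, so the JVM's built-in hprof sampler is enough. Assuming
your bin/nutch script passes NUTCH_OPTS through to the JVM (an
assumption; check the script), something like:

    NUTCH_OPTS="-agentlib:hprof=cpu=samples,depth=10" \
      bin/nutch generate crawl/crawldb crawl/segments -topN 50000

should leave a java.hprof.txt listing the hottest methods once the run
finishes.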
Luca Rondanini
Emmanuel wrote:
The partition is slow.
Any ideas?
On 7/25/07, Emmanuel <[EMAIL PROTECTED]> wrote:
I've got the same problem. It takes ages to generate a list of URLs to
fetch with a DB of 2M URLs.
Do you mean that it will be faster if we configure the generation by IP?
No, it should be faster if you partition by host :)
Which job is slow, select or partition?
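To illustrate, partitioning by host boils down to something like the
sketch below (illustrative only, not the actual Nutch partitioner
class): every URL from one host hashes to the same partition, so
per-host politeness is preserved without the per-host DNS lookups that
by-IP partitioning needs (that is what the nutch-default.xml warning
about DNS requests refers to).

    import java.net.MalformedURLException;
    import java.net.URL;

    // Illustrative host-based partitioner: same host => same partition.
    public class HostPartitionSketch {
        static int getPartition(String url, int numPartitions)
                throws MalformedURLException {
            String host = new URL(url).getHost().toLowerCase();
            // Mask the sign bit so the modulo result is non-negative.
            return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }

        public static void main(String[] args) throws Exception {
            System.out.println(getPartition("http://example.com/page", 4));
        }
    }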
I've read nutch-default.xml. It said that we have to be careful
because it can generate a lot of DNS requests to the hosts.
Does it mean that I have to configure a local caching DNS on my
server?
On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
Hi all,
The generate step of my crawl process is taking more than 2 hours...
is it normal?
Are you partitioning URLs by IP or by host?
this is my stat report:
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 586860
retry 0: 578159
retry 1: 1983
retry 2: 2017
retry 3: 4701
min score: 0.0
avg score: 0.0
max score: 1.0
status 1 (db_unfetched): 164849
status 2 (db_fetched): 417306
status 3 (db_gone): 4701
status 5 (db_redir_perm): 4
CrawlDb statistics: done
Luca Rondanini
--
Doğacan Güney