Hi,
these are the crawldb stats:
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 871597
retry 0: 816227
retry 1: 1
retry 3: 55369
min score: 0.0
avg score: 0.0
max score: 22.0
status 1 (db_unfetched): 2
status 2 (db_fetched): 816179
status 3 (db_gone): 55369
status 5 (db_redir_perm): 47
CrawlDb statistics: done
and this is the top output (while selecting):
top - 13:06:48 up 4 days, 23:42, 2 users, load average: 1.93, 1.86, 1.78
Tasks: 72 total, 2 running, 70 sleeping, 0 stopped, 0 zombie
Cpu(s): 99.0%us, 1.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 970596k total, 959372k used, 11224k free, 2592k buffers
Swap: 1951856k total, 15692k used, 1936164k free, 748784k cached
Doğacan Güney wrote:
Hi,
On 7/26/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
Hi Doğacan,
You can find the log at http://www.translated.net/hadoop.log.generate
It's just the output of the generate step...
...maybe you can help me summarize the log's key points for future
readers!
It's selecting (not partitioning) that takes most of the time, which
is actually expected since selecting has to process the entire
crawldb. Still, more than 1 hour spent in Selector's map is weird.
What is the size of your crawldb? Also, have you measured memory
consumption (perhaps selecting hits swap a lot)?
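For reference, here is roughly the per-entry work the select map phase
does: filter the URL, skip entries not yet due for fetching, and
compute a sort score for topN selection. This is a simplified,
self-contained sketch; the class, pattern and method names are
illustrative, not the actual Generator.Selector code:

    import java.util.regex.Pattern;

    // Simplified model of the select map phase's per-URL work.
    public class SelectSketch {
        // Stand-in for the configured URLFilters chain; a real regex
        // filter chain can be far more expensive per URL than this.
        private static final Pattern FILTER = Pattern.compile("^https?://");

        // Returns the sort score for a due, unfiltered URL, or -1 to skip it.
        static float select(String url, long fetchTime, float score, long curTime) {
            if (!FILTER.matcher(url).find()) return -1f; // filtered out
            if (fetchTime > curTime) return -1f;         // not due for fetch yet
            return score;                                // sort value for topN
        }

        public static void main(String[] args) {
            long now = System.currentTimeMillis();
            System.out.println(select("http://example.com/", 0L, 1.0f, now));
        }
    }

The point is that every crawldb entry goes through this path even
though only topN URLs come out, so a slow filter chain (or swapping)
multiplies across the entire db.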
Thanks,
Luca
Luca Rondanini
Doğacan Güney wrote:
On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
this is my hadoop log(just in case):
2007-07-25 13:19:57,040 INFO crawl.Generator - Generator: starting
2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: segment: /home/semantix/nutch-0.9/crawl/segments/20070725131957
2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: filtering: true
2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: topN: 50000
2007-07-25 13:19:57,095 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
2007-07-25 14:42:03,509 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
2007-07-25 14:42:13,909 INFO crawl.Generator - Generator: done.
Wow, partitioning 50000 URLs takes 1.5 hours! Looking at the code I
don't see any obvious mistakes, but perhaps I am missing something.
Could you lower Hadoop's log level to INFO (change it in
log4j.properties from WARN to INFO)? IIRC, at INFO level, Hadoop shows
how much of the maps and reduces have completed, so we may see what is
taking so long. If you can attach a profiler and show us which method
is taking all the time, that would make things *a lot* easier.
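For the log level, the relevant line in conf/log4j.properties should
look something like this (the exact logger name in your copy may
differ, so check the file):

    log4j.logger.org.apache.hadoop=INFO

For the profiler, since your jobtracker is 'local' the whole job runs
in one JVM, so the JVM's built-in hprof sampler is enough. Assuming
your bin/nutch script passes NUTCH_OPTS through to the JVM (an
assumption; check the script), something like:

    NUTCH_OPTS="-agentlib:hprof=cpu=samples,depth=10" \
      bin/nutch generate crawl/crawldb crawl/segments -topN 50000

should leave a java.hprof.txt listing the hottest methods once the run
finishes.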
Luca Rondanini
Emmanuel wrote:
The partition is slow.
Any ideas?
On 7/25/07, Emmanuel <[EMAIL PROTECTED]> wrote:
I've got the same problem. It takes ages to generate a list of URLs to
fetch with a DB of 2M URLs.
Do you mean that it will be faster if we configure the generation by IP?
No, it should be faster if you partition by host :)
Which job is slow, select or partition?
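To illustrate, partitioning by host boils down to something like the
sketch below (illustrative only, not the actual Nutch partitioner
class): every URL from one host hashes to the same partition, so
per-host politeness is preserved without the per-host DNS lookups that
by-IP partitioning needs (that is what the nutch-default.xml warning
about DNS requests refers to).

    import java.net.MalformedURLException;
    import java.net.URL;

    // Illustrative host-based partitioner: same host => same partition.
    public class HostPartitionSketch {
        static int getPartition(String url, int numPartitions)
                throws MalformedURLException {
            String host = new URL(url).getHost().toLowerCase();
            // Mask the sign bit so the modulo result is non-negative.
            return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }

        public static void main(String[] args) throws Exception {
            System.out.println(getPartition("http://example.com/page", 4));
        }
    }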
I've read nutch-default.xml. It said that we have to be careful
because it can generate a lot of DNS requests to the hosts.
Does it mean that I have to configure a local caching DNS on my
server?
On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
Hi all,
The generate step of my crawl process is taking more than 2 hours...
is it normal?
Are you partitioning URLs by IP or by host?
this is my stat report:
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 586860
retry 0: 578159
retry 1: 1983
retry 2: 2017
retry 3: 4701
min score: 0.0
avg score: 0.0
max score: 1.0
status 1 (db_unfetched): 164849
status 2 (db_fetched): 417306
status 3 (db_gone): 4701
status 5 (db_redir_perm): 4
CrawlDb statistics: done
Luca Rondanini
--
Doğacan Güney