We did some analysis on 0.8:
- generate and updatedb work just fine, while fetch is extremely slow.
It takes 2 sec to fetch one page, compared to 0.7, which fetched 20 pages
per sec.
- during fetch the box is at 100% CPU (3 GHz Pentium), which is quite odd.
- we checked the log: URL fetching proceeds normally until the
"crawl.SignatureFactory - Using Signature impl:
org.apache.nutch.crawl.MD5Signature" log entry. After that, fetching
slows down.
- we injected only two URLs and also set both of them in the regexp filter.
Hope this helps someone.
One question, though: does anyone know how to set more verbose logging?
Thanks.
2006-08-01 19:58:37,576 INFO fetcher.Fetcher - fetching http://www.foo.com/faq.php
2006-08-01 19:58:37,599 INFO http.Http - http.proxy.host = null
2006-08-01 19:58:37,599 INFO http.Http - http.proxy.port = 8080
2006-08-01 19:58:37,599 INFO http.Http - http.timeout = 10000
2006-08-01 19:58:37,600 INFO http.Http - http.content.limit = 65536
2006-08-01 19:58:37,600 INFO http.Http - http.agent = siBot/siBot-0.1 (http://www.foo.com/; [EMAIL PROTECTED])
2006-08-01 19:58:37,600 INFO http.Http - fetcher.server.delay = 5000
2006-08-01 19:58:37,600 INFO http.Http - http.max.delays = 100
2006-08-01 19:58:38,103 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2006-08-01 19:58:38,145 INFO fetcher.Fetcher - fetching http://www.foo.com/izobrazevanje.php
2006-08-01 19:58:43,569 INFO fetcher.Fetcher - fetching http://www.foo.com/kontakti.php
2006-08-01 19:58:48,624 INFO fetcher.Fetcher - fetching http://www.foo.com/portfolio_mailing.php
2006-08-01 19:58:53,553 INFO fetcher.Fetcher - fetching http://www.foo.com/online_katalogi.php
2006-08-01 19:58:58,597 INFO fetcher.Fetcher - fetching http://www.foo.com/postavitev_sistemov.php
2006-08-01 19:59:03,592 INFO fetcher.Fetcher - fetching http://www.foo.com/internet_aplikacije.php
2006-08-01 19:59:08,655 INFO fetcher.Fetcher - fetching http://www.foo.com/gradivo.php
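To put numbers on the slowdown in the log above, here is a minimal Python sketch that measures the gap between consecutive "fetching" entries. The timestamps and the fetcher.server.delay value are copied from the log; the names FETCH_TIMES, SERVER_DELAY_MS and gaps_ms are only illustrative and are not part of Nutch.

```python
from datetime import datetime, timedelta

# Timestamps of the "fetcher.Fetcher - fetching" entries, copied from the log.
FETCH_TIMES = [
    "2006-08-01 19:58:37,576",
    "2006-08-01 19:58:38,145",
    "2006-08-01 19:58:43,569",
    "2006-08-01 19:58:48,624",
    "2006-08-01 19:58:53,553",
    "2006-08-01 19:58:58,597",
    "2006-08-01 19:59:03,592",
    "2006-08-01 19:59:08,655",
]

# fetcher.server.delay as reported by http.Http in the same log (milliseconds).
SERVER_DELAY_MS = 5000


def gaps_ms(timestamps):
    """Return the whole-millisecond gaps between consecutive timestamps."""
    times = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S,%f") for t in timestamps]
    one_ms = timedelta(milliseconds=1)
    return [(later - earlier) // one_ms for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    for gap in gaps_ms(FETCH_TIMES):
        print(f"{gap:5d} ms  (difference from server delay: {gap - SERVER_DELAY_MS:+d} ms)")
```

Apart from the first gap, every interval is within about half a second of fetcher.server.delay = 5000. Since all of these URLs are on the same host, it may be that the per-host delay accounts for most of the waiting in this particular test. That is only a reading of the timestamps, and it would not explain the 100% CPU.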
ATB,
Vasja
Stefan Groschupf wrote:
Check:
http://issues.apache.org/jira/browse/NUTCH-233
and let us know if it helps.
Stefan
On 31.07.2006 at 07:46, Matthew Holt wrote:
Fetcher for one, and the mapreduce takes forever... i.e. the mapreduce
is kind of annoying... is it possible to disable it if I'm not
running on a DFS?
Matt
06/07/25 20:59:12 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:14 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:19 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:23 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:29 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:33 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:34 INFO mapred.JobClient: map 100% reduce 96%
06/07/25 20:59:40 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:41 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:42 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:47 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:48 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:52 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:53 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 21:00:05 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 21:00:22 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 21:00:29 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 21:00:39 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 21:01:07 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 21:01:08 INFO mapred.JobClient: map 100% reduce 97%
06/07/25 21:01:16 INFO mapred.LocalJobRunner: reduce > reduce
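As a rough way to quantify "takes forever", the sketch below pulls the JobClient progress lines out of a log excerpt and computes how many seconds elapsed per percentage point of reduce. The four log lines are copied from the output above; the regular expression and the function names reduce_progress and seconds_per_percent are assumptions made for this illustration, not anything provided by Hadoop or Nutch.

```python
import re
from datetime import datetime

# Excerpt copied from the log above: the two JobClient progress lines
# plus the LocalJobRunner lines immediately before them.
LOG = """\
06/07/25 20:59:33 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 20:59:34 INFO mapred.JobClient: map 100% reduce 96%
06/07/25 21:01:07 INFO mapred.LocalJobRunner: reduce > reduce
06/07/25 21:01:08 INFO mapred.JobClient: map 100% reduce 97%
"""

# Matches only JobClient progress lines; whitespace is matched loosely
# because mail clients may have altered the original spacing.
PROGRESS = re.compile(
    r"^(\S+ \S+)\s+INFO\s+mapred\.JobClient:\s+map\s+(\d+)%\s+reduce\s+(\d+)%"
)


def reduce_progress(log_text):
    """Return (timestamp, reduce_percent) pairs for each progress line."""
    points = []
    for line in log_text.splitlines():
        match = PROGRESS.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%y/%m/%d %H:%M:%S")
            points.append((when, int(match.group(3))))
    return points


def seconds_per_percent(points):
    """Average seconds per reduce percentage point between first and last sample."""
    (t0, p0), (t1, p1) = points[0], points[-1]
    return (t1 - t0).total_seconds() / (p1 - p0)


if __name__ == "__main__":
    print(seconds_per_percent(reduce_progress(LOG)))
```

This covers a single percentage point near the end of the job, so it should not be extrapolated to the whole reduce phase.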
Sami Siren wrote:
Are you experiencing slowness in general or just in some parts of
the process?
The current fetcher is dead slow and it should be given immediate
attention. There has been some talk about the issue but I haven't
seen any code yet.
-- Sami Siren
Matthew Holt wrote:
I agree. Is there any way to disable something to speed it up? I.e., is
the map reduce currently needed if we're not on a DFS?
Matt
Vasja Ocvirk wrote:
Hello,
I'm wondering if anyone can help. We injected 1000 seed URLs into
Nutch 0.7.2 (basic configuration + 1000 URLs in the regexp filter) and
it processed them in just a few hours. We just switched to 0.8 with
the same configuration and the same URLs, but everything seems to have
slowed down significantly. The crawl script has 60 threads, same as
before, but now it works much slower.
Thanks!
Best,
Vasja
__________ NOD32 1.1533 (20060512) Information __________
This message was checked by NOD32 antivirus system.
http://www.eset.com