Here is some more of the log. The crawl seems to slow down at the
mapred.LocalJobRunner status lines:
2006-08-02 10:12:28,160 INFO mapred.LocalJobRunner - 36 pages, 0
errors, 0.3 pages/s, 51 kb/s,
2006-08-02 10:12:28,900 DEBUG http.Http - fetching
http://www.foo.com/internet_aplikacije.php
2006-08-02 10:12:28,918 DEBUG http.Http - fetched 25812 bytes from
http://www.foo.com/internet_aplikacije.php
2006-08-02 10:12:28,920 DEBUG parse.ParseUtil - Parsing
[http://www.foo.com/internet_aplikacije.php] with
[EMAIL PROTECTED]
2006-08-02 10:12:28,920 DEBUG parse.html -
http://www.foo.com/internet_aplikacije.php: setting encoding to ISO-8859-2
2006-08-02 10:12:28,920 DEBUG parse.html - Parsing...
2006-08-02 10:12:28,932 DEBUG parse.html - Meta tags for
http://www.foo.com/internet_aplikacije.php: base=null, noCache=false,
noFollow=false, noIndex=false, refresh=false, refreshHref=null
* general tags:
- keywords =
cms,aplikacija,modul,vodenje,kontaktov,trgovina,upravljanje,voščilnice,anketa,internet
trgovina,izdelava
- author = Foo
- description = Aplikacija za vodenje kontaktov. CMS -
Sistem za upravljanje z vsebinami.
- robots = INDEX,FOLLOW
* http-equiv tags:
- content-type = text/html; charset=iso-8859-2
2006-08-02 10:12:28,932 DEBUG parse.html - Getting text...
2006-08-02 10:12:28,938 DEBUG parse.html - Getting title...
2006-08-02 10:12:28,938 DEBUG parse.html - Getting links...
2006-08-02 10:12:28,942 DEBUG parse.html - found 160 outlinks in
http://www.foo.com/internet_aplikacije.php
2006-08-02 10:12:29,162 INFO mapred.LocalJobRunner - 37 pages, 0
errors, 0.3 pages/s, 52 kb/s,
2006-08-02 10:12:30,164 INFO mapred.LocalJobRunner - 37 pages, 0
errors, 0.3 pages/s, 52 kb/s,
2006-08-02 10:12:31,166 INFO mapred.LocalJobRunner - 37 pages, 0
errors, 0.3 pages/s, 51 kb/s,
2006-08-02 10:12:32,168 INFO mapred.LocalJobRunner - 37 pages, 0
errors, 0.3 pages/s, 51 kb/s,
2006-08-02 10:12:33,170 INFO mapred.LocalJobRunner - 37 pages, 0
errors, 0.3 pages/s, 50 kb/s,
2006-08-02 10:12:33,918 DEBUG http.Http - fetching
http://www.foo.com/mediji.php
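
To see where the time between two fetches actually goes, the excerpt can be split into "work on the page" and "waiting". The following is only a rough sketch: the function names (events, busy_and_idle) and the sample list are made up for illustration, the sample is a shortened copy of the lines above, and it assumes every log event starts with a timestamp while wrapped continuation lines (URLs, meta tags) do not.

```python
import re
from datetime import datetime

# Timestamp, level, logger name, then the message after " - ".
STAMP = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (\w+) (\S+) - (.*)$"
)

def events(lines):
    """Yield (timestamp, logger, message) for lines that start with a timestamp.

    Wrapped continuation lines are skipped because they do not match.
    """
    for line in lines:
        m = STAMP.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
            yield ts, m.group(3), m.group(4)

def busy_and_idle(lines):
    """For each pair of consecutive fetch starts, return (busy, idle) in seconds.

    busy: first fetch start until the last non-status event logged before the
          next fetch. idle: that last event until the next fetch starts.
    """
    # The periodic LocalJobRunner status lines are not work on a page.
    evs = [(ts, msg) for ts, logger, msg in events(lines)
           if logger != "mapred.LocalJobRunner"]
    starts = [i for i, (_, msg) in enumerate(evs) if msg.startswith("fetching")]
    result = []
    for a, b in zip(starts, starts[1:]):
        last_work = evs[b - 1][0]
        busy = (last_work - evs[a][0]).total_seconds()
        idle = (evs[b][0] - last_work).total_seconds()
        result.append((busy, idle))
    return result

# Shortened copy of the excerpt above (illustrative only).
sample = [
    "2006-08-02 10:12:28,900 DEBUG http.Http - fetching",
    "http://www.foo.com/internet_aplikacije.php",
    "2006-08-02 10:12:28,918 DEBUG http.Http - fetched 25812 bytes from",
    "2006-08-02 10:12:28,942 DEBUG parse.html - found 160 outlinks in",
    "2006-08-02 10:12:29,162 INFO mapred.LocalJobRunner - 37 pages, 0",
    "2006-08-02 10:12:33,170 INFO mapred.LocalJobRunner - 37 pages, 0",
    "2006-08-02 10:12:33,918 DEBUG http.Http - fetching",
]

pairs = busy_and_idle(sample)
for busy, idle in pairs:
    print(f"busy {busy:.3f} s, idle {idle:.3f} s")
```

On the lines above this comes to about 0.04 s of fetching and parsing followed by almost 5 s in which only the LocalJobRunner status lines are logged, so the status lines themselves may not be what is slow.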
ATB,
Vasja
Zaheed Haque wrote:
>> One question, though: does anyone know how to set more verbose logging?
> You can edit your log4j properties under nutch/conf to enable DEBUG
> mode for both hadoop and nutch.
> Cheers

Thanks.
2006-08-01 19:58:37,576 INFO fetcher.Fetcher - fetching
http://www.foo.com/faq.php
2006-08-01 19:58:37,599 INFO http.Http - http.proxy.host = null
2006-08-01 19:58:37,599 INFO http.Http - http.proxy.port = 8080
2006-08-01 19:58:37,599 INFO http.Http - http.timeout = 10000
2006-08-01 19:58:37,600 INFO http.Http - http.content.limit = 65536
2006-08-01 19:58:37,600 INFO http.Http - http.agent = siBot/siBot-0.1
(http://www.foo.com/; [EMAIL PROTECTED])
2006-08-01 19:58:37,600 INFO http.Http - fetcher.server.delay = 5000
2006-08-01 19:58:37,600 INFO http.Http - http.max.delays = 100
2006-08-01 19:58:38,103 INFO crawl.SignatureFactory - Using Signature
impl: org.apache.nutch.crawl.MD5Signature
2006-08-01 19:58:38,145 INFO fetcher.Fetcher - fetching
http://www.foo.com/izobrazevanje.php
2006-08-01 19:58:43,569 INFO fetcher.Fetcher - fetching
http://www.foo.com/kontakti.php
2006-08-01 19:58:48,624 INFO fetcher.Fetcher - fetching
http://www.foo.com/portfolio_mailing.php
2006-08-01 19:58:53,553 INFO fetcher.Fetcher - fetching
http://www.foo.com/online_katalogi.php
2006-08-01 19:58:58,597 INFO fetcher.Fetcher - fetching
http://www.foo.com/postavitev_sistemov.php
2006-08-01 19:59:03,592 INFO fetcher.Fetcher - fetching
http://www.foo.com/internet_aplikacije.php
2006-08-01 19:59:08,655 INFO fetcher.Fetcher - fetching
http://www.foo.com/gradivo.php
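
The fetches above are spaced about five seconds apart, which can be checked against the fetcher.server.delay value printed in the same log. This is a small sketch under assumptions: the logged 5000 is read as milliseconds, fetch_intervals and the sample list are hypothetical names, and the sample is a shortened copy of the lines above.

```python
import re
from datetime import datetime
from statistics import median

# INFO lines only: timestamp, logger name, message after " - ".
LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO (\S+) - (.*)$"
)

def fetch_intervals(lines):
    """Return (logged delay in seconds or None, seconds between fetch starts)."""
    delay, starts = None, []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # wrapped URL lines carry no timestamp
        ts, logger, msg = m.groups()
        if logger == "http.Http" and msg.startswith("fetcher.server.delay = "):
            # Assumption: the logged value is in milliseconds.
            delay = int(msg.split("=")[1]) / 1000.0
        elif logger == "fetcher.Fetcher" and msg.startswith("fetching"):
            starts.append(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f"))
    gaps = [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]
    return delay, gaps

# Shortened copy of the log above (illustrative only).
sample = [
    "2006-08-01 19:58:37,576 INFO fetcher.Fetcher - fetching",
    "2006-08-01 19:58:37,600 INFO http.Http - fetcher.server.delay = 5000",
    "2006-08-01 19:58:38,145 INFO fetcher.Fetcher - fetching",
    "2006-08-01 19:58:43,569 INFO fetcher.Fetcher - fetching",
    "2006-08-01 19:58:48,624 INFO fetcher.Fetcher - fetching",
    "2006-08-01 19:58:53,553 INFO fetcher.Fetcher - fetching",
    "2006-08-01 19:58:58,597 INFO fetcher.Fetcher - fetching",
    "2006-08-01 19:59:03,592 INFO fetcher.Fetcher - fetching",
    "2006-08-01 19:59:08,655 INFO fetcher.Fetcher - fetching",
]

delay, gaps = fetch_intervals(sample)
print("logged delay (s):", delay)
print("median gap between fetches (s):", median(gaps))
```

If the median gap sits close to the logged delay, as it appears to here, the pace would be consistent with the per-server delay being applied because every URL in this excerpt is on the same host, rather than with the fetcher threads being busy. That reading is a guess from the log, not something confirmed in this thread.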
ATB,
Vasja
Stefan Groschupf wrote:
> Check:
> http://issues.apache.org/jira/browse/NUTCH-233
> and let us know if it helps.
> Stefan
>
>
> Am 31.07.2006 um 07:46 schrieb Matthew Holt:
>
>> The fetcher, for one, and the mapreduce takes forever... i.e. the
>> mapreduce is kind of annoying... Is it possible to disable it if I'm
>> not running on a DFS?
>> Matt
>>
>> 06/07/25 20:59:12 INFO mapred.LocalJobRunner: reduce > reduce
>> 06/07/25 20:59:14 INFO mapred.LocalJobRunner: reduce > reduce
>> 06/07/25 20:59:19 INFO mapred.LocalJobRunner: reduce > reduce
>> 06/07/25 20:59:23 INFO mapred.LocalJobRunner: reduce > reduce
>> 06/07/25 20:59:29 INFO mapred.LocalJobRunner: reduce > reduce
>> 06/07/25 20:59:33 INFO mapred.LocalJobRunner: reduce > reduce
>> 06/07/25 20:59:34 INFO mapred.JobClient: map 100% reduce 96%
>> 06/07/25 20:59:40 INFO mapred.LocalJobRunner: reduce > reduce
>> 06/07/25 20:59:41 INFO mapred.LocalJobRunner: reduce > reduce
>> 06/07/25 20:59:42 INFO mapred.LocalJobRunner: reduce > reduce
>> 06/07/25 20:59:47 INFO mapred.LocalJobRunner: reduce > reduce
>> 06/07/25 20:59:48 INFO mapred.LocalJobRunner: reduce > reduce
>> 06/07/25 20:59:52 INFO mapred.LocalJobRunner: reduce > reduce
>> 06/07/25 20:59:53 INFO mapred.LocalJobRunner: reduce > reduce
>> 06/07/25 21:00:05 INFO mapred.LocalJobRunner: reduce > reduce
>> 06/07/25 21:00:22 INFO mapred.LocalJobRunner: reduce > reduce
>> 06/07/25 21:00:29 INFO mapred.LocalJobRunner: reduce > reduce
>> 06/07/25 21:00:39 INFO mapred.LocalJobRunner: reduce > reduce
>> 06/07/25 21:01:07 INFO mapred.LocalJobRunner: reduce > reduce
>> 06/07/25 21:01:08 INFO mapred.JobClient: map 100% reduce 97%
>> 06/07/25 21:01:16 INFO mapred.LocalJobRunner: reduce > reduce
>>
>>
>> Sami Siren wrote:
>>> Are you experiencing slowness in general or just in some parts of
>>> the process?
>>>
>>> The current fetcher is dead slow and should be given immediate
>>> attention. There has been some talk about the issue, but I haven't
>>> seen any code yet.
>>>
>>> -- Sami Siren
>>>
>>> Matthew Holt wrote:
>>>> I agree. Is there any way to disable something to speed it up? I.e., is
>>>> the map reduce currently needed if we're not on a DFS?
>>>>
>>>> Matt
>>>>
>>>> Vasja Ocvirk wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I'm wondering if anyone can help. We injected 1000 seed URLs into
>>>>> Nutch 0.7.2 (basic configuration + 1000 URLs in the regexp filter)
>>>>> and it processed them in just a few hours. We just switched to 0.8
>>>>> with the same configuration and the same URLs, but everything seems
>>>>> to have slowed down significantly. The crawl script has 60 threads,
>>>>> the same as before, but now it runs much more slowly.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Best,
>>>>> Vasja
>>>>>
>>>>> __________ NOD32 1.1533 (20060512) Information __________
>>>>>
>>>>> This message was checked by NOD32 antivirus system.
>>>>> http://www.eset.com
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>