I have a few questions about Nutch. I am currently trying to understand how everything works.
* Is there any more documentation about Nutch than what I found on nutch.org and sourceforge.net?
* Would it be useful to set up a wiki where everybody could take part in writing detailed documentation?
* Is there any documentation on the format of banned-hosts.txt? If I add the line web.de, every host name that ends in web.de matches, so www.party-web.de is sorted out as well. Is this intended?
* And lastly, a question about the fetcher:
Sometimes my fetcher does not finish its job. For a long time I keep getting the message that a few elements still remain in the HostQueues.
In those cases I kill the fetcher and add a fetcher.done file so that I can continue.
I have attached the last output of the fetcher below. Maybe somebody can tell me what is going on.
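To make the banned-hosts.txt question concrete, here is a minimal sketch of the two matching behaviours I can imagine. It is not taken from the Nutch source; the function names `suffix_match` and `domain_match` are hypothetical and only illustrate the difference between a plain string-suffix test (what I seem to observe) and a test that respects the domain boundary (what I expected).

```python
def suffix_match(host: str, banned: str) -> bool:
    """Plain string-suffix test: the behaviour I seem to be observing."""
    return host.lower().endswith(banned.lower())


def domain_match(host: str, banned: str) -> bool:
    """Match only the banned domain itself or a real subdomain of it."""
    host, banned = host.lower(), banned.lower()
    return host == banned or host.endswith("." + banned)


# Compare both tests for the banned entry "web.de".
for host in ("web.de", "www.web.de", "www.party-web.de"):
    print(host, suffix_match(host, "web.de"), domain_match(host, "web.de"))
```

With the plain suffix test www.party-web.de is caught by the entry web.de; with the boundary-aware test only web.de and its subdomains are.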
Many thanks
Matthias Jaekle
------
040222 085525 STS RequestScheduler running for 33:06:03 (119163 seconds)
040222 085525 STS Requests (rate): 3481245 (29 req/sec)
040222 085525 STS fetchList: 1832066 (15 req/sec)
040222 085525 STS robots.txt: 1649179 (13 req/sec)
040222 085525 STS Retries (rate): 110449 (3% retry/tot)
040222 085525 STS fetchList: 10131 (0%)
040222 085525 STS robots.txt: 100318 (6%)
040222 085525 STS Redirects (rate): 163102 (4% redirect/tot)
040222 085525 STS fetchList: 85434 (4%)
040222 085525 STS robots.txt: 77668 (4%)
040222 085525 STS Succeeded (rate): 1759264 (50% succ/req)
040222 085525 STS fetchList: 1466871 (80%)
040222 085525 STS robots.txt: 292393 (17%)
040222 085525 STS Failures (not retryable):
040222 085525 STS All fetchList robots
040222 085525 STS Unknown Failure 0 0 0
040222 085525 STS Bad URL 0 0 0
040222 085525 STS Robots Excluded 32386 31550 836
040222 085525 STS Max Errors 50717 2176 48541
040222 085525 STS Max Redirects 562 208 354
040222 085525 STS Redirect Missing Target 1426 803 623
040222 085525 STS Not Found 730451 44404 686047
040222 085525 STS Forbidden 34044 14850 19194
040222 085525 STS Redirect Loop 2184 78 2106
040222 085525 STS Hostname Banned 13614 13614 0
040222 085525 STS Dead Host 189824 174954 14870
040222 085525 STS Unknown Response Code 2651 440 2211
040222 085525 STS Unknown Host 397672 0 397672
040222 085525 STS Connection Refused 6512 166 6346
040222 085525 STS Total 1462043 283243 1178800
040222 085525 STS Errors (retryable):
040222 085525 STS All fetchList robots
040222 085525 STS Unknown Error 15031 1256 13775
040222 085525 STS Connection Timed Out 0 0 0
040222 085525 STS Bad Header Line 756 18 738
040222 085525 STS Reset By Peer 1 1 0
040222 085525 STS Bad Status Line 348 122 226
040222 085525 STS EOF During Read 2590 390 2200
040222 085525 STS No Route to Host 17460 45 17415
040222 085525 STS Socket Timeout 123121 8821 114300
040222 085525 STS Network Unreachable 0 0 0
040222 085525 STS Bad Content-Length 1777 1590 187
040222 085525 STS Bad Chunk Length 0 0 0
040222 085525 STS EOF in chunk 31 22 9
040222 085525 STS Unzip Failed 51 42 9
040222 085525 STS Total 161166 12307 148859
040222 085525 STS Output stats:
040222 085525 STS Output OK 1729459
040222 085525 STS Unknown output error 0
040222 085525 STS DOM parse error 6343
040222 085525 STS DOM parser failed 9978
040222 085525 STS Unknown Content-Type 4331
040222 085525 STS Character Encoding Error 0
040222 085525 STS Total 1750111
040222 085525 STS Fetcher polling (all but Succeed cause delays):
040222 085525 STS Polls: 9110573
040222 085525 STS Succeeded: 3259035 (35%)
040222 085525 STS Host Qs Busy: 5829143 (63%)
040222 085525 STS Fetcher delays due to Output Q Full: 136575 (0%)
040222 085525 STS Requests added to Output Q (per add): 1750114 (1.07)
040222 085525 STS Output polling:
040222 085525 STS Polls: 1785260
040222 085525 STS Pops: 1750114
040222 085525 STS Pops w/o delay: 1090566 (61%)
040222 085525 STS Output Q empty: 694694 (38%)
040222 085525 STS actual content bytes fetched: 17335362345 (1136 kbits/s avg)
040222 085525 STS effective content bytes: 18865746130 (1236 kbits/s)
040222 085525 STS content bandwidth savings (compression): 1530383785 (8.1%)
040222 085525 STS effective fetchlist bytes fetched: 18047843476 (95%)
040222 085525 STS effective robots bytes fetched: 817902654 (4%)
040222 085525 STS raw bytes read: 19075594074 (1250 kbits/s)
040222 085525 STS raw bytes sent: 496781070 (32 kbits/s)
040222 085525 STS 1773805 requests have been read from the FetchList
040222 085525 STS 1750114 requests have been dispatched for output
040222 085525 STS 23663 requests have been dropped on the floor
040222 085525 STS HostQueue sizes:
040222 085525 STS ready: 0
040222 085525 STS idle: 20000
040222 085525 STS delay: 0
040222 085525 STS busy: 1
040222 085525 STS total: 20001
040222 085525 STS cached:64977
040222 085525 STS HostQueues contain 27 fetchList entries
040222 085525 STS FetchList is empty
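
A quick cross-check of the request counters in the last lines of the output. This is plain arithmetic on the logged numbers, not anything taken from the Nutch source, and the variable names are only mine:

```python
# Counters copied from the fetcher output above.
read_from_fetchlist = 1773805   # "requests have been read from the FetchList"
dispatched = 1750114            # "requests have been dispatched for output"
dropped = 23663                 # "requests have been dropped on the floor"
in_host_queues = 27             # "HostQueues contain 27 fetchList entries"

# Requests that were read but neither dispatched nor dropped.
unaccounted = read_from_fetchlist - dispatched - dropped
print(unaccounted)                   # 28
print(unaccounted - in_host_queues)  # 1
```

So 28 requests have been read but neither dispatched nor dropped. 27 of them would be the entries still sitting in the HostQueues; I assume the remaining one belongs to the single busy queue, but that is only a guess.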
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
