Correction to my previous post. I'd said:
When you use the FetchListTool to emit multiple lists, it
intentionally divides up the list using the MD5 value for the link,
so that you get hosts scattered between the lists. But for a single
list, this doesn't happen, and thus the max threads/host value winds
up causing a lot of the threads to spend their time idling, if your
crawl (like mine) is focused.
But actually it intentionally puts all of a host's URLs into the same list - I
missed the piece of code that extracts the host portion of the URL before
calculating the MD5. But without leveraging HTTP 1.1 keep-alive
support, I don't see why that's a win (rather, it seems to cause more
problems). Guess I need to dig into this more.
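For what it's worth, here's a rough sketch of what that host-based partitioning
amounts to - this is not the actual FetchListTool code, just an illustration
with made-up names: hash the host portion of the URL rather than the whole
link, so every URL from a given host lands in the same fetch list.

  import java.net.URL;
  import java.security.MessageDigest;

  public class HostPartitionSketch {
      // Pick a fetch list for a URL from an MD5 hash of its host, so all
      // URLs from the same host land in the same list.
      static int listFor(String link, int numLists) throws Exception {
          String host = new URL(link).getHost();   // host portion only
          byte[] md5 = MessageDigest.getInstance("MD5")
                                    .digest(host.getBytes("UTF-8"));
          // Fold the first four digest bytes into a non-negative int.
          int hash = ((md5[0] & 0xff) << 24) | ((md5[1] & 0xff) << 16)
                   | ((md5[2] & 0xff) << 8)  | (md5[3] & 0xff);
          return (hash & Integer.MAX_VALUE) % numLists;
      }

      public static void main(String[] args) throws Exception {
          // Same host, so both print the same list index.
          System.out.println(listFor("http://example.com/a.html", 4));
          System.out.println(listFor("http://example.com/b.html", 4));
      }
  }

Since all of a host's URLs then sit in one list, the max threads/host setting
ends up governing how fast that list drains, which I assume is where the
idling comes from.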
-- Ken
On 9/28/05, AJ Chen <[EMAIL PROTECTED]> wrote:
I started the crawler with about 2000 sites. The fetcher could achieve
7 pages/sec initially, but the performance gradually dropped to about 2
pages/sec, sometimes even 0.5 pages/sec. The fetch list had 300k pages
and I used 500 threads. What are the main causes of this slowdown?
Below are some sample status lines:
050927 005952 status: segment 20050927005922, 100 pages, 3 errors,
1784615 bytes, 14611 ms
050927 005952 status: 6.8441586 pages/s, 954.2334 kb/s, 17846.15 bytes/page
050927 010005 status: segment 20050927005922, 200 pages, 9 errors,
3656863 bytes, 28170 ms
050927 010005 status: 7.0997515 pages/s, 1014.1726 kb/s, 18284.314 bytes/page
after sometime ...
050927 171818 status: segment 20050927070752, 101400 pages, 7201 errors,
2593026554 bytes, 36216316 ms
050927 171818 status: 2.799843 pages/s, 559.3617 kb/s, 25572.254 bytes/page
050927 171832 status: segment 20050927070752, 101500 pages, 7204 errors,
2595591632 bytes, 36230516 ms
050927 171832 status: 2.8015058 pages/s, 559.6956 kb/s, 25572.332 bytes/page
Thanks,
AJ
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200