Correction to my previous post. I'd said:

When you use the FetchListTool to emit multiple lists, it intentionally splits the list using the MD5 value of the link, so that you get hosts scattered across the lists. But for a single list this doesn't happen, and thus, if your crawl (like mine) is focused, the max threads/host value winds up causing a lot of the threads to spend their time idling.

But actually it's intentionally putting hosts in the same list - I missed the piece of code where it extracts the host portion before calculating the MD5. Without leveraging HTTP 1.1 keep-alive support, though, I don't see why that's a win (rather, it seems to cause more problems). I guess I need to dig into this more.
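
To make that concrete, here's a minimal sketch (plain Java, not the actual Nutch source; the class and parameter names are mine) of the host-based partitioning I'm describing: the host portion of each URL is hashed with MD5, and that hash picks which fetch list the URL goes into, so every URL from the same host lands in the same list.

import java.net.URL;
import java.security.MessageDigest;

public class HostPartitioner {

    // Number of fetch lists to emit (hypothetical parameter).
    private final int numLists;

    public HostPartitioner(int numLists) {
        this.numLists = numLists;
    }

    // Returns the index of the fetch list this URL should go into.
    public int partition(String urlString) throws Exception {
        // Hash the host only, not the full link.
        String host = new URL(urlString).getHost();
        byte[] md5 = MessageDigest.getInstance("MD5").digest(host.getBytes("UTF-8"));
        // Fold the first four bytes of the digest into an int and reduce modulo numLists.
        int hash = ((md5[0] & 0xff) << 24) | ((md5[1] & 0xff) << 16)
                 | ((md5[2] & 0xff) << 8)  |  (md5[3] & 0xff);
        return Math.abs(hash % numLists);
    }

    public static void main(String[] args) throws Exception {
        HostPartitioner p = new HostPartitioner(4);
        // Both URLs share a host, so they map to the same list index.
        System.out.println(p.partition("http://example.com/a.html"));
        System.out.println(p.partition("http://example.com/b/c.html"));
    }
}

The upshot is that a focused crawl over a few hosts ends up with all of its URLs concentrated together, where the max threads/host limit leaves most fetcher threads idle.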

-- Ken


On 9/28/05, AJ Chen <[EMAIL PROTECTED]> wrote:

 I started the crawler with about 2000 sites. The fetcher could achieve
 7 pages/sec initially, but the performance gradually dropped to about 2
 pages/sec, sometimes even 0.5 pages/sec. The fetch list had 300k pages
 and I used 500 threads. What are the main causes of this slowdown?
 Below is a sample of the status output:

 050927 005952 status: segment 20050927005922, 100 pages, 3 errors, 1784615 bytes, 14611 ms
 050927 005952 status: 6.8441586 pages/s, 954.2334 kb/s, 17846.15 bytes/page
 050927 010005 status: segment 20050927005922, 200 pages, 9 errors, 3656863 bytes, 28170 ms
 050927 010005 status: 7.0997515 pages/s, 1014.1726 kb/s, 18284.314 bytes/page

 after some time ...
 050927 171818 status: segment 20050927070752, 101400 pages, 7201 errors, 2593026554 bytes, 36216316 ms
 050927 171818 status: 2.799843 pages/s, 559.3617 kb/s, 25572.254 bytes/page
 050927 171832 status: segment 20050927070752, 101500 pages, 7204 errors, 2595591632 bytes, 36230516 ms
 050927 171832 status: 2.8015058 pages/s, 559.6956 kb/s, 25572.332 bytes/page

 Thanks,
 AJ

--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
