>>>>> Aahz <[email protected]> (A) wrote:

>A> On Fri, Jun 12, 2009, Jeremy Hylton wrote:
>>>
>>> I'm not sure I understand how to distinguish between I/O bound threads
>>> and CPU bound threads.  If you've got a relatively simple
>>> multi-threaded application like an HTTP fetcher with a thread pool
>>> fetching a lot of urls, you're probably going to end up having more
>>> than one thread with input to process at any instant.  There's a ton
>>> of Python code that executes when that happens.  You've got a urllib
>>> addinfourl wrapper, a httplib HTTPResponse (with read & _safe_read)
>>> and a socket _fileobject.  Heaven help you if you are using readline.
>>> So I could imagine even this trivial I/O bound program having lots of
>>> CPU contention.
>A> You could imagine, but have you tested it?  ;-)  Back in the 1.5.2
>A> days, I helped write a web crawler where the sweet spot was around
>A> twenty or thirty threads.

That clearly indicates a significant I/O bottleneck.

I have written a small script to test this. It fires up a number of
threads (or runs unthreaded) that each fetch a couple of web pages
(random Google searches, to be precise). It then measures some things,
like the CPU percentage (using the psutil module, but you could also do
it with the ps command, of course). You can also choose to do some CPU
processing on the pages, such as HTML parsing or hash calculation, and
to write something to a file.

I noticed some 5-15% CPU utilisation on my 2-core MacBook when at home
on a 4 Mb/s ADSL line, so the script is apparently I/O bound. I guess
the CPU load may be a bit higher on the high-speed university network;
I'll test that tomorrow at work.

And with respect to readline: I don't think there are problems with it
in newer Python versions. My program has an option to use readline
instead of read, and I see no significant differences.
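The measurement itself is simple: total CPU time accumulated by the
process, divided by wall-clock time. Here is the idea in isolation, as
a minimal sketch using the same psutil calls as the full program below
(fetch_pages is a hypothetical stand-in for the actual work):

import os
import time
import psutil

process = psutil.Process(os.getpid())
start = time.time()

fetch_pages()  # hypothetical stand-in for the fetching/parsing under test

user, system = process.get_cpu_times()
elapsed = time.time() - start
# An I/O-bound program spends most of its wall-clock time waiting on
# sockets, so the accumulated CPU time stays a small fraction of it.
print "CPU utilisation: %.2f %%" % (100.0 * (user + system) / elapsed)

Anyway, here is the program.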
#!/usr/bin/env python
# Author: Piet van Oostrum <[email protected]>
# This software is free (no rights reserved).
"""
This program tries to test the speed of fetching web pages and doing
some processing on them in a multithreaded environment. The main
purpose is to see how much CPU time it uses so that we might draw some
conclusions about the effectiveness of using threads in Python.
Normally O.S. threads should help to get greater throughput, but
Python's GIL may hinder this.

The web pages will be the results of some Google searches.

You call this program with the following command line args:
- number of pages to be fetched
- number of threads to be used.
  0 means do everything in main thread
  > 0 means start that many threads
- flags: r = use readline instead of read
         h   calculate SHA1 and MD5 hashes of the pages
         p   do some HTML parsing on the pages
         w   write some information to logfile (length and/or
             calculated hash)
"""

import sys
import os
from random import random
import urllib2
import hashlib
import psutil

process = psutil.Process(os.getpid())

import time
start_time = time.time()

def usage(help):
    progname = sys.argv[0]
    if help:
        print __doc__
    else:
        print >> sys.stderr, """Usage: %s npages nthreads flags
For more help: %s help
""" % (progname, progname)
    sys.exit(1)

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.ntags = 0
        self.depth = 0
        self.maxdepth = 0

    def handle_starttag(self, tag, attrs):
        self.ntags += 1
        self.depth += 1
        if self.depth > self.maxdepth:
            self.maxdepth = self.depth

    def handle_endtag(self, tag):
        self.depth -= 1

class DummyLock(object):
    '''Dummy Lock class only used as context handler (therefore no
    acquire and release necessary)
    '''
    def __enter__(self):
        pass
    def __exit__(self, et, ev, tb):
        pass

# get some search terms
words = """acutely alarmclock anaesthesia antitypical arteries
autochthones bargain bestowal blondes brazen butterfingers buttermilk
captions cedarwood cherries circumference codification compliments
contagious cotangent crucified daiquiri defence deplete diagrams
discontinue dixieland ducts elastomers endodontist epistemic evaporator
extravert fertilizer flicker fortuitous futurology geometry godzilla
grovel handwriter hemlock hologram hydrologic ikebana incite ingrowth
internally islamization jungle kurdish leftmost lipstick lymphocyte
manufactory melancholia nests nonharmonic obscene opus overabundant
pagesize partaker percolator philosophy pirouette policy preacher
primogenital protuberance pyrite rangers reconvert reindeer reroute
rhapsody rudeness saturday scurry servant sidewalk slurry soul sprawl
still subentry supersede temper thorny tortilla trichome twine
undercover unload unwed velcro vocation wheel wrong zoologic""".split()

nwords = len(words)
google = "http://www.google.nl/search?q="
logfile = "testthreads.log"
BUFSIZE = 1024

try:
    if sys.argv[1].strip().lower() == 'help':
        usage(True)
    npages = int(sys.argv[1])
    nthreads = int(sys.argv[2])
    if len(sys.argv) < 4:
        flags = ''
    else:
        flags = sys.argv[3]
except (ValueError, IndexError):
    usage(False)

use_readline = 'r' in flags
do_hash = 'h' in flags
do_parse = 'p' in flags
do_write = 'w' in flags

user_agent = ("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_4_11; en) "
              "AppleWebKit/525.28.3 (KHTML, like Gecko)")
headers = {'User-Agent': user_agent}

def doit(np, lock):
    '''Fetch np web pages. lock will be used for exclusive access to
    the log file. Global variables do_hash and do_write will determine
    the behaviour.
    '''
    for i in range(np):
        # build a random three-word Google query
        url = google + "+".join(words[int(nwords * random())]
                                for w in range(3))
        req = urllib2.Request(url, None, headers)
        doc = urllib2.urlopen(req)
        docsize = 0
        if do_hash:
            h1 = hashlib.sha1()
            h2 = hashlib.md5()
        if do_parse:
            parser = MyHTMLParser()
        while True:
            if use_readline:
                data = doc.readline()
            else:
                data = doc.read(BUFSIZE)
            if not data:
                break
            docsize += len(data)
            if do_hash:
                h1.update(data)
                h2.update(data)
            if do_parse:
                parser.feed(data)
        doc.close()
        if do_parse:
            parser.close()
        if do_write:
            with lock:
                log = open(logfile, 'a')
                print >>log, "URL: %s, size: %d" % (url, docsize)
                if do_hash:
                    print >>log, "sha1:", h1.hexdigest()
                    print >>log, "md5:", h2.hexdigest()
                if do_parse:
                    print >>log, "Read %d tags, max depth: %d" % \
                          (parser.ntags, parser.maxdepth)
                log.close()

def start_thread(np, lock):
    '''Start a new thread fetching np pages, using lock for exclusive
    access to the logfile. The thread is put in the running_threads
    list.
    '''
    thr = threading.Thread(target=doit, args=(np, lock))
    thr.start()
    running_threads.append(thr)

running_threads = []
lock = DummyLock()

if nthreads == 0:
    doit(npages, lock)
else:
    import threading
    # divide the pages over the threads; the first thread takes the rest
    np = npages // nthreads
    np1 = npages - np * (nthreads - 1)
    if do_write:
        lock = threading.Lock()
    start_thread(np1, lock)
    for i in range(1, nthreads):
        start_thread(np, lock)
    # Wait for all threads to finish
    for thr in running_threads:
        thr.join()

# get_cpu_times() returns (user, system), so label them in that order
print "CPU time (user): %.2f, (system): %.2f secs." % process.get_cpu_times()
print "Elapsed time: %.2f secs." % (time.time() - start_time)
print "CPU utilisation: %.2f %%" % process.get_cpu_percent()
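Assuming the script is saved as testthreads.py (that name is my choice
here; the script itself only fixes the log file name), a typical run
fetching 20 pages over 5 threads with hashing and logging enabled would
be:

    python testthreads.py 20 5 hw

Running the same work with 0 for nthreads does everything unthreaded in
the main thread, which gives an easy baseline for comparing the CPU
time, elapsed time, and CPU utilisation printed at the end.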
--
Piet van Oostrum <[email protected]>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: [email protected]
