Hi. First off, I'm not using anything from Twisted; I just liked the subject line :)
The folks on this list have been most helpful before, and I'm hoping you'll take pity on the dazed and confused. I've read stuff on this group, and various websites and books, until my head is spinning... Here is a brief summary of what I'm trying to do, with an example below.

I have the code below in a single-threaded version and use it to test a list of roughly 6000 URLs to ensure that they "work". If they fail, I track the kinds of failures and then generate a report. Currently it takes about 7-9 hours to run through the entire list. I basically create a list from a file containing the URLs and then iterate over it, checking each page as I go. I get all sorts of flack because it takes so long, so I thought I could speed things up by using a Queue and X number of threads. Easier said than done.

However, in my test below I can't even get the if statement in run() to catch a single error, and I'm stumped as to why. Any help would be greatly appreciated, and, if you're so inclined, pointers on how to limit the threads to a given number. Thank you in advance! I really do appreciate it.

Here is what I have so far... Yes, there are some things left unused from previous tests. Oh, and to give proper credit, this is based on some code from http://starship.python.net/crew/aahz/OSCON2000/SCRIPT2.HTM

import threading, Queue
from time import sleep, time
import urllib2
import formatter
import string

#toscan = Queue.Queue
#scanned = Queue.Queue
#workQueue = Queue.Queue()

MAX_THREADS = 10
timeout = 90      # sets timeout for urllib2.urlopen()
failedlinks = []  # list for failed urls
zeromatch = []    # list for 0-result searches
t = 0             # used to store starting time for getting a page
pagetime = 0      # time it took to load page
slowestpage = 0   # slowest page time
fastestpage = 10  # fastest page time
cumulative = 0    # total time to load all pages (used to calc. avg)

ST_zeroMatch = 'You found 0 products'
ST_zeroMatch2 = 'There are no products matching your selection'

class Retriever(threading.Thread):
    def __init__(self, URL):
        self.done = 0
        self.URL = URL
        self.urlObj = ''
        self.ST_zeroMatch = ST_zeroMatch
        print '__init__:self.URL', self.URL
        threading.Thread.__init__(self)

    def run(self):
        print 'In run()'
        print "Retrieving:", self.URL
        #self.page = urllib.urlopen(self.URL)
        #self.body = self.page.read()
        #self.page.close()
        self.t = time()
        self.urlObj = urllib2.urlopen(self.URL)
        self.pagetime = time() - t
        self.webpg = self.urlObj.read()
        print 'Retriever.run: before if'
        print 'matching', self.ST_zeroMatch
        print ST_zeroMatch
        # why does this always drop through even though the if should be true?
        if (ST_zeroMatch or ST_zeroMatch2) in self.webpg:
            # I don't think I want to use self.zeromatch, do I?
            print '** Found zeromatch'
            zeromatch.append(url)
        #self.parse()
        print 'Retriever.run: past if'
        print 'exiting run()'
        self.done = 1

# the last 2 Shop.Com URLs should trigger the zeromatch condition
sites = ['http://www.foo.com/',
         'http://www.shop.com',
         'http://www.shop.com/op/aprod-~zzsome+thing',
         'http://www.shop.com/op/aprod-~xyzzy'
         #'http://www.yahoo.com/ThisPageDoesntExist'
        ]

threadList = []
URLs = []
workQueue = Queue.Queue()

for item in sites:
    workQueue.put(item)
print workQueue
print
print 'b4 test in sites'

for test in sites:
    retriever = Retriever(test)
    retriever.start()
    threadList.append(retriever)

print 'threadList:'
print threadList
print 'past for test in sites:'

while threading.activeCount() > 1:
    print 'Zzz...'
    sleep(1)

print 'entering retriever for loop'
for retriever in threadList:
    #URLs.extend(retriever.run())
    retriever.run()

print 'zeromatch:', zeromatch

Even though there are two URLs that should end up in zeromatch, nothing ever gets appended to the list.
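One thing I suspect, but haven't confirmed: since a non-empty string is truthy in Python, (ST_zeroMatch or ST_zeroMatch2) evaluates to just ST_zeroMatch, so the condition only ever looks for the first marker string and ST_zeroMatch2 is never checked at all. Also, url isn't defined inside run() (the attribute is self.URL), so the branch would raise a NameError even when it did fire, and time() - t uses the global t instead of self.t. Is the block below what I should be doing instead? (Same names as in the script above.)

        # check each marker string separately; or-ing them together first
        # only ever tests ST_zeroMatch, because a non-empty string is truthy
        if ST_zeroMatch in self.webpg or ST_zeroMatch2 in self.webpg:
            print '** Found zeromatch'
            zeromatch.append(self.URL)  # self.URL, not the undefined name url

I'm also guessing the final for retriever in threadList: retriever.run() loop is wrong, since run() already executed in each thread via start(); as written it looks like it re-fetches every page a second time, sequentially, in the main thread, so dropping it (or replacing it with retriever.join() calls) seems safer.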
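As for capping the number of threads: am I right that the usual pattern is a fixed pool of MAX_THREADS workers that each pull URLs off workQueue until it's empty, rather than one thread per URL? Here's a minimal sketch of what I have in mind; the worker function is my own invention, and it assumes the sites, zeromatch, failedlinks, MAX_THREADS, and marker-string names from the script above:

import threading, Queue
import urllib2

def worker():
    # each worker keeps pulling URLs until the queue is empty, so at
    # most MAX_THREADS pages are being fetched at any one time
    while True:
        try:
            url = workQueue.get_nowait()
        except Queue.Empty:
            return  # no work left; let the thread exit
        try:
            webpg = urllib2.urlopen(url).read()
        except (urllib2.URLError, IOError):
            failedlinks.append(url)
            continue
        if ST_zeroMatch in webpg or ST_zeroMatch2 in webpg:
            zeromatch.append(url)

workQueue = Queue.Queue()
for item in sites:
    workQueue.put(item)

threadList = []
for i in range(MAX_THREADS):
    thrd = threading.Thread(target=worker)
    thrd.start()
    threadList.append(thrd)

for thrd in threadList:
    thrd.join()  # wait for the pool to drain the queue instead of polling activeCount()

print 'zeromatch:', zeromatch
print 'failedlinks:', failedlinks

One wrinkle I'm aware of: urllib2.urlopen() only grew a timeout argument in Python 2.6, so on older versions I assume I'd call socket.setdefaulttimeout(timeout) once at startup to get the 90-second limit. If the pool approach is right, the 6000-URL run ought to take very roughly a tenth of the current wall-clock time with 10 workers, since almost all of the time is network wait.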