Christopher Reimer via Python-list wrote:

> On 8/27/2017 11:54 AM, Peter Otten wrote:
>
>> The documentation
>>
>> https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup
>>
>> says you can make the BeautifulSoup object from a string or file.
>> Can you give a few more details where the queue comes into play? A small
>> code sample would be ideal.
>
> A worker thread uses a request object to get the page and puts it into
> the queue as page.content (HTML). Another worker thread gets the
> page.content from the queue, applies BeautifulSoup, and nothing happens.
>
>     soup = BeautifulSoup(page_content, 'lxml')
>     print(soup)
>
> No output whatsoever. If I remove 'lxml', I get the UserWarning that no
> parser was explicitly specified, along with a reference to threading.py
> at line 80.
>
> I verified that the page.content that goes into and out of the queue is
> the same page.content that goes into and out of a list.
>
> I read somewhere that BeautifulSoup may not be thread-safe. I've never
> had a problem with threads storing the output into a queue. Using a
> queue (random order) instead of a list (sequential order) to feed pages
> for the input is making it wonky.
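For reference, a minimal, self-contained sketch of the pipeline described above (the names and the stubbed fetch are hypothetical; the real script would put requests.get(url).content on the queue, and the OP uses the "lxml" parser where this sketch uses the built-in "html.parser" to avoid the extra dependency):

    import threading
    from queue import Queue

    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    pages = Queue()
    titles = []

    def fetch():
        # Stand-in for: pages.put(requests.get(url).content)
        # page.content is bytes, which BeautifulSoup accepts directly.
        pages.put(b"<html><head><title>stub</title></head></html>")

    def parse():
        page_content = pages.get()
        soup = BeautifulSoup(page_content, "html.parser")
        titles.append(soup.title.text)

    fetcher = threading.Thread(target=fetch)
    parser = threading.Thread(target=parse)
    fetcher.start()
    fetcher.join()
    parser.start()
    parser.join()

Run this way, the parse step works, which suggests the problem is elsewhere (e.g. in how the queue is drained or which thread the parsing runs on) rather than in BeautifulSoup's handling of queued bytes.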
Here's a simple example that extracts titles from generated HTML. It seems
to work. Does it resemble what you do?

import csv
import threading
import time
from queue import Queue

import bs4


def process_html(source, dest, index):
    while True:
        html = source.get()
        if html is DONE:
            dest.put(DONE)
            break
        soup = bs4.BeautifulSoup(html, "lxml")
        dest.put(soup.find("title").text)


def write_csv(source, filename, to_go):
    with open(filename, "w") as f:
        writer = csv.writer(f)
        while True:
            title = source.get()
            if title is DONE:
                to_go -= 1
                if not to_go:
                    return
            else:
                writer.writerow([title])


NUM_SOUP_THREADS = 10
DONE = object()

web_to_soup = Queue()
soup_to_file = Queue()

soup_threads = [
    threading.Thread(target=process_html, args=(web_to_soup, soup_to_file, i))
    for i in range(NUM_SOUP_THREADS)
]
write_thread = threading.Thread(
    target=write_csv,
    args=(soup_to_file, "tmp.csv", NUM_SOUP_THREADS),
)
write_thread.start()
for thread in soup_threads:
    thread.start()

for i in range(100):
    web_to_soup.put("<html><head><title>#{}</title></head></html>".format(i))
for i in range(NUM_SOUP_THREADS):
    web_to_soup.put(DONE)

for t in soup_threads:
    t.join()
write_thread.join()

-- 
https://mail.python.org/mailman/listinfo/python-list