On 5/20/2017 1:19 AM, dieter wrote:
If your (590) pages are linked together (such that you must fetch a page to get the following one) and page fetching is the limiting factor, then this would limit the parallelizability.
The pages are not linked together. The URL requires a page number. If I request 1000 pages in sequence, the first 60% have comments and the remaining 40% have the sentinel text. As more comments are added, the dividing line between the last page with the oldest comments and the first page with the sentinel text shifts over time. After I changed the code to fetch 16 pages at a time, the run time dropped by nine minutes.
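Here is a stripped-down sketch of what the 16-at-a-time fetch looks like; the URL template, the sentinel string, and fetch_page() are placeholders rather than the real site details:

# Rough sketch of fetching 16 pages per batch with a thread pool.
# URL_TEMPLATE and SENTINEL are placeholders, not the actual site.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URL_TEMPLATE = "http://example.com/comments?page={}"   # placeholder
SENTINEL = "No more comments"                           # placeholder
BATCH_SIZE = 16

def fetch_page(page_number):
    """Fetch one page and return its text."""
    with urlopen(URL_TEMPLATE.format(page_number)) as response:
        return response.read().decode("utf-8")

def fetch_all_pages():
    """Yield page text in order, BATCH_SIZE requests in flight at once,
    stopping as soon as a page contains the sentinel text."""
    page = 1
    with ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
        while True:
            numbers = range(page, page + BATCH_SIZE)
            for text in pool.map(fetch_page, numbers):
                if SENTINEL in text:
                    return
                yield text
            page += BATCH_SIZE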
If processing a selected page takes a significant amount of time (compared to fetching it), then you could use a work queue as follows: a page is fetched and the following page is determined; if a following page is found, processing of the fetched page is put into the work queue as a job and the walk continues with the next page. Free tasks look for jobs in the work queue and process them.
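If I understand the suggestion, a bare-bones version of that work queue would look something like this; fetch_page() and parse_page() below are toy placeholders, not my actual code:

# Sketch of the work-queue idea: one producer walks the pages in order
# and queues each one for processing; worker threads drain the queue.
import queue
import threading

NUM_WORKERS = 4
DONE = object()   # marker telling the workers to shut down

def fetch_page(page_number):
    # Placeholder: the real version would fetch the page over HTTP.
    return "page %d body" % page_number if page_number <= 590 else "SENTINEL"

def parse_page(text):
    # Placeholder: the real version would extract the comments.
    print("parsed", len(text), "characters")

def producer(work_queue):
    """Fetch pages in order; stop at the sentinel, then tell workers to quit."""
    page = 1
    while True:
        text = fetch_page(page)
        if "SENTINEL" in text:        # no following page to process
            break
        work_queue.put(text)          # hand this page off as a job
        page += 1
    for _ in range(NUM_WORKERS):
        work_queue.put(DONE)

def worker(work_queue):
    """Pull jobs off the queue and process them until told to stop."""
    while True:
        job = work_queue.get()
        if job is DONE:
            break
        parse_page(job)

work_queue = queue.Queue()
threads = [threading.Thread(target=worker, args=(work_queue,))
           for _ in range(NUM_WORKERS)]
for thread in threads:
    thread.start()
producer(work_queue)
for thread in threads:
    thread.join()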
I'm looking into that now. The requester class yields one page at a time. If I changed the code to yield a list of 16 pages, I could parse 16 pages at a time. That change would require a bit more work, but it would fix some problems that have been nagging me for a while in the parser class.
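Something along these lines is what I have in mind; the fetch_page() function and sentinel string below are toy stand-ins for the real request:

# Sketch of the requester yielding lists of 16 pages per iteration
# instead of one page at a time.
BATCH_SIZE = 16

def fetch_page(page_number):
    # Placeholder for the real HTTP request.
    return "page %d body" % page_number if page_number <= 590 else "no more comments"

def pages_in_batches():
    """Yield lists of up to BATCH_SIZE pages, stopping at the sentinel."""
    page = 1
    while True:
        batch = []
        for number in range(page, page + BATCH_SIZE):
            text = fetch_page(number)
            if "no more comments" in text:
                break
            batch.append(text)
        if batch:
            yield batch                  # the parser gets up to 16 pages at once
        if len(batch) < BATCH_SIZE:      # hit the sentinel; stop iterating
            return
        page += BATCH_SIZE

for batch in pages_in_batches():
    print(len(batch), "pages ready to parse")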
Thank you,

Chris Reimer
--
https://mail.python.org/mailman/listinfo/python-list