On 5/20/2017 1:19 AM, dieter wrote:
If your (590) pages are linked together (such that you must fetch a page to get the following one) and page fetching is the limiting factor, then this would limit the parallelizability.
The pages are not linked together. The URL requires a page number. If I request 1000 pages in sequence, the first 60% have comments and the remaining 40% have the sentinel text. As more comments are added, the dividing line between the last page with the oldest comments and the first page with the sentinel text shifts over time. After I changed the code to fetch 16 pages at a time, the run time dropped by nine minutes.
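Here is a stripped-down sketch of what the 16-at-a-time fetch looks like; the URL template, the sentinel string, and fetch_page() are placeholders rather than the real site details:

# Rough sketch of fetching 16 pages per batch with a thread pool.
# URL_TEMPLATE and SENTINEL are placeholders, not the actual site.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URL_TEMPLATE = "http://example.com/comments?page={}"   # placeholder
SENTINEL = "No more comments"                           # placeholder
BATCH_SIZE = 16

def fetch_page(page_number):
    """Fetch one page and return its text."""
    with urlopen(URL_TEMPLATE.format(page_number)) as response:
        return response.read().decode("utf-8")

def fetch_all_pages():
    """Yield page text in order, BATCH_SIZE requests in flight at once,
    stopping as soon as a page contains the sentinel text."""
    page = 1
    with ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
        while True:
            numbers = range(page, page + BATCH_SIZE)
            for text in pool.map(fetch_page, numbers):
                if SENTINEL in text:
                    return
                yield text
            page += BATCH_SIZE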
If processing a selected page takes a significant amount of time (compared to fetching it), then you could use a work queue as follows: a page is fetched and the following page is determined; if a following page is found, processing of the fetched page is put into the work queue as a job and the walk continues with the next page. Free tasks look for jobs in the work queue and process them.
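If I understand the suggestion, a bare-bones version of that work queue would look something like this; fetch_page() and parse_page() below are toy placeholders, not my actual code:

# Sketch of the work-queue idea: one producer walks the pages in order
# and queues each one for processing; worker threads drain the queue.
import queue
import threading

NUM_WORKERS = 4
DONE = object()   # marker telling the workers to shut down

def fetch_page(page_number):
    # Placeholder: the real version would fetch the page over HTTP.
    return "page %d body" % page_number if page_number <= 590 else "SENTINEL"

def parse_page(text):
    # Placeholder: the real version would extract the comments.
    print("parsed", len(text), "characters")

def producer(work_queue):
    """Fetch pages in order; stop at the sentinel, then tell workers to quit."""
    page = 1
    while True:
        text = fetch_page(page)
        if "SENTINEL" in text:        # no following page to process
            break
        work_queue.put(text)          # hand this page off as a job
        page += 1
    for _ in range(NUM_WORKERS):
        work_queue.put(DONE)

def worker(work_queue):
    """Pull jobs off the queue and process them until told to stop."""
    while True:
        job = work_queue.get()
        if job is DONE:
            break
        parse_page(job)

work_queue = queue.Queue()
threads = [threading.Thread(target=worker, args=(work_queue,))
           for _ in range(NUM_WORKERS)]
for thread in threads:
    thread.start()
producer(work_queue)
for thread in threads:
    thread.join()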
I'm looking into that now. The requester class yields one page at a time. If I changed the code to yield a list of 16 pages, I could parse 16 pages at a time. That change would require a bit more work, but it would fix some problems that have been nagging me for a while in the parser class.
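Something along these lines is what I have in mind; the fetch_page() function and sentinel string below are toy stand-ins for the real request:

# Sketch of the requester yielding lists of 16 pages per iteration
# instead of one page at a time.
BATCH_SIZE = 16

def fetch_page(page_number):
    # Placeholder for the real HTTP request.
    return "page %d body" % page_number if page_number <= 590 else "no more comments"

def pages_in_batches():
    """Yield lists of up to BATCH_SIZE pages, stopping at the sentinel."""
    page = 1
    while True:
        batch = []
        for number in range(page, page + BATCH_SIZE):
            text = fetch_page(number)
            if "no more comments" in text:
                break
            batch.append(text)
        if batch:
            yield batch                  # the parser gets up to 16 pages at once
        if len(batch) < BATCH_SIZE:      # hit the sentinel; stop iterating
            return
        page += BATCH_SIZE

for batch in pages_in_batches():
    print(len(batch), "pages ready to parse")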
Thank you,

Chris Reimer
--
https://mail.python.org/mailman/listinfo/python-list