On 10/03/2018 5:13 PM, Mike Dewhirst wrote:
I've run the process a couple of times and there doesn't seem to be an
appreciable difference. Both methods take enough time to boil the
kettle. I know that isn't proper testing. It might be difficult to
test timing accurately when we are waiting on websites all over the
world to respond. I might set up a long-running test to try to smooth
out the differences.
Mmmmmmmm.
parallel.py:547: UserWarning: Multiprocessing-backed parallel loops
cannot be nested below threads, setting n_jobs=1
**self._backend_args)
I think I'll go back to sequential scraping and slap myself on the wrist
for premature optimisation.
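If I do come back to it, the warning suggests the Parallel call is running
inside a thread (the web server's request thread, I suspect), which is why
joblib dropped back to n_jobs=1. One possible workaround, assuming a joblib
new enough to support prefer="threads" (0.12+), would be the thread-based
backend, which should suit IO-bound scraping anyway:

def make_links(self):
    # Thread-based workers: no pickling, and no clash with the surrounding
    # server thread. Fine for IO-bound work because each worker mostly waits
    # on the network rather than the CPU.
    Parallel(n_jobs=-2, prefer="threads")(
        delayed(scrape_db)(self, create_useful_link(self, Link, db), db)
        for db in databases
    )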
M
On 10/03/2018 5:04 PM, Mike Dewhirst wrote:
On 9/03/2018 7:30 PM, Alejandro Dubrovsky wrote:
delayed is a decorator, so it takes a function or a method. You are
passing it a generator instead.
def make_links(self):
    Parallel(n_jobs=-2)(
        delayed(scrape_db)(self, create_useful_link(self, Link, db), db)
        for db in databases
    )
should work,
Yes it does :) Thank you Alejandro
but it will only parallelise over the scrape_db calls, not the
create_useful_link calls I think. Which of the two do you want to
parallelise over? Or were you after parallelising both?
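If you were after both, a rough sketch would be to wrap the two calls in a
small module-level helper and delay that (make_link_and_scrape is a made-up
name, and the default process-based backend has to be able to pickle the
arguments):

def make_link_and_scrape(substance, db):
    # Both the link creation and the scrape happen inside the worker, and
    # the "if link:" guard from the sequential version is kept.
    link = create_useful_link(substance, Link, db)
    if link:
        scrape_db(substance, link, db)

def make_links(self):
    Parallel(n_jobs=-2)(
        delayed(make_link_and_scrape)(self, db) for db in databases
    )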
I think I probably want to use Celery (thanks Ed for the suggestion)
or similar so I can loop through (currently) nine databases and kick
off a scrape_db() task for each. Then each scrape_db task looks for
(currently) ten data items of specific interest. Having scraped a
data item, we need to get_or_create (this is in Django) the specific
data note and append the result to whatever is already there.
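Something along these lines, perhaps (very much a sketch - it assumes Celery
is already wired into the Django project, scrape_db_task is a made-up name,
and the task takes the substance pk because task arguments have to be
serialisable):

from celery import shared_task

@shared_task
def scrape_db_task(substance_pk, db):
    # Re-fetch the substance inside the worker, then do the same two steps
    # as the sequential version.
    substance = Substance.objects.get(pk=substance_pk)
    link = create_useful_link(substance, Link, db)
    if link:
        scrape_db(substance, link, db)

def make_links(self):
    for db in databases:
        scrape_db_task.delay(self.pk, db)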
That data note update might be a bottleneck with more than one
scrape_db task in parallel retrieving the same data item; say aqueous
solubility. We want aqueous solubility from all databases in the same
note so the user can easily compare different values and decide which
value to use.
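One way to stop concurrent tasks trampling the same note might be to lock
the note row while appending (a sketch only - SubstanceNote and its fields
are made-up names):

from django.db import transaction

def append_to_note(substance, item, db_name, value):
    # select_for_update() inside atomic() locks the existing note row, so
    # parallel scrape_db tasks appending the same data item queue up rather
    # than overwriting each other.
    with transaction.atomic():
        note, _ = SubstanceNote.objects.select_for_update().get_or_create(
            substance=substance, item=item, defaults={"text": ""}
        )
        note.text += "\n{}: {}".format(db_name, value)
        note.save()

A unique constraint on (substance, item) would still be needed to catch the
race when two tasks try to create the same note for the first time.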
So parallelising everything might eventually be somewhat problematic.
It all has to squeeze through Postgres atomic transactions right at
the end. I suppose this is a perfect example of an IO bound task.
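Given that, plain threads from the standard library might be enough, without
joblib or Celery - a rough sketch reusing the same two calls:

from concurrent.futures import ThreadPoolExecutor

def make_links(self):
    def worker(db):
        link = create_useful_link(self, Link, db)
        if link:
            scrape_db(self, link, db)

    # list() forces the map so any exceptions raised in the workers surface
    # here instead of being silently dropped.
    with ThreadPoolExecutor(max_workers=len(databases)) as pool:
        list(pool.map(worker, databases))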
Another consideration is that the app is (currently) all server side.
I'm not (yet) using AJAX to update the screen when the data becomes
available.
Cheers
Mike
On 09/03/18 18:41, Mike Dewhirst wrote:
https://media.readthedocs.org/pdf/joblib/latest/joblib.pdf
I'm trying to make the following code run in parallel on separate
CPU cores but haven't had any success.
def make_links(self):
    for db in databases:
        link = create_useful_link(self, Link, db)
        if link:
            scrape_db(self, link, db)
This is a web scraper which is working nicely in a leisurely
sequential manner. databases is a list of URLs with gaps to be
filled in by create_useful_link(), which makes a link record from the
Link class. The self instance is a source of attributes for filling
the URL gaps: self is a chemical substance, and when the link record's
url field is clicked in a browser it brings up that external website
with that chemical substance already selected for the viewer to research.
If successful, we then fetch the external page and scrape a bunch
of interesting data from it and turn that into substance notes.
scrape_db() doesn't return anything but it does create up to nine
other records.
from joblib import Parallel, delayed

class Substance( etc ..
    ...
    def make_links(self):
        #Parallel(n_jobs=-2)(delayed(
        #    scrape_db(self, create_useful_link(self, Link, db), db) for db in databases
        #))
I'm getting a TypeError from Parallel delayed() - can't pickle
generator objects
So my question is how to write the commented code properly? I
suspect I haven't done enough comprehension.
Thanks for any help
Mike
_______________________________________________
melbourne-pug mailing list
melbourne-pug@python.org
https://mail.python.org/mailman/listinfo/melbourne-pug