On 10/03/2018 5:13 PM, Mike Dewhirst wrote:
I've run the process a couple of times and there doesn't seem to be an appreciable difference. Both methods take enough time to boil the kettle. I know that isn't proper testing. It might be difficult to test timing accurately when we are waiting on websites all over the world to respond. I might set up a long-running test to try to smooth out the differences.
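Something like this might do for a rough comparison (just a sketch, not a proper benchmark; "substance" stands in for an existing Substance instance loaded in a Django shell):

    import time

    # Rough timing sketch (names assumed): call make_links() a few times on an
    # existing Substance instance and average the wall-clock time.
    runs = 3
    start = time.perf_counter()
    for _ in range(runs):
        substance.make_links()
    elapsed = time.perf_counter() - start
    print(f"average over {runs} runs: {elapsed / runs:.1f} s")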

Mmmmmmmm.

parallel.py:547: UserWarning: Multiprocessing-backed parallel loops cannot be nested below threads, setting n_jobs=1
  **self._backend_args)

I think I'll go back to sequential scraping and slap myself on the wrist for premature optimisation.
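That said, since the scraping is IO-bound, joblib's threading backend avoids the multiprocessing-below-threads restriction the warning refers to. A rough, untested sketch of the same call with that backend (the helper names are the ones from the code further down):

    from joblib import Parallel, delayed

    def make_links(self):
        # Threads instead of processes: fine for IO-bound scraping, and not
        # subject to the "cannot be nested below threads" fallback above.
        Parallel(n_jobs=-2, backend="threading")(
            delayed(scrape_db)(self, create_useful_link(self, Link, db), db)
            for db in databases
        )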


M

On 10/03/2018 5:04 PM, Mike Dewhirst wrote:
On 9/03/2018 7:30 PM, Alejandro Dubrovsky wrote:
delayed is a decorator, so it takes a function or a method. You are passing it a generator instead.

def make_links(self):
    Parallel(n_jobs=-2)(
        delayed(scrape_db)(self, create_useful_link(self, Link, db), db)
        for db in databases
    )

should work,

Yes it does :) Thank you Alejandro

but it will only parallelise over the scrape_db calls, not the create_useful_link calls I think. Which of the two do you want to parallelise over? Or were you after parallelising both?
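If you did want both, one option (just a sketch, with a small wrapper function I've made up; Link, databases and the two helpers are assumed to be in scope as in your code) is to push both calls into the worker:

    from joblib import Parallel, delayed

    def make_link_and_scrape(substance, db):
        # Hypothetical wrapper: each worker creates the link and, if that
        # succeeds, scrapes the database, so both steps run in parallel.
        link = create_useful_link(substance, Link, db)
        if link:
            scrape_db(substance, link, db)

    def make_links(self):
        Parallel(n_jobs=-2)(
            delayed(make_link_and_scrape)(self, db) for db in databases
        )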

I think I probably want to use Celery (thanks, Ed, for the suggestion) or something similar so I can loop through the (currently) nine databases and kick off a scrape_db() task for each. Each scrape_db task then looks for (currently) ten data items of specific interest. Having scraped a data item, we need to get_or_create (this is in Django) the specific data note and append the result to whatever is already there.
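Roughly what I have in mind with Celery (a sketch only; the task name and the substance_id argument are placeholders, and the app's own helpers are assumed to be importable):

    from celery import shared_task

    @shared_task
    def scrape_db_task(substance_id, db):
        # Hypothetical task: re-fetch the substance inside the worker, build
        # the link and scrape it, one task per database.
        substance = Substance.objects.get(pk=substance_id)
        link = create_useful_link(substance, Link, db)
        if link:
            scrape_db(substance, link, db)

    def make_links(self):
        for db in databases:
            scrape_db_task.delay(self.pk, db)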

That data note update might be a bottleneck when more than one scrape_db task in parallel retrieves the same data item, say aqueous solubility. We want aqueous solubility from all databases in the same note so the user can easily compare the different values and decide which one to use.

So parallelising everything might eventually be somewhat problematic. It all has to squeeze through Postgres atomic transactions right at the end. I suppose this is a perfect example of an IO-bound task.
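If the scrapes do run concurrently, one way to keep that shared-note update safe, assuming a Django Note model with a text field (the model and field names here are made up), is to lock the row inside an atomic block:

    from django.db import transaction

    def append_to_note(substance, heading, new_text):
        # Hypothetical helper: get_or_create the note for this data item and
        # lock its row so concurrent scrapers append one at a time.
        with transaction.atomic():
            note, _created = Note.objects.select_for_update().get_or_create(
                substance=substance, heading=heading
            )
            note.text = (note.text or "") + "\n" + new_text
            note.save()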

Another thing: the app is (currently) all server-side. I'm not (yet) using AJAX to update the screen when the data becomes available.

Cheers

Mike


On 09/03/18 18:41, Mike Dewhirst wrote:
https://media.readthedocs.org/pdf/joblib/latest/joblib.pdf

I'm trying to make the following code run in parallel on separate CPU cores but haven't had any success.

def make_links(self):
    for db in databases:
        link = create_useful_link(self, Link, db)
        if link:
            scrape_db(self, link, db)

This is a web scraper which is working nicely in a leisurely sequential manner. databases is a list of URLs with gaps to be filled by create_useful_link(), which makes a link record from the Link class. The self instance is the source of attributes for filling the URL gaps: self is a chemical substance, and when the link record's url field is clicked in a browser it brings up that external website with the chemical substance already selected for the viewer to research. If the link is created successfully, we then fetch the external page, scrape a bunch of interesting data from it and turn that into substance notes. scrape_db() doesn't return anything, but it does create up to nine other records.
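Purely for illustration, a guess at the shapes involved (none of these names are from the real code):

    # Illustrative only: a "database" entry is imagined as a URL template with
    # a gap filled from the substance; a Link row is created if the page exists.
    def create_useful_link(substance, Link, db):
        url = db.format(substance.cas_number)   # fill the gap in the URL
        if page_exists(url):                    # hypothetical availability check
            return Link.objects.create(substance=substance, url=url)
        return None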

         from joblib import Parallel, delayed

         class Substance( etc ..
             ...
             def make_links(self):
                 #Parallel(n_jobs=-2)(delayed(
                 #    scrape_db(self, create_useful_link(self, Link, db), db) for db in databases
                 #))

I'm getting a TypeError from Parallel delayed() - can't pickle generator objects

So my question is: how do I write the commented code properly? I suspect I haven't done enough comprehension.

Thanks for any help

Mike


_______________________________________________
melbourne-pug mailing list
melbourne-pug@python.org
https://mail.python.org/mailman/listinfo/melbourne-pug
