On 10/03/2018 5:13 PM, Mike Dewhirst wrote:
I've run the process a couple of times and there doesn't seem to be an appreciable difference. Both methods take enough time to boil the kettle. I know that isn't proper testing. It might be difficult to test timing accurately when we are waiting on websites all over the world to respond. I might set up a long-running test to try to smooth out the differences.
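Something like this might do for a rough comparison (just a sketch, not a proper benchmark; "substance" stands in for an existing Substance instance loaded in a Django shell):

    import time

    # Rough timing sketch (names assumed): call make_links() a few times on an
    # existing Substance instance and average the wall-clock time.
    runs = 3
    start = time.perf_counter()
    for _ in range(runs):
        substance.make_links()
    elapsed = time.perf_counter() - start
    print(f"average over {runs} runs: {elapsed / runs:.1f} s")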

Mmmmmmmm.

parallel.py:547: UserWarning: Multiprocessing-backed parallel loops cannot be nested below threads, setting n_jobs=1
  **self._backend_args)

I think I'll go back to sequential scraping and slap myself on the wrist for premature optimisation.
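That said, since the scraping is IO-bound, joblib's threading backend avoids the multiprocessing-below-threads restriction the warning refers to. A rough, untested sketch of the same call with that backend (the helper names are the ones from the code further down):

    from joblib import Parallel, delayed

    def make_links(self):
        # Threads instead of processes: fine for IO-bound scraping, and not
        # subject to the "cannot be nested below threads" fallback above.
        Parallel(n_jobs=-2, backend="threading")(
            delayed(scrape_db)(self, create_useful_link(self, Link, db), db)
            for db in databases
        )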


M

On 10/03/2018 5:04 PM, Mike Dewhirst wrote:
On 9/03/2018 7:30 PM, Alejandro Dubrovsky wrote:
delayed is a decorator, so it takes a function or a method. You are passing it a generator instead.

def make_links(self):
    Parallel(n_jobs=-2)(
        delayed(scrape_db)(self, create_useful_link(self, Link, db), db)
        for db in databases
    )

should work,

Yes it does :) Thank you Alejandro

but it will only parallelise over the scrape_db calls, not the create_useful_link calls I think. Which of the two do you want to parallelise over? Or were you after parallelising both?
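If you did want both, one option (just a sketch, with a small wrapper function I've made up; Link, databases and the two helpers are assumed to be in scope as in your code) is to push both calls into the worker:

    from joblib import Parallel, delayed

    def make_link_and_scrape(substance, db):
        # Hypothetical wrapper: each worker creates the link and, if that
        # succeeds, scrapes the database, so both steps run in parallel.
        link = create_useful_link(substance, Link, db)
        if link:
            scrape_db(substance, link, db)

    def make_links(self):
        Parallel(n_jobs=-2)(
            delayed(make_link_and_scrape)(self, db) for db in databases
        )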

I think I probably want to use Celery (thanks, Ed, for the suggestion) or something similar so I can loop through the (currently) nine databases and kick off a scrape_db() task for each. Each scrape_db task then looks for (currently) ten data items of specific interest. Having scraped a data item, we need to get_or_create (this is in Django) the specific data note and append the result to whatever is already there.
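Roughly what I have in mind with Celery (a sketch only; the task name and the substance_id argument are placeholders, and the app's own helpers are assumed to be importable):

    from celery import shared_task

    @shared_task
    def scrape_db_task(substance_id, db):
        # Hypothetical task: re-fetch the substance inside the worker, build
        # the link and scrape it, one task per database.
        substance = Substance.objects.get(pk=substance_id)
        link = create_useful_link(substance, Link, db)
        if link:
            scrape_db(substance, link, db)

    def make_links(self):
        for db in databases:
            scrape_db_task.delay(self.pk, db)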

That data note update might be a bottleneck when more than one scrape_db task in parallel retrieves the same data item, say aqueous solubility. We want aqueous solubility from all databases in the same note so the user can easily compare the different values and decide which one to use.

So parallelising everything might eventually be somewhat problematic. It all has to squeeze through Postgres atomic transactions right at the end. I suppose this is a perfect example of an IO-bound task.
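If the scrapes do run concurrently, one way to keep that shared-note update safe, assuming a Django Note model with a text field (the model and field names here are made up), is to lock the row inside an atomic block:

    from django.db import transaction

    def append_to_note(substance, heading, new_text):
        # Hypothetical helper: get_or_create the note for this data item and
        # lock its row so concurrent scrapers append one at a time.
        with transaction.atomic():
            note, _created = Note.objects.select_for_update().get_or_create(
                substance=substance, heading=heading
            )
            note.text = (note.text or "") + "\n" + new_text
            note.save()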

Another thing: the app is (currently) all server-side. I'm not (yet) using AJAX to update the screen when the data becomes available.

Cheers

Mike


On 09/03/18 18:41, Mike Dewhirst wrote:
https://media.readthedocs.org/pdf/joblib/latest/joblib.pdf

I'm trying to make the following code run in parallel on separate CPU cores but haven't had any success.

def make_links(self):
    for db in databases:
        link = create_useful_link(self, Link, db)
        if link:
            scrape_db(self, link, db)

This is a web scraper which is working nicely in a leisurely sequential manner. databases is a list of URLs with gaps to be filled by create_useful_link(), which makes a link record from the Link class. The self instance is the source of attributes for filling the URL gaps: self is a chemical substance, and when the link record's url field is clicked in a browser it brings up that external website with the chemical substance already selected for the viewer to research. If the link is created successfully, we then fetch the external page, scrape a bunch of interesting data from it and turn that into substance notes. scrape_db() doesn't return anything, but it does create up to nine other records.
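Purely for illustration, a guess at the shapes involved (none of these names are from the real code):

    # Illustrative only: a "database" entry is imagined as a URL template with
    # a gap filled from the substance; a Link row is created if the page exists.
    def create_useful_link(substance, Link, db):
        url = db.format(substance.cas_number)   # fill the gap in the URL
        if page_exists(url):                    # hypothetical availability check
            return Link.objects.create(substance=substance, url=url)
        return None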

         from joblib import Parallel, delayed

         class Substance( etc ..
             ...
             def make_links(self):
                 #Parallel(n_jobs=-2)(delayed(
                 #    scrape_db(self, create_useful_link(self, Link, db), db) for db in databases
                 #))

I'm getting a TypeError from Parallel delayed() - can't pickle generator objects

So my question is: how do I write the commented code properly? I suspect I haven't done enough comprehension.

Thanks for any help

Mike


_______________________________________________
melbourne-pug mailing list
melbourne-pug@python.org
https://mail.python.org/mailman/listinfo/melbourne-pug
