Re: [Rdkit-discuss] Dividing inputstream over threads
On Mon, 21 Jan 2019 09:43:48 +0100 Markus Sitzmann wrote:

> There is no need for objects with SQLAlchemy, SQLAlchemy's Core and
> its expression language is pretty excellent without objects ...

I spent weeks last year rewriting code that I myself wrote back when I believed that... When I wrote it originally, as I was getting deeper in, SQLAlchemy changed my mind.

--
Dmitri Maziuk

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Dividing inputstream over threads
Another option is dask (https://docs.dask.org/en/latest/). I've used `map_partitions` from dask to bulk-convert a column of SMILES strings into various computed properties. You could then write the results to a CSV file or a database.

--
Peter
Re: [Rdkit-discuss] Dividing inputstream over threads
> SQLalchemy creates a fairly specific ecosystem that you have to buy
> into for it to make sense. When you don't have objects, only a table
> of properties, OR mapper is just bloat.

There is no need for objects with SQLAlchemy; SQLAlchemy's Core and its expression language are pretty excellent without objects ...

> With parallel processing your bottleneck is going to be database
> inserts. One option is write out CSV file(s) from each thread/job,
> concatenate them in the final node, and then bulk-import into the
> database: typically CSV (or other such format) bulk import is orders
> of magnitude faster than inserting one SQL statement at a time.

... and bulk-inserts of Python data types into the database.

Markus
Re: [Rdkit-discuss] Dividing inputstream over threads
On Sun, 20 Jan 2019 12:03:50 +0100 Shojiro Shibayama wrote:

> ... I guess SQLalchemy
> in python might be good, but I'm not sure. Hope that you'll find out
> a good library of SQL OR mapper for python.

SQLalchemy creates a fairly specific ecosystem that you have to buy into for it to make sense. When you don't have objects, only a table of properties, an OR mapper is just bloat.

With parallel processing your bottleneck is going to be database inserts. One option is to write out CSV file(s) from each thread/job, concatenate them in the final node, and then bulk-import into the database: typically CSV (or other such format) bulk import is orders of magnitude faster than inserting one SQL statement at a time.

--
Dmitri Maziuk
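The CSV route described above might look roughly like the sketch below. It is only an illustration under several assumptions: `toy_properties` is a hypothetical placeholder for the real property calculation (no RDKit call is made), the three steps run sequentially here rather than in separate jobs, and SQLite's `executemany` inside a single transaction stands in for a true bulk loader, since the thread does not say which database is in use. Many database servers have a dedicated CSV import command that would replace the last step.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def toy_properties(smiles):
    """Hypothetical placeholder for a real property calculation (e.g. with RDKit)."""
    return (smiles, len(smiles), smiles.count("C"))


def write_part(smiles_chunk, path):
    """What each worker/job would do: write its results to its own CSV file."""
    with open(path, "w", newline="") as fh:
        csv.writer(fh).writerows(toy_properties(s) for s in smiles_chunk)


def concatenate(part_paths, merged_path):
    """Final node: append the per-worker files into one CSV file."""
    with open(merged_path, "w", newline="") as out:
        for part in part_paths:
            with open(part, newline="") as fh:
                out.write(fh.read())


def bulk_import(merged_path, conn):
    """Load the merged CSV in one transaction instead of one commit per row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS props (smiles TEXT, length INTEGER, n_c INTEGER)"
    )
    # `with conn` commits once at the end (or rolls back on error).
    with open(merged_path, newline="") as fh, conn:
        conn.executemany("INSERT INTO props VALUES (?, ?, ?)", csv.reader(fh))


def run_demo(chunks):
    """Run the three steps one after another and return the imported rows."""
    conn = sqlite3.connect(":memory:")
    with tempfile.TemporaryDirectory() as tmp:
        parts = []
        for i, chunk in enumerate(chunks):
            part = Path(tmp) / f"part_{i}.csv"
            write_part(chunk, part)
            parts.append(part)
        merged = Path(tmp) / "merged.csv"
        concatenate(parts, merged)
        bulk_import(merged, conn)
    return conn.execute(
        "SELECT smiles, length, n_c FROM props ORDER BY rowid"
    ).fetchall()
```

Because every worker owns its output file, no locking is needed until the final import, which is the point of the suggestion.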
Re: [Rdkit-discuss] Dividing inputstream over threads
Hi,

The Python standard library's multiprocessing module may help you parallelize your code. In the short post below, I wrote code that converts SMILES to hashed Morgan fingerprints in parallel. It took 10 minutes for 1.5 million compounds with 6 processes.

https://loudspeaker.sakura.ne.jp/devblog/2019/01/20/python-multiprocessing-write-strings-single/

multiprocessing.Pool.imap can be used in a for loop, and the loop body can then safely write to a text file or even to your SQL database. I guess SQLAlchemy in Python might be good, but I'm not sure. I hope you'll find a good SQL OR mapper library for Python.

Sincerely yours,
Shojiro
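A minimal sketch of the imap pattern mentioned above, not the code from the linked post: `describe` is a hypothetical stand-in for the real per-molecule work (it only reports the string length), and `convert_file` assumes a SMILES file with one molecule per line and the SMILES as the first whitespace-separated token. The workers only compute; the calling process is the only one that writes, which is what makes the output safe.

```python
import multiprocessing


def describe(smiles):
    """Hypothetical stand-in for the real work, e.g. an RDKit fingerprint."""
    return f"{smiles}\t{len(smiles)}"


def write_results(pool, smiles_iter, out, chunksize=1000):
    """Compute in the pool's workers, write from the calling process only.

    imap yields results lazily and in input order, so this loop is the
    single place that touches `out`. Returns the number of lines written.
    """
    count = 0
    for line in pool.imap(describe, smiles_iter, chunksize=chunksize):
        out.write(line + "\n")
        count += 1
    return count


def convert_file(smiles_path, out_path, processes=6):
    """Read SMILES from a file, process them in parallel, write one output file."""
    with multiprocessing.Pool(processes) as pool, \
            open(smiles_path) as src, open(out_path, "w") as out:
        # Assumption: the SMILES string is the first token on each non-empty line.
        smiles = (line.split()[0] for line in src if line.strip())
        return write_results(pool, smiles, out)
```

On platforms where worker processes are spawned rather than forked, `convert_file` should be called under an `if __name__ == "__main__":` guard. The `out.write` call could be swapped for a database insert on a single connection held by the parent process.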
Re: [Rdkit-discuss] Dividing inputstream over threads
On 15/01/2019 09:53, Andreas Luttens wrote:

> ... I would like my code to be in C++ as well.

In C++, you could use OpenMP and the parallel for pragma.
[Rdkit-discuss] Dividing inputstream over threads
Hi!

I have developed a small script that calculates molecular properties for molecules stored in a SMILES file. The properties should be stored in an SQL database, which works fine, but I would like to speed up the process a bit. I was thinking of parallelizing the property calculation and writing through separate connections to my SQL database. I have done this before in Python with OpenEye, and it seems to do the trick. However, I would like my code to be usable by people who do not hold an OpenEye license, which is why I am trying RDKit. I would also like my code to be in C++.

I was wondering how to tackle this problem. Does the RDKit have functionality similar to "oemolithread" for chunking up the incoming stream? I did not find anything like it when I first scrolled through the documentation. If it is not implemented, how would I divide the work on incoming molecules over N threads?

All help is much appreciated. Thanks in advance.

Best regards,
Andreas Luttens