Re: [Rdkit-discuss] Dividing inputstream over threads

2019-01-21 Thread Dmitri Maziuk via Rdkit-discuss
On Mon, 21 Jan 2019 09:43:48 +0100
Markus Sitzmann  wrote:
 
> There is no need for objects with SQLAlchemy; SQLAlchemy's Core and
> its expression language are pretty excellent without objects ...

I spent weeks last year rewriting code that I myself wrote back when I
believed that... When I wrote it originally, as I was getting deeper
in, SQLAlchemy changed my mind.

-- 
Dmitri Maziuk 


___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Dividing inputstream over threads

2019-01-21 Thread Peter St. John
Another option is dask (https://docs.dask.org/en/latest/). I've used
`map_partitions` from dask to bulk convert a column of smiles strings into
various computed properties. You could then output to a CSV or other
database file.

-- Peter



Re: [Rdkit-discuss] Dividing inputstream over threads

2019-01-21 Thread Markus Sitzmann
> SQLAlchemy creates a fairly specific ecosystem that you have to buy
> into for it to make sense. When you don't have objects, only a table
> of properties, an OR mapper is just bloat.

There is no need for objects with SQLAlchemy; SQLAlchemy's Core and its
expression language are pretty excellent without objects ...

> With parallel processing your bottleneck is going to be database
> inserts. One option is to write out CSV file(s) from each thread/job,
> concatenate them in the final node, and then bulk-import into the
> database: typically CSV (or other such format) bulk import is orders
> of magnitude faster than inserting one SQL statement at a time.

... and bulk-inserts of Python data types into the database.
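
A minimal sketch of what bulk-inserting plain Python data with
SQLAlchemy Core (no ORM classes) could look like. The table name, the
column names and the in-memory SQLite URL are assumptions made for
illustration; none of them come from this thread.

```python
from sqlalchemy import Column, Float, MetaData, String, Table, create_engine

metadata = MetaData()

# Hypothetical table: one row of computed properties per molecule.
properties = Table(
    "properties",
    metadata,
    Column("smiles", String, primary_key=True),
    Column("mol_weight", Float),
)


def bulk_insert(engine, rows):
    """Insert a list of plain dicts with a single execute() call."""
    metadata.create_all(engine)
    with engine.begin() as conn:  # one transaction, committed on exit
        conn.execute(properties.insert(), rows)


# In-memory SQLite stands in for the real database here.
engine = create_engine("sqlite://")
bulk_insert(
    engine,
    [
        {"smiles": "CCO", "mol_weight": 46.07},
        {"smiles": "c1ccccc1", "mol_weight": 78.11},
    ],
)

with engine.connect() as conn:
    stored = conn.execute(properties.select()).fetchall()
```

Passing a list of dicts to execute() lets SQLAlchemy hand the whole
batch to the driver at once instead of issuing one statement per row.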

Markus



Re: [Rdkit-discuss] Dividing inputstream over threads

2019-01-20 Thread Dmitri Maziuk via Rdkit-discuss
On Sun, 20 Jan 2019 12:03:50 +0100
Shojiro Shibayama  wrote:

> ... I guess SQLAlchemy
> in Python might be good, but I'm not sure. I hope that you'll find
> a good SQL OR mapper library for Python.

SQLAlchemy creates a fairly specific ecosystem that you have to buy
into for it to make sense. When you don't have objects, only a table
of properties, an OR mapper is just bloat.

With parallel processing your bottleneck is going to be database
inserts. One option is to write out CSV file(s) from each thread/job,
concatenate them in the final node, and then bulk-import into the
database: typically CSV (or other such format) bulk import is orders
of magnitude faster than inserting one SQL statement at a time.
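
The workflow above could be sketched roughly as follows, using only the
Python standard library. SQLite stands in for the real database, and the
file names, the table name "props" and its two columns are hypothetical.
With a server database the final step would normally be the server's own
bulk loader (for example COPY in PostgreSQL) rather than executemany().

```python
import csv
import shutil
import sqlite3
import tempfile
from pathlib import Path


def write_part(path, rows):
    """What each thread/job would do: dump its results to its own CSV."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)


def concatenate(part_paths, merged_path):
    """Final node: append the header-less part files into one CSV."""
    with open(merged_path, "w", newline="") as merged:
        for part_path in part_paths:
            with open(part_path, newline="") as part:
                shutil.copyfileobj(part, merged)
    return merged_path


def bulk_import(csv_path, db_path):
    """Load the merged CSV in one batch; return the table's row count."""
    conn = sqlite3.connect(str(db_path))
    try:
        with conn:  # inserts are committed once on exit, rolled back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS props (smiles TEXT, mol_weight REAL)"
            )
            with open(csv_path, newline="") as handle:
                conn.executemany(
                    "INSERT INTO props VALUES (?, ?)", csv.reader(handle)
                )
        return conn.execute("SELECT COUNT(*) FROM props").fetchone()[0]
    finally:
        conn.close()


# Small demonstration with two "jobs" writing into a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    parts = [tmp / "part0.csv", tmp / "part1.csv"]
    write_part(parts[0], [("CCO", 46.07)])
    write_part(parts[1], [("c1ccccc1", 78.11), ("CC(=O)O", 60.05)])
    merged = concatenate(parts, tmp / "all.csv")
    imported = bulk_import(merged, tmp / "props.db")
```

Because every job writes only to its own file, no locking is needed
until the single import step at the end.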

-- 
Dmitri Maziuk 




Re: [Rdkit-discuss] Dividing inputstream over threads

2019-01-20 Thread Shojiro Shibayama
Hi,

The Python standard library module multiprocessing may help you to
parallelize your code.

I wrote some code that converts SMILES to hashed Morgan fingerprints using
parallel computation; it is described in the following short post. The code
took 10 minutes for 1.5 million compounds when 6 processes were used.
https://loudspeaker.sakura.ne.jp/devblog/2019/01/20/python-multiprocessing-write-strings-single/

multiprocessing.Pool.imap can be used in a for loop, which can then safely
write to a text file or even your SQL database. I guess SQLAlchemy in Python
might be good, but I'm not sure. I hope that you'll find a good SQL OR
mapper library for Python.
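
A rough sketch of the imap-in-a-for-loop pattern, not the code from the
linked post. The function names are made up, and fake_property() is only
a placeholder for a real per-molecule RDKit calculation. The workers do
the computation; only the calling process writes, so the output file (or
database connection) is never shared between processes.

```python
import io
import multiprocessing
from multiprocessing.pool import ThreadPool


def fake_property(smiles):
    """Hypothetical stand-in for a per-molecule RDKit calculation."""
    return smiles, len(smiles)


def write_properties(pool, smiles_iter, out_handle, chunksize=1000):
    """Compute in the pool's workers, write from the calling process only."""
    count = 0
    # imap() yields results lazily and in input order; chunksize controls
    # how many inputs are handed to a worker at a time.
    for smiles, value in pool.imap(fake_property, smiles_iter, chunksize=chunksize):
        out_handle.write(f"{smiles}\t{value}\n")
        count += 1
    return count


def main(smiles_path, out_path, processes=6):
    """Process pool version; assumes one SMILES string per input line."""
    with multiprocessing.Pool(processes) as pool, \
            open(smiles_path) as src, open(out_path, "w") as out:
        lines = (line.strip() for line in src if line.strip())
        return write_properties(pool, lines, out)


# Quick self-check with a thread pool, which exposes the same imap()
# method; real CPU-bound work would go through main() above instead.
with ThreadPool(2) as pool:
    demo_out = io.StringIO()
    demo_count = write_properties(
        pool, ["CCO", "c1ccccc1", "N"], demo_out, chunksize=2
    )
```

main() should be called from under an `if __name__ == "__main__":` guard
so that it also works on platforms where worker processes are spawned
rather than forked.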

Sincerely yours,
Shojiro




Re: [Rdkit-discuss] Dividing inputstream over threads

2019-01-14 Thread Francois Berenger

On 15/01/2019 09:53, Andreas Luttens wrote:

> Hi!
>
> I have developed a small script that calculates molecular properties
> for molecules that are stored in a SMILES file. The properties should
> be stored in an SQL database, which works fine, but I would like to
> speed up the process a bit. I was thinking of implementing some
> parallelization for calculating the properties and storing them via
> separate connections to my SQL database. I have done this before in
> Python with OpenEye and it seems to do the trick. I would however
> want my code to be usable by people who do not hold a license for
> OpenEye, which is why I am trying RDKit. I would like my code to be
> in C++ as well.

In C++, you could use OpenMP and the parallel for pragma.

> I was wondering how I would tackle this problem. Does the RDKit have
> functionality similar to an "oemolithread" to chunk up the incoming
> stream? I didn't find anything like this when I first scrolled
> through the documentation. If it is not implemented, how would I
> divide the work on incoming molecules over N threads?
>
> All help is much appreciated. Thanks in advance.
>
> Best regards,
>
> Andreas Luttens


[Rdkit-discuss] Dividing inputstream over threads

2019-01-14 Thread Andreas Luttens
Hi!

I have developed a small script that calculates molecular properties for
molecules that are stored in a SMILES file. The properties should be stored
in an SQL database, which works fine, but I would like to speed up the
process a bit. I was thinking of implementing some parallelization for
calculating the properties and storing them via separate connections to my
SQL database. I have done this before in Python with OpenEye and it seems to
do the trick. I would however want my code to be usable by people who do
not hold a license for OpenEye, which is why I am trying RDKit. I would like
my code to be in C++ as well.

I was wondering how I would tackle this problem. Does the RDKit have
functionality similar to an "oemolithread" to chunk up the incoming stream?
I didn't find anything like this when I first scrolled through the
documentation. If it is not implemented, how would I divide the work on
incoming molecules over N threads?

All help is much appreciated. Thanks in advance.

Best regards,

Andreas Luttens