Re: [Rdkit-discuss] RDkit in Python vs. on PostgreSQL?

2020-03-02 Thread Thomas Strunz
Hi Deepti,

for the Spark part, I'd say you simply generate all the fingerprints (locally or 
on the cluster) and store the generated list of fingerprints as a pickle file. 
Then, when running your test, you simply load the pickle file into memory. With 
15 GB of memory and 2 million molecules this should easily work out just fine, 
for a test obviously.
I have a simple web app that does exactly this, albeit with only about 200k 
molecules using 400 MB of RAM, most of which I assume is taken up by the 
fingerprints. That would mean the 2 million fingerprints would only use about 
4 GB of RAM.

Still, this raises the question of what this would be used for, as this 
approach obviously doesn't scale at all, and you would need some way of storing 
the fingerprints on Spark as well. Also, if your goal is to do similarity 
searches against lots of fingerprints, I suggest you have a look at chemfp.
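A minimal sketch of the pickle-caching idea above, using plain Python ints as stand-in bit-vector fingerprints (in real code these would be RDKit fingerprint objects, which also pickle cleanly; the IDs and values below are made up):

```python
import pickle

# Stand-in "fingerprints": one integer bitmask per (hypothetical) molecule ID.
# With RDKit these would be real bit-vector fingerprints generated once,
# locally or on the cluster.
fingerprints = {
    "mol-1": 0b1011_0110,
    "mol-2": 0b1111_0000,
    "mol-3": 0b0011_0111,
}

# One-off step: generate the fingerprints and store them on disk.
with open("fps.pkl", "wb") as fh:
    pickle.dump(fingerprints, fh)

# At test/query time: load the whole set into memory once.
with open("fps.pkl", "rb") as fh:
    cached = pickle.load(fh)

print(len(cached))  # 3
```

At 2 million molecules the loaded dictionary stays a few GB at most, which fits the 15 GB nodes described below.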

Best Regards,

Thomas


___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net<mailto:Rdkit-discuss@lists.sourceforge.net>
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss




Re: [Rdkit-discuss] RDkit in Python vs. on PostgreSQL?

2020-02-26 Thread Tim Dudgeon
Well, as I mentioned previously, the big difference is that from Python you are 
iterating through the molecules, calculating the fingerprints, and then 
comparing those fingerprints, whereas in the PostgreSQL cartridge the 
fingerprints are already generated and indexed, so the search is mostly a 
matter of querying the index, which is very fast.


If you are repeatedly running queries against the same set of molecules then 
the cartridge will be the way to go. Doing it procedurally from Python only 
really makes sense if you have a relatively small dataset and/or if the 
molecules you are searching are different every time.


In principle you should be able to cache the fingerprints in Python to avoid 
needing to recalculate them, but you would effectively be reimplementing logic 
that is already present in the cartridge, and the cartridge will be much more 
effective.
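The in-Python caching described here might look roughly like this, with plain ints standing in for bit-vector fingerprints and a hand-rolled Tanimoto (in real code RDKit's DataStructs routines would do the comparison, and the IDs below are invented):

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity between two bit vectors stored as ints."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0

# Fingerprints calculated once and held in memory (the "cache").
cache = {
    "mol-1": 0b1111_0000,
    "mol-2": 0b1011_0110,
    "mol-3": 0b0000_1111,
}

query = 0b1111_0000
# Each new query reuses the cached fingerprints; nothing is recalculated.
hits = {mid: tanimoto(query, fp) for mid, fp in cache.items()}
best = max(hits, key=hits.get)
print(best, hits[best])  # mol-1 1.0
```

This is exactly the index-plus-precomputed-fingerprints logic the cartridge already provides, which is the point being made above.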


Tim



Re: [Rdkit-discuss] RDkit in Python vs. on PostgreSQL?

2020-02-26 Thread Deepti Gupta via Rdkit-discuss
Hi Tim,

Thank you!

I'll be more detailed in my post, sorry about that. As this was a PoC, I had a 
Spark cluster on Google Cloud with 2 worker nodes, each with 4 vCPUs, a 500 GB 
disk, and 15 GB of memory. I timed the response against 2 million data points 
consisting of ChEMBL IDs and SMILES structures.

Substructure search - 2 mins
Similarity search - 43 mins

PostgreSQL was installed on a VM with 4 vCPUs, a 500 GB disk, and 15 GB of 
memory. shared_buffers = 2048MB was set in the postgresql.conf file.

Substructure search - within 5 secs
Similarity search - within 3 secs

I tried to store the converted molecules and fingerprints in a file to get 
better performance from the PySpark program but was not able to do so.

Regards,
DA


Re: [Rdkit-discuss] RDkit in Python vs. on PostgreSQL?

2020-02-25 Thread Tim Dudgeon
I think you need to explain what benchmarks you are running and what is 
really meant by "faster". And what hardware (for Spark, how many nodes and 
how big; for PostgreSQL, what size of server and what settings, especially 
the shared_buffers setting).


A very obvious critique of what you reported is that what you describe 
as "running in Python" includes generating the fingerprints for each 
molecule on the fly, whereas for "the cartridge" these are already 
calculated, so it will obviously be much faster (as the fingerprint 
generation dominates the compute).
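One way to see this point is to time the two phases separately. This toy sketch uses repeated hashing as a stand-in for per-molecule fingerprint generation (not real RDKit calls) and a simple bit-vector Tanimoto for comparison; the generation/comparison split is the same shape as the real workload:

```python
import hashlib
import time

def make_fingerprint(smiles: str) -> int:
    # Stand-in for fingerprint generation: deliberately does some
    # per-molecule work, the way real molecule perception does.
    h = hashlib.sha256(smiles.encode())
    for _ in range(50):
        h = hashlib.sha256(h.digest())
    return int.from_bytes(h.digest()[:8], "big")

def tanimoto(a: int, b: int) -> float:
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0

smiles_list = [f"C{'C' * (i % 20)}O" for i in range(5000)]
query = make_fingerprint("CCO")

t0 = time.perf_counter()
fps = [make_fingerprint(s) for s in smiles_list]   # generation phase
t1 = time.perf_counter()
sims = [tanimoto(query, fp) for fp in fps]         # comparison phase
t2 = time.perf_counter()

print(f"generation: {t1 - t0:.3f}s, comparison: {t2 - t1:.3f}s")
```

With precomputed fingerprints (as in the cartridge), only the cheap comparison phase is paid per query.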


Tim



[Rdkit-discuss] RDkit in Python vs. on PostgreSQL?

2020-02-25 Thread Deepti Gupta via Rdkit-discuss
Hi Gurus,

I'm absolutely new to the cheminformatics domain. I've been assigned a PoC 
where I have to compare RDKit in Python and RDKit on PostgreSQL. I've installed 
both and am trying some hands-on exercises to understand the differences. What 
I've understood is that structure searches are slower in Python (Spark cluster) 
than in the PostgreSQL database. Please correct me if I'm wrong, as I'm a 
newbie in this and may be talking silly.
The similarity search using the below functions (example) -
Python methods -

fps = FingerprintMols.FingerprintMol(Chem.MolFromSmiles(smile_structure, sanitize=False))
similarity = DataStructs.TanimotoSimilarity(fps1, fps2)
takes too long (45 minutes) for a 2 million-molecule file, while the same thing 
is very quick (in seconds) on PostgreSQL.

Database functions -
select count(*)
from (select modality_id, m,
             tanimoto_sml(morganbv_fp(mol_from_smiles('CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O'::cstring)), mfp2) as similarity
      from fingerprints join mols using (modality_id)) as fps
where similarity between 0.45 and 0.50;
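In plain terms, the SQL above performs the following filter, only against precomputed mfp2 fingerprints inside the database (the similarity values below are toy numbers for illustration):

```python
# Toy (id, similarity) pairs standing in for the SQL join result.
rows = [("mol-1", 0.47), ("mol-2", 0.12),
        ("mol-3", 0.50), ("mol-4", 0.81)]

# WHERE similarity BETWEEN 0.45 AND 0.50 (inclusive, as in SQL)
count = sum(1 for _id, sim in rows if 0.45 <= sim <= 0.50)
print(count)  # 2
```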
Does this mean that for production workloads one must always use a database 
cartridge, like RDKit, Bingo, etc.?

Regards,
DA