Re: [Rdkit-discuss] RDkit in Python vs. on PostgreSQL?

Tim Dudgeon Wed, 26 Feb 2020 06:01:54 -0800

Well, as I mentioned previously, the big difference is that from Python you are iterating through the molecules, calculating the fingerprints and then doing a comparison on the fingerprints, whereas in the PostgreSQL cartridge the fingerprints are already generated and indexed, so the search is mostly about querying the index, which will be very fast.

If you are repeatedly running queries against the same set of molecules then the cartridge will be the way to go. Doing it procedurally from Python only really makes sense if you have a relatively small dataset and/or if the molecules you are searching are different every time.

In principle you should be able to cache the fingerprints in Python to avoid needing to recalculate them, but effectively you're implementing logic that is already present in the cartridge, and the cartridge's version will be much more effective.
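
For illustration only, the caching idea could be sketched roughly as below. This is a minimal sketch, not what the cartridge does internally: toy_fingerprint is a hypothetical stand-in that hashes character bigrams of the SMILES string (it is not chemically meaningful), fingerprints are held as plain Python ints, and the function names and the pickle file are assumptions made for this example. A real RDKit fingerprinter would replace toy_fingerprint.

```python
import pickle
from pathlib import Path


def toy_fingerprint(smiles: str, n_bits: int = 1024) -> int:
    """Hypothetical stand-in for a real fingerprinter (not chemically meaningful).

    Sets one bit per character bigram of the SMILES string and returns the
    bit vector packed into a Python int.
    """
    bits = 0
    for i in range(len(smiles) - 1):
        # Deterministic bigram hash (the built-in hash() is salted per process).
        h = (ord(smiles[i]) * 131 + ord(smiles[i + 1])) % n_bits
        bits |= 1 << h
    return bits


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two bit vectors: |a AND b| / |a OR b|."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def load_or_build_cache(smiles_list, cache_path, fingerprint=toy_fingerprint):
    """Compute fingerprints once, then reuse them from disk on later runs."""
    path = Path(cache_path)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    cache = {smi: fingerprint(smi) for smi in smiles_list}
    with path.open("wb") as fh:
        pickle.dump(cache, fh)
    return cache


def similarity_search(query_smiles, cache, low, high, fingerprint=toy_fingerprint):
    """Return (smiles, score) pairs whose similarity to the query is in [low, high]."""
    query_fp = fingerprint(query_smiles)  # only the query is fingerprinted per search
    hits = []
    for smi, fp in cache.items():
        score = tanimoto(query_fp, fp)
        if low <= score <= high:
            hits.append((smi, score))
    return hits
```

Each search then fingerprints only the query and compares it against the stored values. Note that this sketch never checks whether the cache file still matches the molecule list, and it still scans every entry, which is the kind of bookkeeping and indexing the cartridge already handles.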


Tim

On 26/02/2020 08:46, Deepti Gupta wrote:

Hi Tim,

Thank you!
I'll be more detailed in my post, sorry about that. As this was a PoC, I had a Spark cluster with 2 worker nodes with 4 vCPUs, disk size 500GB and memory 15GB on Google Cloud. I timed the response against 2 million data points consisting of ChEMBL IDs and SMILES structures.
Substructure search - 2 mins
Similarity search - 43 mins
The PostgreSQL DB was installed on a VM with 4 vCPUs, a disk size of 500GB and 15GB memory. shared_buffers was set to 2048MB in the postgresql.conf file.
Substructure search - within 5 secs
Similarity search - within 3 secs
I tried to store the converted molecules and fingerprints in a file to get better performance while trying the PySpark program but was not able to do so.
Regards,
DA
On Wednesday, February 26, 2020, 12:57:43 AM GMT+5:30, Tim Dudgeon <[email protected]> wrote:
I think you need to explain what benchmarks you are running and what is really meant by "faster". And what hardware (for Spark how many nodes, how big; for PostgreSQL what size server, what settings, esp. the shared_buffers setting).
A very obvious critique of what you reported is that what you describe as "running in Python" includes generating the fingerprints for each molecule on the fly, whereas for "the cartridge" these are already calculated, so will obviously be much faster (as the fingerprint generation dominates the compute).
Tim

On 25/02/2020 11:14, Deepti Gupta via Rdkit-discuss wrote:
Hi Gurus,
I'm absolutely new to the cheminformatics domain. I've been assigned a PoC where I have to compare RDKit in Python and RDKit on PostgreSQL. I've installed both and am trying some hands-on exercises to understand the differences. What I've understood is that structure searches are slower in Python (Spark cluster) than in the PostgreSQL database. Please correct me if I'm wrong, as I'm a newbie in this and may be talking silly.
The similarity search using the below functions (example) -
Python methods -
fps = FingerprintMols.FingerprintMol(Chem.MolFromSmiles(smile_structure, sanitize=False))
similarity = DataStructs.TanimotoSimilarity(fps1, fps2)
takes too long (45 minutes) for a 2 million file while the same thing is very quick (in seconds) on PostgreSQL
Database functions -
select count(*) from (
  select modality_id, m,
         tanimoto_sml(morganbv_fp(mol_from_smiles('CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O'::cstring)), mfp2) as similarity
  from fingerprints join mols using (modality_id)
) as fps
where similarity between 0.45 and 0.50;
Does this mean that for production workloads one must always use a database cartridge only, like RDKit, BINGO, etc.?
Regards,
DA


_______________________________________________
Rdkit-discuss mailing list
[email protected]  
<mailto:[email protected]>
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss