As I mentioned previously, the big difference arises because from Python you are iterating through the molecules, calculating the fingerprint for each one and then comparing the fingerprints. In the PostgreSQL cartridge the fingerprints are already generated and indexed, so the search is mostly a matter of querying the index, which is very fast.
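To illustrate the difference in work, here is a minimal sketch in plain Python. It is not RDKit code: toy_fingerprint is a hypothetical stand-in for a real fingerprinting call, and the SMILES strings, the 0.5 threshold and all function names are assumptions chosen for illustration. It only counts how often fingerprints are generated in each approach.

```python
from typing import Callable, Dict, FrozenSet, List

Fingerprint = FrozenSet[int]


def toy_fingerprint(smiles: str, n_bits: int = 64) -> Fingerprint:
    # Hypothetical stand-in for a real fingerprinter: maps each adjacent
    # character pair to a bit position with a deterministic hash.
    return frozenset((ord(a) * 31 + ord(b)) % n_bits for a, b in zip(smiles, smiles[1:]))


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    # Shared on-bits divided by distinct on-bits; 0.0 if both are empty.
    union = len(a | b)
    return len(a & b) / union if union else 0.0


class CountingFingerprinter:
    # Wraps a fingerprint function and counts how many times it is called.
    def __init__(self, func: Callable[[str], Fingerprint]) -> None:
        self.func = func
        self.calls = 0

    def __call__(self, smiles: str) -> Fingerprint:
        self.calls += 1
        return self.func(smiles)


def search_on_the_fly(query: str, library: List[str],
                      fingerprinter: Callable[[str], Fingerprint],
                      threshold: float) -> List[str]:
    # Regenerates the fingerprint of every library molecule for each query.
    query_fp = fingerprinter(query)
    return [s for s in library if tanimoto(query_fp, fingerprinter(s)) >= threshold]


def search_precomputed(query: str, library_fps: Dict[str, Fingerprint],
                       fingerprinter: Callable[[str], Fingerprint],
                       threshold: float) -> List[str]:
    # Only the query is fingerprinted; library fingerprints are reused.
    query_fp = fingerprinter(query)
    return [s for s, fp in library_fps.items() if tanimoto(query_fp, fp) >= threshold]


library = ["CCO", "CCN", "CCOC(=O)C", "c1ccccc1", "CC(=O)O"]
queries = ["CCO", "CCOC"]

on_the_fly = CountingFingerprinter(toy_fingerprint)
hits_on_the_fly = [search_on_the_fly(q, library, on_the_fly, 0.5) for q in queries]

precomputed = CountingFingerprinter(toy_fingerprint)
library_fps = {s: precomputed(s) for s in library}  # done once, up front
hits_precomputed = [search_precomputed(q, library_fps, precomputed, 0.5) for q in queries]

print(on_the_fly.calls, precomputed.calls)  # prints: 12 7
```

Both approaches return the same hits, but with Q queries over N molecules the first generates Q * (N + 1) fingerprints and the second only N + Q. The sketch does not model the cartridge's index, which reduces the work further by avoiding a comparison against every row.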

If you are repeatedly running queries against the same set of molecules then the cartridge will be the way to go. Doing it procedurally from Python only really makes sense if you have a relatively small dataset and/or if the molecules you are searching are different every time.

In principle you should be able to cache the fingerprints in Python to avoid recalculating them, but you would effectively be reimplementing logic that is already present in the cartridge, and the cartridge's version will be much more effective.
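For what it's worth, the caching idea could look roughly like the sketch below. It is a plain-Python illustration under stated assumptions: toy_fingerprint is a hypothetical placeholder for whatever fingerprinting call you actually use, and load_or_build_fingerprints, the file name and the SMILES strings are made up for the example.

```python
import os
import pickle
import tempfile
from typing import Callable, Dict, FrozenSet, List

Fingerprint = FrozenSet[int]


def toy_fingerprint(smiles: str, n_bits: int = 64) -> Fingerprint:
    # Hypothetical placeholder for a real fingerprinter.
    return frozenset((ord(a) * 31 + ord(b)) % n_bits for a, b in zip(smiles, smiles[1:]))


def load_or_build_fingerprints(smiles_list: List[str], cache_path: str,
                               fingerprinter: Callable[[str], Fingerprint]
                               ) -> Dict[str, Fingerprint]:
    # Reuse the cached fingerprints if the file exists; otherwise compute
    # them once and write them to disk for later runs.
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    fingerprints = {s: fingerprinter(s) for s in smiles_list}
    with open(cache_path, "wb") as handle:
        pickle.dump(fingerprints, handle)
    return fingerprints


library = ["CCO", "CCN", "c1ccccc1"]
calls: List[str] = []


def counting_fingerprinter(smiles: str) -> Fingerprint:
    # Records each call so the effect of the cache is visible.
    calls.append(smiles)
    return toy_fingerprint(smiles)


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "fingerprints.pkl")
    first = load_or_build_fingerprints(library, cache_path, counting_fingerprinter)
    second = load_or_build_fingerprints(library, cache_path, counting_fingerprinter)

print(len(calls))  # prints: 3 (the second load came from the cache file)
```

Note that this sketch never invalidates the cache when the molecule set changes, and it still compares the query against every entry. Handling both of those properly is part of what the cartridge already does for you.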

Tim

On 26/02/2020 08:46, Deepti Gupta wrote:
Hi Tim,

Thank you!

I'll be more detailed in my post, sorry about that. As this was a PoC, I had a Spark cluster on Google Cloud with 2 worker nodes with 4 vCPUs, a disk size of 500 GB and 15 GB of memory. I timed the response against 2 million data points, each consisting of a ChEMBL ID and a SMILES structure.

Substructure search - 2 mins
Similarity search - 43 mins

The PostgreSQL DB was installed on a VM with 4 vCPUs, a disk size of 500 GB and 15 GB of memory. I set shared_buffers = 2048MB in the postgresql.conf file.

Substructure search - within 5 secs
Similarity search - within 3 secs

I tried to store the converted molecules and fingerprints in a file to get better performance from the PySpark program, but was not able to do so.

Regards,
DA

On Wednesday, February 26, 2020, 12:57:43 AM GMT+5:30, Tim Dudgeon <tdudgeon...@gmail.com> wrote:


I think you need to explain what benchmarks you are running and what is really meant by "faster". And what hardware: for Spark, how many nodes and how big; for PostgreSQL, what size server and what settings, especially shared_buffers.

A very obvious critique of what you reported is that what you describe as "running in Python" includes generating the fingerprints for each molecule on the fly, whereas for "the cartridge" these are already calculated, so will obviously be much faster (as the fingerprint generation dominates the compute).

Tim

On 25/02/2020 11:14, Deepti Gupta via Rdkit-discuss wrote:
Hi Gurus,

I'm absolutely new to the cheminformatics domain. I've been assigned a PoC in which I have to compare RDKit in Python and RDKit on PostgreSQL. I've installed both and am trying some hands-on exercises to understand the differences. What I've understood is that structure searches are slower in Python (Spark cluster) than in the PostgreSQL database. Please correct me if I'm wrong, as I'm a newbie in this and may be saying something silly.

The similarity search using the functions below (example) -
Python methods -

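from rdkit import Chem, DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
fps1 = FingerprintMols.FingerprintMol(Chem.MolFromSmiles(smile_structure, sanitize=False))  # fps2 is built the same way for the second molecule
similarity = DataStructs.TanimotoSimilarity(fps1, fps2)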

takes too long (45 minutes) for a file of 2 million structures, while the same thing is very quick (seconds) on PostgreSQL.
Database functions -

select count(*)
from (
    select modality_id, m,
           tanimoto_sml(morganbv_fp(mol_from_smiles('CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O'::cstring)), mfp2) as similarity
    from fingerprints
    join mols using (modality_id)
) as fps
where similarity between 0.45 and 0.50;

Does this mean that for production workloads one must always use a database cartridge, such as RDKit or BINGO?

Regards,
DA


_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss