As I mentioned previously, the big difference arises because in Python
you are iterating through the molecules, calculating the fingerprints
and then comparing them, whereas in the PostgreSQL cartridge the
fingerprints are already generated and indexed, so the search is mostly
a matter of querying the index, which is very fast.
If you are repeatedly running queries against the same set of molecules
then the cartridge will be the way to go. Doing it procedurally from
Python only really makes sense if you have a relatively small dataset
and/or if the molecules you are searching are different every time.
In principle you should be able to cache the fingerprints in Python to
avoid recalculating them, but you would effectively be reimplementing
logic that is already present in the cartridge, where it will be much
more efficient.
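To make the caching idea concrete, here is a minimal sketch of the
"compute once, search many times" pattern. It is an illustration under
stated assumptions, not a recommended implementation: toy_fingerprint,
load_or_build_cache and search are hypothetical names, and the toy
fingerprint (adjacent character pairs of the SMILES string) is only a
stand-in so the example runs without RDKit. In real use you would pass
an RDKit fingerprint function as fp_func and compare with
DataStructs.TanimotoSimilarity instead.

```python
import pickle
from pathlib import Path


def toy_fingerprint(smiles):
    """Hypothetical stand-in: the set of adjacent character pairs in a SMILES.

    Not chemically meaningful; swap in a real RDKit fingerprint in practice.
    """
    return frozenset(smiles[i:i + 2] for i in range(len(smiles) - 1))


def tanimoto(a, b):
    """Tanimoto similarity of two feature sets: |A & B| / |A | B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def load_or_build_cache(smiles_list, cache_path, fp_func=toy_fingerprint):
    """Return {smiles: fingerprint}, computing it only if no cache file exists."""
    path = Path(cache_path)
    if path.exists():
        # Fingerprints were computed on an earlier run: just load them.
        with path.open("rb") as fh:
            return pickle.load(fh)
    # First run: pay the fingerprint-generation cost once and store the result.
    cache = {smi: fp_func(smi) for smi in smiles_list}
    with path.open("wb") as fh:
        pickle.dump(cache, fh)
    return cache


def search(query_smiles, cache, low, high, fp_func=toy_fingerprint):
    """Return (smiles, similarity) pairs whose similarity lies in [low, high]."""
    query_fp = fp_func(query_smiles)
    hits = []
    for smi, fp in cache.items():
        sim = tanimoto(query_fp, fp)
        if low <= sim <= high:
            hits.append((smi, sim))
    return hits
```

With this, only the first call to load_or_build_cache generates
fingerprints; later runs just load the file and compare. Note that the
sketch never invalidates the cache, so the file has to be deleted if the
molecule set changes, and every query is still a linear scan with no
index, which is the part the cartridge does for you.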
Tim
On 26/02/2020 08:46, Deepti Gupta wrote:
Hi Tim,
Thank you!
I'll be more detailed in my post, sorry about that. As this was a PoC,
I had a Spark cluster with 2 worker nodes with 4 vCPUs, 500GB disk and
15GB memory on Google Cloud. I timed the response against 2 million
data points consisting of ChEMBL IDs and SMILES structures.
Substructure search - 2 mins
Similarity search - 43 mins
The PostgreSQL DB was installed on a VM with 4 vCPUs, a 500GB disk and
15GB memory. shared_buffers was set to 2048MB in the postgresql.conf
file.
Substructure search - within 5 secs
Similarity search - within 3 secs
I tried to store the converted molecules and fingerprints in a file to
get better performance from the PySpark program, but was not able to
do so.
Regards,
DA
On Wednesday, February 26, 2020, 12:57:43 AM GMT+5:30, Tim Dudgeon
<tdudgeon...@gmail.com> wrote:
I think you need to explain what benchmarks you are running and what
is really meant by "faster".
And what hardware (for Spark how many nodes, how big; for PostgreSQL
what size server, what settings esp. the shared_buffers setting).
A very obvious critique of what you reported is that what you describe
as "running in Python" includes generating the fingerprints for each
molecule on the fly, whereas for "the cartridge" these are already
calculated, so will obviously be much faster (as the fingerprint
generation dominates the compute).
Tim
On 25/02/2020 11:14, Deepti Gupta via Rdkit-discuss wrote:
Hi Gurus,
I'm absolutely new to the cheminformatics domain. I've been assigned a
PoC where I have to compare RDKit in Python and RDKit on PostgreSQL.
I've installed both and am trying some hands-on exercises to
understand the differences. What I've understood is that structure
searches are slower in Python (Spark cluster) than in the PostgreSQL
database. Please correct me if I'm wrong, as I'm a newbie in this and
may be saying something silly.
The similarity search using the below functions (example) -
Python methods -
fps = FingerprintMols.FingerprintMol(
    Chem.MolFromSmiles(smile_structure, sanitize=False))
similarity = DataStructs.TanimotoSimilarity(fps1, fps2)
takes too long (45 minutes) for a file of 2 million records, while the
same thing is very quick (seconds) on PostgreSQL.
Database functions -
select count(*) from (
  select modality_id, m,
         tanimoto_sml(morganbv_fp(mol_from_smiles('CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O'::cstring)), mfp2) as similarity
  from fingerprints join mols using (modality_id)
) as fps
where similarity between 0.45 and 0.50;
Does this mean that for production workloads one must always use a
database cartridge, such as RDKit, BINGO, etc.?
Regards,
DA
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
<mailto:Rdkit-discuss@lists.sourceforge.net>
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss