I think you need to explain which benchmarks you are running and what
you really mean by "faster".
You should also describe the hardware (for Spark: how many nodes and how
big; for PostgreSQL: what size server and what settings, especially
shared_buffers).
A very obvious critique of what you reported is that what you describe
as "running in Python" includes generating the fingerprint for each
molecule on the fly, whereas for "the cartridge" the fingerprints are
already calculated and stored. The cartridge will obviously be much
faster, because fingerprint generation dominates the compute.
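
To make the comparison fairer on the Python side, the fingerprints can be
generated once and reused for every query, which is roughly what the
fingerprints table does in the database. The following is only a minimal
sketch under assumptions: the helper names (fingerprint, build_index,
search) and the four-molecule library are hypothetical, and the Morgan
radius of 2 with 2048 bits is my choice and may not match what your
cartridge is configured to use.

```python
import time

from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Assumption: Morgan fingerprints, radius 2, 2048 bits. The cartridge's
# fingerprint size is configurable and may differ from this.
_generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)


def fingerprint(smiles):
    """Return a Morgan bit vector for a SMILES string, or None if it fails to parse."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return _generator.GetFingerprint(mol)


def build_index(smiles_list):
    """Expensive step, done once: parse every molecule and store its fingerprint."""
    index = []
    for smi in smiles_list:
        fp = fingerprint(smi)
        if fp is not None:
            index.append((smi, fp))
    return index


def search(query_smiles, index, low, high):
    """Cheap step, done per query: compare one query fingerprint to the stored ones."""
    query_fp = fingerprint(query_smiles)
    if query_fp is None:
        raise ValueError("could not parse query SMILES")
    sims = DataStructs.BulkTanimotoSimilarity(query_fp, [fp for _, fp in index])
    return [(smi, sim) for (smi, _), sim in zip(index, sims) if low <= sim <= high]


# Hypothetical stand-in for the 2 million structure file.
library = ["CCO", "CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O", "c1ccccc1", "CC(=O)O"]

t0 = time.perf_counter()
index = build_index(library)
t1 = time.perf_counter()
hits = search("CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O", index, 0.45, 0.50)
t2 = time.perf_counter()

print(f"index build: {t1 - t0:.4f} s, search: {t2 - t1:.4f} s, hits: {len(hits)}")
```

Timing the two steps separately shows how much of the 45 minutes is
fingerprint generation rather than similarity calculation. Only the
search time is comparable to the SQL query against the precomputed
mfp2 column.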
Tim
On 25/02/2020 11:14, Deepti Gupta via Rdkit-discuss wrote:
Hi Gurus,
I'm completely new to the cheminformatics domain. I've been assigned a
PoC in which I have to compare RDKit in Python with RDKit on PostgreSQL.
I've installed both and am trying some hands-on exercises to
understand the differences. What I've understood so far is that
structure searches are slower in Python (Spark cluster) than in the
PostgreSQL database. Please correct me if I'm wrong, as I'm a newbie in
this and may be saying something silly.
The similarity search using the below functions (example) -
Python methods -
    fps = FingerprintMols.FingerprintMol(Chem.MolFromSmiles(smile_structure, sanitize=False))
    similarity = DataStructs.TanimotoSimilarity(fps1,fps2)
takes too long (45 minutes) for a 2 million file while the same thing
is very quick (in seconds) on PostgreSQL
Database functions -
    select count(*) from (
        select modality_id, m,
               tanimoto_sml(morganbv_fp(mol_from_smiles('CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O'::cstring)), mfp2) as similarity
        from fingerprints join mols using (modality_id)
    ) as fps
    where similarity between 0.45 and 0.50;
Does this mean that production workloads should always use a database
cartridge, such as RDKit, BINGO, etc.?
Regards,
DA
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss