I think you need to explain which benchmarks you are running and what "faster" really means. You should also describe the hardware: for Spark, how many nodes and how big; for PostgreSQL, what size server and what settings, especially shared_buffers.

An obvious critique of what you reported is that "running in Python" includes generating the fingerprint for each molecule on the fly, whereas for "the cartridge" the fingerprints are already calculated. The cartridge will obviously be much faster, because fingerprint generation dominates the compute.
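To make the comparison fairer, the Python side could generate the library fingerprints once and reuse them for every query, which is roughly what the cartridge does with its stored fingerprint column. The following is only a dependency-free sketch of that pattern, not RDKit code: `toy_fingerprint`, `tanimoto`, `search` and the three-molecule `library` are hypothetical names and data invented for illustration, and the trigram hashing is a stand-in for a real fingerprinter.

```python
import zlib


def toy_fingerprint(smiles, n_bits=2048):
    """Hypothetical stand-in for a real fingerprinter.

    Hashes every character trigram of the SMILES string to one bit of an
    integer that is used as a bit set. Not chemically meaningful.
    """
    bits = 0
    for i in range(max(len(smiles) - 2, 1)):
        bits |= 1 << (zlib.crc32(smiles[i:i + 3].encode()) % n_bits)
    return bits


def tanimoto(a, b):
    """Tanimoto coefficient |a AND b| / |a OR b| of two integer bit sets."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


# Step 1 (expensive, done once): fingerprint the whole library up front.
library = ["CCO", "CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O", "c1ccccc1"]
library_fps = [toy_fingerprint(s) for s in library]


def search(query_smiles, lo, hi):
    """Step 2 (cheap, per query): compare against the stored fingerprints."""
    query_fp = toy_fingerprint(query_smiles)
    hits = []
    for smiles, fp in zip(library, library_fps):
        sim = tanimoto(query_fp, fp)
        if lo <= sim <= hi:
            hits.append((smiles, sim))
    return hits


hits = search("CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O", 0.45, 1.0)
```

With RDKit itself the same split would apply: build the fingerprints for the library once, keep them, and time only the comparison step (for example with `DataStructs.TanimotoSimilarity` or `DataStructs.BulkTanimotoSimilarity`) when comparing against the cartridge.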


On 25/02/2020 11:14, Deepti Gupta via Rdkit-discuss wrote:
Hi Gurus,

I'm absolutely new to the cheminformatics domain. I've been assigned a PoC in which I have to compare RDKit in Python with RDKit on PostgreSQL. I've installed both and am trying some hands-on exercises to understand the differences. My understanding is that structure searches are slower in Python (Spark cluster) than in the PostgreSQL database. Please correct me if I'm wrong, as I'm a newbie in this and may be saying something silly.

The similarity search using the below functions (example) -
Python methods -

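from rdkit import Chem, DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
fps1 = FingerprintMols.FingerprintMol(Chem.MolFromSmiles(smile_structure, sanitize=False))  # fps2 is built the same way
similarity = DataStructs.TanimotoSimilarity(fps1, fps2)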

takes too long (45 minutes) for a file of 2 million entries, while the same search is very quick (seconds) on PostgreSQL.
Database functions -

select count(*)
from (
  select modality_id, m,
         tanimoto_sml(morganbv_fp(mol_from_smiles('CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O'::cstring)), mfp2) as similarity
  from fingerprints join mols using (modality_id)
) as fps
where similarity between 0.45 and 0.50;

Does this mean that production workloads must always use a database cartridge (RDKit, BINGO, etc.)?


Rdkit-discuss mailing list