Hi Gurus,
I'm completely new to the cheminformatics domain. I've been assigned a PoC in
which I have to compare RDKit in Python with RDKit on PostgreSQL. I've installed
both and am trying some hands-on exercises to understand the differences. What
I've understood so far is that structure searches are slower in Python (on a
Spark cluster) than in the PostgreSQL database. Please correct me if I'm wrong;
I'm a newbie at this and may be saying something silly.
The similarity search using the functions below (example), with Python methods:

    fps = FingerprintMols.FingerprintMol(Chem.MolFromSmiles(smile_structure, sanitize=False))
    similarity = DataStructs.TanimotoSimilarity(fps1, fps2)

takes too long (45 minutes) for a file of 2 million entries, while the same
thing is very quick (seconds) with the PostgreSQL database functions:

    select count(*)
    from (
        select modality_id, m,
               tanimoto_sml(
                   morganbv_fp(mol_from_smiles('CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O'::cstring)),
                   mfp2
               ) as similarity
        from fingerprints
        join mols using (modality_id)
    ) as fps
    where similarity between 0.45 and 0.50;

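For reference, both sides compute the Tanimoto coefficient: the number of bits set in both fingerprints divided by the number of bits set in either. The sketch below shows that calculation and the inclusive 0.45 to 0.50 range filter in plain Python. It is not RDKit code: fingerprints are modelled as sets of on-bit indices, the names `tanimoto` and `count_in_range` and the toy data are purely illustrative, and returning 0.0 for two empty fingerprints is a choice made for this sketch only.

```python
from typing import FrozenSet, Iterable


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        # Assumption for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return len(a & b) / union


def count_in_range(query: FrozenSet[int],
                   library: Iterable[FrozenSet[int]],
                   low: float,
                   high: float) -> int:
    """Count library fingerprints whose similarity to the query lies in
    [low, high], inclusive on both ends like SQL BETWEEN."""
    return sum(1 for fp in library if low <= tanimoto(query, fp) <= high)


# Toy data (hypothetical, not real molecular fingerprints).
query = frozenset({1, 2, 3, 4})
library = [
    frozenset({1, 2, 3, 4}),        # 4 shared / 4 total = 1.0
    frozenset({1, 2, 5, 6}),        # 2 shared / 6 total, about 0.33
    frozenset({1, 2, 3, 5, 6, 7}),  # 3 shared / 7 total, about 0.43
    frozenset({1, 2, 3, 5, 6}),     # 3 shared / 6 total = 0.5
]

print(count_in_range(query, library, 0.45, 0.50))  # prints 1
```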
Does this timing difference mean that production workloads must always use a
database cartridge (such as RDKit, BINGO, etc.)?
Regards,
DA
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss