Dear Rdkiters,

I am interested in running substructure searches of many thousands of fragments against many thousands of structures, as quickly as possible and with reasonable accuracy (> 0.95).
I read Greg's excellent post on the subject:
http://rdkit.blogspot.com/2019/07/a-couple-of-substructure-search-topics.html

I have been using the RDKit pattern fingerprint as a pre-filter: any fragment that has no chance of matching the larger structure is discarded before the slower but more accurate molecular graph search, which saves a lot of time.

However, I realized that the pattern fingerprint approach only works well when both the fragment and the structure are parsed from SMILES:

    from rdkit import Chem

    def frag_is_a_substructure_of_structure_via_pfp(frag, smiles):
        # Both the fragment and the structure are parsed as SMILES.
        pfp_frag = Chem.PatternFingerprint(Chem.MolFromSmiles(frag))
        pfp_structure = Chem.PatternFingerprint(Chem.MolFromSmiles(smiles))
        frag_bits = set(pfp_frag.GetOnBits())
        structure_bits = set(pfp_structure.GetOnBits())
        # A match is only possible if every fragment bit is also set in the structure.
        return frag_bits.issubset(structure_bits)

Unfortunately, some of my fragments are SMARTS that are not valid SMILES. Parsing them with Chem.MolFromSmarts(smarts) gives really poor results: many false positives, and therefore poor specificity. Interestingly, there are no false negatives, so the sensitivity is 1!

    from rdkit import Chem

    def frag_is_a_substructure_of_structure_via_pfp(frag, smiles):
        # The fragment is parsed as SMARTS, the structure as SMILES.
        pfp_frag = Chem.PatternFingerprint(Chem.MolFromSmarts(frag))
        pfp_structure = Chem.PatternFingerprint(Chem.MolFromSmiles(smiles))
        frag_bits = set(pfp_frag.GetOnBits())
        structure_bits = set(pfp_structure.GetOnBits())
        # A match is only possible if every fragment bit is also set in the structure.
        return frag_bits.issubset(structure_bits)

Is there a way to use the pattern fingerprint (or another method) for substructure searches that works regardless of whether the fragments are SMILES or SMARTS? If not, is mol_struct.HasSubstructMatch(mol_frag) the only option I am left with?

Many thanks,
Alexis
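For reference, the screen-then-verify workflow described above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the function `screen_then_verify` and its parameter names are hypothetical, and the demo uses toy string stand-ins (character sets as "fingerprints", substring tests as the "exact match") so that it runs without RDKit. With RDKit, the bit functions would presumably wrap Chem.PatternFingerprint(...).GetOnBits() and the exact check would wrap mol_struct.HasSubstructMatch(mol_frag).

```python
from typing import Callable, Iterable, List, Set, Tuple, TypeVar

Q = TypeVar("Q")  # query (fragment) type
T = TypeVar("T")  # target (structure) type


def screen_then_verify(
    fragments: Iterable[Q],
    structures: Iterable[T],
    frag_bits: Callable[[Q], Set[int]],
    struct_bits: Callable[[T], Set[int]],
    exact_match: Callable[[T, Q], bool],
) -> List[Tuple[int, int]]:
    """Return (structure_index, fragment_index) pairs confirmed by exact_match."""
    frag_list = list(fragments)
    # Compute each fragment fingerprint once, not once per structure.
    frag_fps = [frag_bits(f) for f in frag_list]
    hits = []
    for i, struct in enumerate(structures):
        s_bits = struct_bits(struct)
        for j, (frag, f_bits) in enumerate(zip(frag_list, frag_fps)):
            # Cheap screen: a fragment bit missing from the structure rules the pair out.
            if not f_bits.issubset(s_bits):
                continue
            # Expensive check, run only on survivors; it removes false positives.
            if exact_match(struct, frag):
                hits.append((i, j))
    return hits


# Toy stand-ins (assumptions for illustration only, not chemistry-aware):
def char_bits(text: str) -> Set[int]:
    """Toy 'fingerprint': the set of character codes present in the string."""
    return {ord(c) for c in text}


def substring_match(struct: str, frag: str) -> bool:
    """Toy 'exact match': plain substring test."""
    return frag in struct


structures = ["CCOC(=O)C", "c1ccccc1O"]
fragments = ["C(=O)", "OC", "cO", "N"]
hits = screen_then_verify(fragments, structures, char_bits, char_bits, substring_match)
print(hits)  # [(0, 0), (0, 1)]
```

Note that "cO" passes the toy screen for the second structure (both characters are present) but fails the exact check. That is the same pattern as in the post: a permissive screen costs extra verification work but does not lose true matches as long as it produces no false negatives.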
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss