Hi Alexis,

Knowing what you want to achieve, I would approach the problem from the other direction. Instead of matching your many fragments against your input structure as it is, I would apply to the input structure the same transformation(s) you apply to your fragments.

I know that you replace all non-hydrogen atoms by "any" atoms, and all single/double/triple bonds by "any" bonds. You could instead store a list of fragments in which all non-hydrogen atoms are replaced by carbons and all bonds by single bonds, then calculate and store the fingerprints of these fragments. Finally, you apply the same transformation to your input structure, calculate its fingerprint, and do your substructure search.
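To make the idea concrete, here is a minimal sketch of how it could look. The helper names (generic_skeleton, skeleton_bits, is_substructure) are made up for illustration and are not RDKit functions. The sketch assumes that explicit hydrogen atoms can simply be dropped, and that the pattern fingerprint of the carbon/single-bond skeleton is only a prefilter: anything that passes is still confirmed with HasSubstructMatch on the original molecules.

```python
from rdkit import Chem


def generic_skeleton(mol):
    """Return a copy of mol with every heavy atom as carbon and every bond single.

    Hypothetical helper: explicit hydrogens (atomic number 1) are skipped.
    """
    skeleton = Chem.RWMol()
    index_map = {}  # original atom index -> index in the skeleton
    for atom in mol.GetAtoms():
        if atom.GetAtomicNum() == 1:
            continue
        index_map[atom.GetIdx()] = skeleton.AddAtom(Chem.Atom(6))
    for bond in mol.GetBonds():
        begin = index_map.get(bond.GetBeginAtomIdx())
        end = index_map.get(bond.GetEndAtomIdx())
        if begin is not None and end is not None:
            skeleton.AddBond(begin, end, Chem.BondType.SINGLE)
    result = skeleton.GetMol()
    # No full sanitization: a carbon may end up with more than four
    # neighbours, so only compute what the fingerprint needs.
    result.UpdatePropertyCache(strict=False)
    Chem.FastFindRings(result)
    return result


def skeleton_bits(mol):
    """On-bits of the pattern fingerprint of the generic skeleton."""
    return set(Chem.PatternFingerprint(generic_skeleton(mol)).GetOnBits())


def is_substructure(frag_smarts, smiles):
    """Prefilter on skeleton fingerprints, then confirm on the real graphs."""
    frag = Chem.MolFromSmarts(frag_smarts)
    structure = Chem.MolFromSmiles(smiles)
    if frag is None or structure is None:
        raise ValueError("could not parse fragment or structure")
    if not skeleton_bits(frag) <= skeleton_bits(structure):
        return False  # screened out without a graph match
    return structure.HasSubstructMatch(frag)
```

In a real run over thousands of fragments you would compute skeleton_bits once per fragment and once per structure and keep them, rather than recomputing them for every pair as this sketch does. Unusual SMARTS (recursive queries, for instance) may need extra care, so it is worth checking the prefilter against HasSubstructMatch on a sample of your own data first.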
Best,
Grégori

On Monday, February 10, 2020 16:08 CET, Alexis Parenty <alexis.parenty.h...@gmail.com> wrote:

Dear Rdkiters,

I am interested in doing substructure searches between many thousands of structures and many thousands of fragments, as quickly as possible, with reasonable accuracy (> 0.95)...

I did read Greg's excellent post on that subject:
http://rdkit.blogspot.com/2019/07/a-couple-of-substructure-search-topics.html

I was using the RDKit pattern fingerprint approach to filter out any fragments that have no chance of matching the bigger structure, before running the slower and more accurate molecular graph match, which saves a lot of time. However, I realized that this pattern fingerprint approach only works well when comparing SMILES with SMILES:

    from rdkit import Chem

    def frag_is_a_substructure_of_structure_via_pfp(frag, smiles):
        # Both the fragment and the structure are parsed as SMILES.
        pfp_frag = Chem.PatternFingerprint(Chem.MolFromSmiles(frag))
        pfp_structure = Chem.PatternFingerprint(Chem.MolFromSmiles(smiles))
        frag_bits = set(pfp_frag.GetOnBits())
        structure_bits = set(pfp_structure.GetOnBits())
        # The fragment can only match if all of its bits are set in the structure.
        return frag_bits.issubset(structure_bits)

Unfortunately, some of my fragments are SMARTS that are not valid SMILES. Using Chem.MolFromSmarts(smarts) gives really poor results (many false positives, leading to poor specificity). Interestingly, there are no false negatives, giving a sensitivity of 1!

    def frag_is_a_substructure_of_structure_via_pfp(frag, smiles):
        # Same screen, but the fragment is parsed as SMARTS.
        pfp_frag = Chem.PatternFingerprint(Chem.MolFromSmarts(frag))
        pfp_structure = Chem.PatternFingerprint(Chem.MolFromSmiles(smiles))
        frag_bits = set(pfp_frag.GetOnBits())
        structure_bits = set(pfp_structure.GetOnBits())
        return frag_bits.issubset(structure_bits)

Is there a way to use the pattern fingerprint (or another method) for substructure searches independently of the SMILES/SMARTS format of the fragments? If not, is mol_struct.HasSubstructMatch(mol_frag) the only option I am left with?

Many thanks,
Alexis
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss