Hi Alexis,

if you go down that route and calculate artificial skeletons, you could also go all the way and use an algorithm like HierS [1] or the scaffold tree [2] to perform a recursive fragmentation of your queries and molecules into their various rings and ring systems. If a query contains a ring system that is not present in the molecule, the query cannot be a substructure of that molecule.
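
The ring-system test above could be sketched as a plain dictionary lookup. This is only a minimal illustration, not a tested recipe: it assumes that every molecule and every query has already been reduced to a set of skeleton SMILES strings (one per ring and ring system) by some HierS-like fragmentation that is not shown here. The function names, molecule ids and example strings are all made up for the example.

```python
from collections import defaultdict


def build_ring_index(mol_rings):
    """Map each ring-system key to the set of molecule ids that contain it."""
    index = defaultdict(set)
    for mol_id, keys in mol_rings.items():
        for key in keys:
            index[key].add(mol_id)
    return index


def candidate_molecules(query_keys, index, all_ids):
    """Return the ids of molecules that contain every ring system of the query."""
    candidates = set(all_ids)
    for key in query_keys:
        # A key that was never indexed yields an empty set, so nothing survives.
        candidates &= index.get(key, set())
        if not candidates:
            break
    return candidates


# Hypothetical, hand-written keys (assumption: skeleton SMILES per ring / ring system).
# A recursive fragmentation lists the fused system and the single ring it contains.
mol_rings = {
    "mol_a": {"C1CCCCC1"},
    "mol_b": {"C1CCCCC1", "C1CCC2CCCCC2C1"},
    "mol_c": set(),  # acyclic molecule
}

index = build_ring_index(mol_rings)
hits = candidate_molecules({"C1CCC2CCCCC2C1"}, index, mol_rings)
print(sorted(hits))  # ['mol_b']
```

Molecules that survive this lookup are only candidates; they would still need the full mol_struct.HasSubstructMatch(mol_frag) check. A query without any rings passes every molecule through, so the lookup only helps for ring-containing queries.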
This kind of ring-system check is something you should be able to do with basic
string matching or lookups in dictionaries / hashes instead of fingerprint
calculations and comparisons. I am not sure whether that is actually faster,
but it might be worth a try.

Hope this helps,
Nils

[1] https://pubs.acs.org/doi/abs/10.1021/jm049032d
[2] https://pubs.acs.org/doi/10.1021/ci600338x

On 10.02.2020 at 21:01, Alexis Parenty wrote:
> Hi Maciek, thanks for your response. I did try that function too, but it
> also takes SMILES only (not SMARTS). I think Gregori's solution is
> very interesting: I am going to transform all SMILES and SMARTS into
> their single-bonded, carbon-based skeletons and will store the pattern
> fingerprints of those skeletons in a dictionary, using the SMARTS or the
> SMILES as the key. Then I will use your proposed function to match the
> sub-skeletons against the skeletons, and will only run the expensive
> molecular graph substructure search on the keys whose dictionary values
> were identified as potential substructures of others. Thanks Gregori!
>
> Any other good tips?
>
> Cheers,
>
> Alexis
>
> On Mon, 10 Feb 2020 at 20:33, Maciek Wójcikowski <mac...@wojcikowski.pl>
> wrote:
>
>     Alexis,
>
>     I believe that `DataStructs.AllProbeBitsMatch(query_fp,mol_fp)` is
>     the function you are looking for here.
>     More advanced usage and code snippets can be found in the RDKit blog
>     post that Greg has put together here:
>     https://rdkit.blogspot.com/2013/11/fingerprint-based-substructure.html
>
>     Best,
>     Maciek
>
>     ----
>     Best regards,
>     Maciek Wójcikowski
>     mac...@wojcikowski.pl
>
>     On Mon, 10 Feb 2020 at 16:10, Alexis Parenty
>     <alexis.parenty.h...@gmail.com> wrote:
>
>         Dear Rdkiters,
>
>         I am interested in doing substructure searches between many
>         thousands of structures and many thousands of fragments, as quickly
>         as possible, with reasonable accuracy (> 0.95)...
>
>         I did read Greg's excellent post on that subject:
>
>         http://rdkit.blogspot.com/2019/07/a-couple-of-substructure-search-topics.html
>
>         I was using the RDKit pattern fingerprint approach to filter out
>         any fragments that have no chance of matching the bigger
>         structure before running the slower and more accurate molecular
>         graph approach, which saves a lot of time.
>
>         However, I realized that this pattern fingerprint approach
>         only works well if we compare SMILES with SMILES:
>
>         from rdkit import Chem
>
>         def frag_is_a_substructure_of_structure_via_pfp(frag, smiles):
>             # Pattern fingerprints of the fragment and of the full structure
>             pfp_frag = Chem.PatternFingerprint(Chem.MolFromSmiles(frag))
>             pfp_structure = Chem.PatternFingerprint(Chem.MolFromSmiles(smiles))
>
>             frag_bits = set(pfp_frag.GetOnBits())
>             structure_bits = set(pfp_structure.GetOnBits())
>
>             # A match is only possible if every fragment bit is set in the structure
>             return frag_bits.issubset(structure_bits)
>
>         Unfortunately, some of my fragments are SMARTS that are not
>         valid SMILES. Using Chem.MolFromSmarts(smarts) gives really poor
>         results (many false positives, leading to poor specificity).
>         Interestingly, there are no false negatives, leading to a
>         sensitivity of 1!
>
>         from rdkit import Chem
>
>         def frag_is_a_substructure_of_structure_via_pfp(frag, smiles):
>             # Here the fragment is parsed as SMARTS instead of SMILES
>             pfp_frag = Chem.PatternFingerprint(Chem.MolFromSmarts(frag))
>             pfp_structure = Chem.PatternFingerprint(Chem.MolFromSmiles(smiles))
>
>             frag_bits = set(pfp_frag.GetOnBits())
>             structure_bits = set(pfp_structure.GetOnBits())
>
>             # A match is only possible if every fragment bit is set in the structure
>             return frag_bits.issubset(structure_bits)
>
>         Is there a way to use pattern fingerprints (or another method) for
>         substructure searches independently of the SMILES/SMARTS format
>         of the fragments? If not, is
>         mol_struct.HasSubstructMatch(mol_frag) the only option I am left with?
>
>         Many thanks,
>
>         Alexis
>
>         _______________________________________________
>         Rdkit-discuss mailing list
>         Rdkit-discuss@lists.sourceforge.net
>         https://lists.sourceforge.net/lists/listinfo/rdkit-discuss