Dear RDKit experts,

I'm using RDKit to search for substructures in molecular structures. I used Chem.MolFromSmiles() to build the pattern for my substructure search and was wondering why the substructure was not found in some structures. On Chemistry.StackExchange I got a helpful hint, and now, I guess, I better understand the difference between SMILES and SMARTS.
Here is an example. I guess I cannot attach images here, so for a visualization please see https://chemistry.stackexchange.com/q/128440/81125

The first SMILES is used as the search pattern for the other structures. The pattern is found in molecules 2 and 4, but not in 3 and 5.

Code:

### substructure search with RDKit
from rdkit import Chem

smiles_list = ['C12=CC=CC=C1C3=CC=CC4=C3C2=CC=C4',
               'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=CC=C6',
               'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C=C4',
               'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=C7CCCCC7=C6',
               'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=C7CC8=CC=CC=C8CC7=C6']

# the first structure is the search pattern
pattern = Chem.MolFromSmiles(smiles_list[0])

for idx, smiles in enumerate(smiles_list):
    m = Chem.MolFromSmiles(smiles)
    print("Structure {}: pattern found {}".format(idx+1, m.HasSubstructMatch(pattern)))
### end of code

Result:

Structure 1: pattern found True
Structure 2: pattern found True
Structure 3: pattern found False
Structure 4: pattern found True
Structure 5: pattern found False

The solution I have come up with so far is the following (see also https://chemistry.stackexchange.com/a/128453/81125):

Basically, if you convert the search SMILES 'C12=CC=CC=C1C3=CC=CC4=C3C2=CC=C4' to a mol and convert this mol back to SMILES via Chem.MolToSmiles(), you get c1ccc2c(c1)-c1cccc3cccc-2c13. If you create your search pattern from this string via Chem.MolFromSmarts(), you will still not find structures 3 and 5, probably because of the explicitly defined single bonds. However, if you replace - by ~ (the SMARTS symbol for "any bond"), you get smiles_1b: c1ccc2c(c1)~c1cccc3cccc~2c13. With this pattern, you will also find structures 3 and 5.
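A side note on the replacement of - by ~: a plain str.replace() touches every '-' in the string. In SMILES, a '-' inside square brackets is a negative charge (e.g. [O-]), not a bond, so a blanket replacement might corrupt charged atoms. The following is only a minimal sketch in plain Python (no RDKit needed) that replaces '-' outside of bracket atoms only. The function name relax_single_bonds is hypothetical, and the example with the carboxylate is an assumption for illustration, not one of the structures above.

```python
def relax_single_bonds(smiles: str) -> str:
    """Replace explicit single bonds '-' by the SMARTS any-bond symbol '~'.

    Characters inside [...] are left untouched, because a '-' there
    denotes a negative charge rather than a bond.
    (Hypothetical helper, not part of RDKit.)
    """
    out = []
    in_bracket = False
    for ch in smiles:
        if ch == '[':
            in_bracket = True
        elif ch == ']':
            in_bracket = False
        # only rewrite '-' when we are not inside a bracket atom
        if ch == '-' and not in_bracket:
            out.append('~')
        else:
            out.append(ch)
    return ''.join(out)


# the canonical SMILES of the search pattern from above
print(relax_single_bonds('c1ccc2c(c1)-c1cccc3cccc-2c13'))
# illustrative charged example (assumption, not from the structure list)
print(relax_single_bonds('[O-]C(=O)c1ccccc1-c1ccccc1'))
```

For the uncharged pattern used here, the result is the same as with .replace('-','~'); the difference would only show up for patterns containing charged bracket atoms.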
Code (I also added benzene as structure 6 to have a non-match):

### substructure search with RDKit
from rdkit import Chem

smiles_list = ['C12=CC=CC=C1C3=CC=CC4=C3C2=CC=C4',
               'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=CC=C6',
               'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C=C4',
               'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=C7CCCCC7=C6',
               'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=C7CC8=CC=CC=C8CC7=C6',
               'c1ccccc1']

def search_structure(pattern):
    # check every structure in smiles_list against the given pattern
    for idx, smiles in enumerate(smiles_list):
        m = Chem.MolFromSmiles(smiles)
        print("Structure {}: pattern found {}".format(idx+1, m.HasSubstructMatch(pattern)))

# pattern 1a: search SMILES parsed as a molecule
smiles_1a = smiles_list[0]
pattern_1a = Chem.MolFromSmiles(smiles_1a)

# pattern 1b: canonical SMILES with '-' replaced by '~', parsed as SMARTS
smiles_1b = Chem.MolToSmiles(pattern_1a).replace('-','~')    # replace bonds
pattern_1b = Chem.MolFromSmarts(smiles_1b)

print("\nSMILES 1a: {}".format(smiles_1a))
search_structure(pattern_1a)
print("\nSMILES 1b: {}".format(smiles_1b))
search_structure(pattern_1b)
### end of code

Result:

SMILES 1a: C12=CC=CC=C1C3=CC=CC4=C3C2=CC=C4
Structure 1: pattern found True
Structure 2: pattern found True
Structure 3: pattern found False
Structure 4: pattern found True
Structure 5: pattern found False
Structure 6: pattern found False

SMILES 1b: c1ccc2c(c1)~c1cccc3cccc~2c13
Structure 1: pattern found True
Structure 2: pattern found True
Structure 3: pattern found True
Structure 4: pattern found True
Structure 5: pattern found True
Structure 6: pattern found False

My question is now: is this the way to go, or could it lead to other surprises or unexpected results?

Please excuse me for asking the same question in two places on the web, but I guess this is the primary place to ask.

Best regards,
Theo.

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss