Pavel,

this is a bit hacky, but you can try the following:

```
from rdkit import Chem


def get_frag_smi(mol, frag_atoms):
    if len(frag_atoms) > 1:
        b2b = []   # bonds to break
        fsmi = ""  # fragment SMILES
        # collect every bond with at least one atom outside the fragment
        for b in mol.GetBonds():
            b_idx = b.GetBeginAtomIdx()
            e_idx = b.GetEndAtomIdx()
            if e_idx not in frag_atoms or b_idx not in frag_atoms:
                b2b.append(b.GetIdx())
        # break all bonds except those inside the fragment
        fmol = Chem.FragmentOnBonds(mol, b2b, addDummies=False)
        smis = Chem.MolToSmiles(fmol).split(".")
        # keep the only piece with more than one atom
        # (assumes the atoms in frag_atoms are connected to each other)
        while fsmi == "" and smis:
            smi = smis.pop(0)
            m = Chem.MolFromSmiles(smi, sanitize=False)
            if m.GetNumAtoms() > 1:
                fsmi = smi
    else:
        # one atom, no canonicalization needed
        fsmi = Chem.MolFragmentToSmiles(mol, frag_atoms)
    return fsmi
```

It is based on the observation/assumption that FragmentOnBonds() followed by MolToSmiles() canonicalizes the fragments cleanly.

> print(get_frag_smi(mol,[1,2,3,17]))
> print(get_frag_smi(mol,[9,10,11,12]))

prints `cN(c)O` twice.
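In case the bond-selection step is not obvious, here is a minimal sketch of it in plain Python, without RDKit. The helper name `bonds_to_break` and the toy 5-atom chain are made up for illustration; a bond is marked for breaking as soon as one of its two atoms lies outside the fragment.

```python
def bonds_to_break(bonds, frag_atoms):
    """Return indices of bonds with at least one end atom outside frag_atoms.

    bonds is a list of (begin_atom_idx, end_atom_idx) tuples; this is a
    hypothetical toy input, not an RDKit object.
    """
    frag = set(frag_atoms)  # set membership is faster than scanning a list
    return [i for i, (a, b) in enumerate(bonds) if a not in frag or b not in frag]


# toy chain 0-1-2-3-4; the fragment is the middle three atoms
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(bonds_to_break(chain, [1, 2, 3]))  # [0, 3]
```

Only bonds 1 and 2 survive here, so the middle three atoms stay together as one piece while atoms 0 and 4 end up as single-atom pieces. That is why the function above only has to look for the one piece with more than one atom.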
best wishes,
wim

On Thu, Mar 27, 2025 at 12:23 PM Pavel Polishchuk <pavel_polishc...@ukr.net> wrote:

> Hello,
>
> I encountered an issue with SMILES of fragments. Maybe someone can
> suggest a workaround.
> I attached the notebook, but will also reproduce some code here.
>
> We have a structure with two Ns. We take an N atom and its adjacent atoms
> to make a fragment SMILES and get different results, although the SMILES
> represent the same pattern (only the order of atoms is different). I guess
> this happens due to the canonicalization algorithm, which takes into
> account some additional information missing in the output SMILES (e.g.
> ring membership). For instance, if we break a saturated cycle (bond 8-9),
> we get identical SMILES output.
>
> mol = Chem.MolFromSmiles('CCn1c2cccc3CCn(c23)c2ccccc12')
>
> print(Chem.MolFragmentToSmiles(mol, [1,2,3,17], canonical=True))
> print(Chem.MolFragmentToSmiles(mol, [9,10,11,12], canonical=True))
>
> cN(C)c
> cN(c)C
>
> So, the question is how to work around this issue. We already have
> millions of such patterns, so it would work if we were able to
> canonicalize them. However, standard canonicalization does not work,
> because we have to disable sanitization during SMILES parsing. It returns
> the same output as the input SMILES. Any ideas are appreciated.
>
> print(Chem.MolToSmiles(Chem.MolFromSmiles('cN(C)c', sanitize=False)))
> print(Chem.MolToSmiles(Chem.MolFromSmiles('cN(c)C', sanitize=False)))
>
> cN(C)c
> cN(c)C
>
> This issue actually came from the code for identification of functional
> groups.
>
> Kind regards,
> Pavel
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss