Re: [Rdkit-discuss] canonical fragment SMILES

Pavel Polishchuk Fri, 28 Mar 2025 01:01:03 -0700

Thank you, Wim. It works. An even simpler solution may be to remove all atoms except the required ones. I had to guess :) However, this is a bug in recent RDKit versions. The function MolFragmentToSmiles works correctly in version 2023, but not in 2024.
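
A minimal sketch of the atom-removal idea mentioned above, for illustration only. The helper name frag_smiles_by_atom_removal is hypothetical and not part of RDKit. It assumes that writing SMILES for the stripped, unsanitized molecule is good enough for this kind of pattern; that has not been checked across RDKit versions or for arbitrary fragments.

```python
from rdkit import Chem


def frag_smiles_by_atom_removal(mol, frag_atoms):
    """Hypothetical helper: delete every atom outside frag_atoms, then write SMILES."""
    keep = set(frag_atoms)
    rwmol = Chem.RWMol(mol)
    # Remove from the highest index down so the remaining indices stay valid.
    for idx in sorted((a.GetIdx() for a in mol.GetAtoms()), reverse=True):
        if idx not in keep:
            rwmol.RemoveAtom(idx)
    # No sanitization here: isolated aromatic atoms would not kekulize.
    return Chem.MolToSmiles(rwmol)


mol = Chem.MolFromSmiles('CCn1c2cccc3CCn(c23)c2ccccc12')
smi_a = frag_smiles_by_atom_removal(mol, [1, 2, 3, 17])
smi_b = frag_smiles_by_atom_removal(mol, [9, 10, 11, 12])
print(smi_a)
print(smi_b)
```

For the two example fragments the stripped molecules have the same atoms and bonds in the same order, so the two printed strings should match. For fragments that arrive in a different atom order, the result depends on how well the canonical ranking separates the remaining atoms without sanitization.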


On 28/03/2025 00:10, Wim Dehaen wrote:

Pavel,
this is a bit hacky, but you can try the below:
```
from rdkit import Chem

def get_frag_smi(mol,frag_atoms):
    if len(frag_atoms) > 1:
        b2b = [] # bonds to break
        fsmi = "" #fragment smiles
        # get bonds outside of fragment
        for b in mol.GetBonds():
            b_idx = b.GetBeginAtomIdx()
            e_idx = b.GetEndAtomIdx()
            if e_idx not in frag_atoms\
            or b_idx not in frag_atoms:
                b2b.append(b.GetIdx())
        # break all bonds except those in fragments
        fmol = Chem.FragmentOnBonds(mol,b2b,addDummies=0)
        smis = Chem.MolToSmiles(fmol).split(".")
        # retain the only fragment with more than one atom in there
        while fsmi == "":
            smi = smis.pop(0)
            m = Chem.MolFromSmiles(smi,sanitize=False)
            if len(m.GetAtoms()) > 1:
                fsmi = smi
    else: #one atom, no canonicalize needed
        fsmi = Chem.MolFragmentToSmiles(mol, frag_atoms)
    return fsmi
```

it is based on the observation/assumption that FragmentOnBonds() and then MolToSmiles() canonicalizes the fragments cleanly.

> print(get_frag_smi(mol,[1,2,3,17]))
> print(get_frag_smi(mol,[9,10,11,12]))
prints `cN(c)O` twice.


best wishes,
wim

On Thu, Mar 27, 2025 at 12:23 PM Pavel Polishchuk<pavel_polishc...@ukr.net> wrote:


    Hello,

      I encountered an issue with SMILES of fragments. Maybe someone
    may suggest a workaround.
      I attached the notebook, but will also reproduce some code here.

      We have a structure with two N atoms. For each N we take the
    atom and its adjacent atoms to make a fragment SMILES, and we get
    different results, although both SMILES represent the same pattern
    (only the order of atoms differs). I guess this happens because
    the canonicalization algorithm takes into account some additional
    information that is missing from the output SMILES (e.g. ring
    membership). For instance, if we break the saturated cycle (bond
    8-9), we get identical SMILES output.

    mol = Chem.MolFromSmiles('CCn1c2cccc3CCn(c23)c2ccccc12')


    print(Chem.MolFragmentToSmiles(mol, [1,2,3,17], canonical=True))
    print(Chem.MolFragmentToSmiles(mol, [9,10,11,12], canonical=True))

    cN(C)c
    cN(c)C

      So, the question is how to work around this issue. We already
    have millions of such patterns, so it would be enough if we could
    canonicalize them. However, standard canonicalization does not
    work, because we have to disable sanitization during SMILES
    parsing. It returns the same output as the input SMILES. Any ideas
    are appreciated.

    print(Chem.MolToSmiles(Chem.MolFromSmiles('cN(C)c', sanitize=False)))
    print(Chem.MolToSmiles(Chem.MolFromSmiles('cN(c)C', sanitize=False)))

    cN(C)c
    cN(c)C

      This issue actually came from code for identifying functional
    groups.

    Kind regards,
    Pavel
    _______________________________________________
    Rdkit-discuss mailing list
    Rdkit-discuss@lists.sourceforge.net
    https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
