Re: [Rdkit-discuss] canonical fragment SMILES

Wim Dehaen Thu, 27 Mar 2025 16:13:44 -0700

Pavel,
This is a bit hacky, but you can try the following:
```
from rdkit import Chem

def get_frag_smi(mol, frag_atoms):
    if len(frag_atoms) > 1:
        frag = set(frag_atoms)
        b2b = []  # bonds to break
        # collect every bond with at least one end outside the fragment
        for b in mol.GetBonds():
            if b.GetBeginAtomIdx() not in frag \
            or b.GetEndAtomIdx() not in frag:
                b2b.append(b.GetIdx())
        # break all bonds except those inside the fragment
        fmol = Chem.FragmentOnBonds(mol, b2b, addDummies=False)
        # retain the first piece that has more than one atom
        for smi in Chem.MolToSmiles(fmol).split("."):
            m = Chem.MolFromSmiles(smi, sanitize=False)
            if m is not None and m.GetNumAtoms() > 1:
                return smi
        # the original loop would fail with an IndexError here
        raise ValueError("no pair of bonded atoms in frag_atoms")
    # one atom, no canonicalization needed
    return Chem.MolFragmentToSmiles(mol, frag_atoms)
```
It is based on the observation (really an assumption) that FragmentOnBonds()
followed by MolToSmiles() canonicalizes the fragments cleanly. With the
molecule from your example,
> print(get_frag_smi(mol,[1,2,3,17]))
> print(get_frag_smi(mol,[9,10,11,12]))
prints `cN(c)O` twice.
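
To see why only one multi-atom piece is left after fragmenting, here is a minimal pure-Python sketch of just the bond-selection step. It does not use RDKit: the bond list is written out by hand from the SMILES in the original question, assuming atoms are numbered in SMILES order, and `BONDS` and `split_bonds` are illustrative names rather than anything from the RDKit API.

```python
# Bonds of 'CCn1c2cccc3CCn(c23)c2ccccc12' as (begin, end) atom-index pairs.
# Assumption: written out by hand, with atoms numbered in SMILES order;
# this is illustrative data, not something produced by RDKit here.
BONDS = [
    (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8),
    (8, 9), (9, 10), (10, 11), (11, 3), (11, 7), (10, 12), (12, 13),
    (13, 14), (14, 15), (15, 16), (16, 17), (17, 2), (17, 12),
]


def split_bonds(bonds, frag_atoms):
    """Partition bonds into (kept, broken) for the given fragment atoms."""
    frag = set(frag_atoms)
    kept, broken = [], []
    for begin, end in bonds:
        # a bond survives only if both of its ends lie inside the fragment
        if begin in frag and end in frag:
            kept.append((begin, end))
        else:
            broken.append((begin, end))
    return kept, broken


for frag in ([1, 2, 3, 17], [9, 10, 11, 12]):
    kept, broken = split_bonds(BONDS, frag)
    print(frag, "keeps", kept, "and breaks", len(broken), "bonds")
```

In both cases the only bonds kept are the three around the nitrogen, so every atom outside the fragment ends up as an isolated single-atom piece, and the two fragments are the same star-shaped graph.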


best wishes,
wim

On Thu, Mar 27, 2025 at 12:23 PM Pavel Polishchuk <pavel_polishc...@ukr.net>
wrote:

> Hello,
>
>   I encountered an issue with the SMILES of fragments. Maybe someone can
> suggest a workaround.
>   I attached the notebook, but will also reproduce some code here.
>
>   We have a structure with two Ns. For each N we take the atom and its
> adjacent atoms to make a fragment SMILES, and we get different results even
> though the SMILES represent the same pattern (only the order of atoms
> differs). I guess this happens because the canonicalization algorithm takes
> into account some additional information that is missing from the output
> SMILES (e.g. ring membership). For instance, if we break the saturated ring
> (bond 8-9), we get identical SMILES output.
>
> mol = Chem.MolFromSmiles('CCn1c2cccc3CCn(c23)c2ccccc12')
>
>
> print(Chem.MolFragmentToSmiles(mol, [1,2,3,17], canonical=True))
> print(Chem.MolFragmentToSmiles(mol, [9,10,11,12], canonical=True))
>
> cN(C)c
> cN(c)C
>
>   So, the question is how to work around this issue. We already have
> millions of such patterns, so it would be enough if we could
> canonicalize them. However, standard canonicalization does not work,
> because we have to disable sanitization during SMILES parsing, and it then
> returns the same output as the input SMILES. Any ideas are appreciated.
>
> print(Chem.MolToSmiles(Chem.MolFromSmiles('cN(C)c', sanitize=False)))
> print(Chem.MolToSmiles(Chem.MolFromSmiles('cN(c)C', sanitize=False)))
>
> cN(C)c
> cN(c)C
>
>   This issue actually came from code for the identification of functional
> groups.
>
> Kind regards,
> Pavel
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
