Pavel,
This is a bit hacky, but you could try the following:
```
from rdkit import Chem


def get_frag_smi(mol, frag_atoms):
    """Return a SMILES for the fragment of mol spanned by frag_atoms."""
    fsmi = ""  # fragment SMILES
    if len(frag_atoms) > 1:
        frag_set = set(frag_atoms)
        b2b = []  # bonds to break
        # collect every bond with at least one end outside the fragment
        for b in mol.GetBonds():
            b_idx = b.GetBeginAtomIdx()
            e_idx = b.GetEndAtomIdx()
            if b_idx not in frag_set or e_idx not in frag_set:
                b2b.append(b.GetIdx())
        # break all bonds except those inside the fragment
        fmol = Chem.FragmentOnBonds(mol, b2b, addDummies=False)
        smis = Chem.MolToSmiles(fmol).split(".")
        # keep the first piece with more than one atom; this assumes the
        # atoms in frag_atoms are connected, so there is only one such piece
        # (if there is none, the empty string is returned)
        while fsmi == "" and smis:
            smi = smis.pop(0)
            m = Chem.MolFromSmiles(smi, sanitize=False)
            if m.GetNumAtoms() > 1:
                fsmi = smi
    else:  # one atom, no canonicalization needed
        fsmi = Chem.MolFragmentToSmiles(mol, frag_atoms)
    return fsmi
```
It is based on the observation/assumption that FragmentOnBonds() followed by
MolToSmiles() canonicalizes the fragments cleanly.
> print(get_frag_smi(mol,[1,2,3,17]))
> print(get_frag_smi(mol,[9,10,11,12]))
prints `cN(c)C` twice.
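
If the patterns are always one centre atom plus its direct neighbours, as in the two calls above, an order-independent key can also be built without going through SMILES at all. The sketch below is a separate idea, not part of the function above: `star_key` and its label convention ("C" aliphatic, "c" aromatic) are made up for illustration, and it ignores bond orders, charges, and anything beyond the first shell.

```python
def star_key(center, neighbors):
    """Order-independent key for a centre atom plus its direct neighbours.

    `center` and every entry of `neighbors` are plain text labels such as
    "N", "C" (aliphatic) or "c" (aromatic). This labelling is an assumption
    of the sketch; the caller has to produce the labels.
    """
    # Sorting the neighbour labels removes any dependence on the order in
    # which the atoms happened to be listed.
    return center + "(" + ",".join(sorted(neighbors)) + ")"


# The two nitrogen environments from the example, with the neighbours
# deliberately listed in different orders.
key_a = star_key("N", ["c", "C", "c"])
key_b = star_key("N", ["C", "c", "c"])
print(key_a)
print(key_b)
print(key_a == key_b)
```

With RDKit the labels could be derived from `atom.GetSymbol()` and `atom.GetIsAromatic()` for the centre atom and its `GetNeighbors()`; that wiring is left out here. For anything larger than a single shell this shortcut no longer applies and the FragmentOnBonds() route is the one to use.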
best wishes,
wim
On Thu, Mar 27, 2025 at 12:23 PM Pavel Polishchuk wrote:
> Hello,
>
> I encountered an issue with SMILES of fragments. Maybe someone can
> suggest a workaround.
> I attached the notebook, but will also reproduce some code here.
>
> We have a structure with two N atoms. For each N we take the atom and its
> adjacent atoms to make a fragment SMILES, and we get different results even
> though both SMILES represent the same pattern (only the order of atoms
> differs). I guess this happens because the canonicalization algorithm takes
> into account some additional information that is missing from the output
> SMILES (e.g. ring membership). For instance, if we break the saturated ring
> (bond 8-9), we get identical SMILES output.
>
> mol = Chem.MolFromSmiles('CCn1c2cccc3CCn(c23)c2ccccc12')
>
>
> print(Chem.MolFragmentToSmiles(mol, [1,2,3,17], canonical=True))
> print(Chem.MolFragmentToSmiles(mol, [9,10,11,12], canonical=True))
>
> cN(C)c
> cN(c)C
>
> So, the question is how to work around this issue. We already have
> millions of such patterns, so it would be enough if we could
> canonicalize them. However, standard canonicalization does not work,
> because we have to disable sanitization during SMILES parsing, and the
> output is then the same as the input SMILES. Any ideas are appreciated.
>
> print(Chem.MolToSmiles(Chem.MolFromSmiles('cN(C)c', sanitize=False)))
> print(Chem.MolToSmiles(Chem.MolFromSmiles('cN(c)C', sanitize=False)))
>
> cN(C)c
> cN(c)C
>
> This issue actually came up in code for identifying functional
> groups.
>
> Kind regards,
> Pavel
> _______________________________________________
> Rdkit-discuss mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>