Thank you, Wim. It works. An even simpler solution might be to remove all
atoms except the required ones. I should have guessed :)
However, this is a bug in recent RDKit versions. The function
MolFragmentToSmiles works correctly in version 2023, but not in 2024.
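The atom-removal idea mentioned above could look roughly like the sketch below. The helper name `frag_smi_by_deletion` is hypothetical, and the sketch assumes that deleting atoms from an `RWMol` copy and then calling `MolToSmiles` canonicalizes the remaining atoms without any leftover ring information from the parent molecule; this has not been checked across RDKit versions.

```python
from rdkit import Chem


def frag_smi_by_deletion(mol, frag_atoms):
    """Hypothetical helper: SMILES of a fragment obtained by deleting all other atoms."""
    keep = set(frag_atoms)
    rw = Chem.RWMol(mol)  # editable copy, the input molecule is left untouched
    # delete from the highest index down so the remaining indices stay valid
    for idx in range(mol.GetNumAtoms() - 1, -1, -1):
        if idx not in keep:
            rw.RemoveAtom(idx)
    # no sanitization here: the fragment keeps aromatic atoms outside any ring
    return Chem.MolToSmiles(rw)


mol = Chem.MolFromSmiles('CCn1c2cccc3CCn(c23)c2ccccc12')
print(frag_smi_by_deletion(mol, [1, 2, 3, 17]))
print(frag_smi_by_deletion(mol, [9, 10, 11, 12]))
```

If the assumption holds, both calls should print the same string, since the two fragments differ only in atom order.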
On 28/03/2025 00:10, Wim Dehaen wrote:
Pavel,
this is a bit hacky, but you can try the below:
```
from rdkit import Chem


def get_frag_smi(mol, frag_atoms):
    if len(frag_atoms) > 1:
        b2b = []  # bonds to break
        fsmi = ""  # fragment SMILES
        # collect every bond with at least one atom outside the fragment
        for b in mol.GetBonds():
            b_idx = b.GetBeginAtomIdx()
            e_idx = b.GetEndAtomIdx()
            if e_idx not in frag_atoms \
                    or b_idx not in frag_atoms:
                b2b.append(b.GetIdx())
        # break all bonds except those inside the fragment
        fmol = Chem.FragmentOnBonds(mol, b2b, addDummies=False)
        smis = Chem.MolToSmiles(fmol).split(".")
        # retain the only fragment with more than one atom
        while fsmi == "":
            smi = smis.pop(0)
            m = Chem.MolFromSmiles(smi, sanitize=False)
            if len(m.GetAtoms()) > 1:
                fsmi = smi
    else:  # one atom, no canonicalization needed
        fsmi = Chem.MolFragmentToSmiles(mol, frag_atoms)
    return fsmi
```
it is based on the observation/assumption that FragmentOnBonds()
followed by MolToSmiles() canonicalizes the fragments cleanly.
> print(get_frag_smi(mol,[1,2,3,17]))
> print(get_frag_smi(mol,[9,10,11,12]))
prints `cN(c)C` twice.
best wishes,
wim
On Thu, Mar 27, 2025 at 12:23 PM Pavel Polishchuk
<pavel_polishc...@ukr.net> wrote:
Hello,
I encountered an issue with the SMILES of fragments. Maybe someone
can suggest a workaround.
I attached the notebook, but will also reproduce some code here.
We have a structure with two N atoms. For each one we take the N atom
and its adjacent atoms to make a fragment SMILES, and we get different
results even though both SMILES represent the same pattern (only the
order of atoms differs). I guess this happens because the
canonicalization algorithm takes into account some additional
information that is missing from the output SMILES (e.g. ring
membership). For instance, if we break the saturated ring (bond
8-9), we get identical SMILES output.
mol = Chem.MolFromSmiles('CCn1c2cccc3CCn(c23)c2ccccc12')
print(Chem.MolFragmentToSmiles(mol, [1,2,3,17], canonical=True))
print(Chem.MolFragmentToSmiles(mol, [9,10,11,12], canonical=True))
cN(C)c
cN(c)C
So, the question is how to work around this issue. We already
have millions of such patterns, so it would be enough if we were
able to canonicalize them. However, standard canonicalization does
not work, because we have to disable sanitization during SMILES
parsing. It returns the same output as the input SMILES. Any ideas are
appreciated.
print(Chem.MolToSmiles(Chem.MolFromSmiles('cN(C)c', sanitize=False)))
print(Chem.MolToSmiles(Chem.MolFromSmiles('cN(c)C', sanitize=False)))
cN(C)c
cN(c)C
This issue actually arose in code for identifying
functional groups.
Kind regards,
Pavel
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss