This blog post links to two others that may have done that (I haven't read 
them carefully): 
https://baoilleach.blogspot.com/2018/06/cheminformatics-for-deep-learners.html

Best regards, Jan

On 06 Aug 2018, at 11:57, Guillaume GODIN <guillaume.go...@firmenich.com> wrote:

Dear Greg,

Fantastic, thank you for giving both an explanation and a solution to this 
“simple question”. I know it is not so simple, and it’s fundamental for data 
augmentation in deep learning.

If I may, I have another related question: do you know if anyone has worked on 
a generator of all unique SMILES that is independent of RDKit?

Thanks again,

Guillaume

From: Greg Landrum <greg.land...@gmail.com>
Date: Monday, 6 August 2018 at 11:40
To: Guillaume GODIN <guillaume.go...@firmenich.com>
Cc: RDKit Discuss <rdkit-discuss@lists.sourceforge.net>
Subject: Re: [Rdkit-discuss] enumeration of smiles question


On Thu, Aug 2, 2018 at 8:59 AM Guillaume GODIN <guillaume.go...@firmenich.com> wrote:

I have a simple question about generating all possible SMILES of a given 
molecule:

It's a simple question, but the answer is somewhat complicated. :-)


RDKit provides only 4 different SMILES for my molecule “CCC1CC1”:
C1C(CC)C1
CCC1CC1
C1(CC)CC1
C(C)C1CC1

By hand, however, we can write these 7 SMILES:
CCC1CC1
C(C)C1CC1
C(C1CC1)C
C1CC(CC)1
C1C(CC)C1
C1CC1CC
C(CC)1CC1
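
A quick way to confirm that hand-written variants like these all describe the same molecule is to group them by canonical SMILES. The following is only a minimal sketch: the helper name `group_by_canonical` is hypothetical, and it assumes RDKit's parser accepts every string in the list (anything it rejects is reported separately rather than raising).

```python
from rdkit import Chem


def group_by_canonical(smiles_list):
    """Group SMILES strings by the canonical SMILES RDKit assigns to them.

    Returns (groups, unparsed): groups maps canonical SMILES -> input strings,
    unparsed lists the inputs that RDKit could not parse.
    """
    groups = {}
    unparsed = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            # RDKit returns None (and logs an error) for unparsable input.
            unparsed.append(smi)
            continue
        groups.setdefault(Chem.MolToSmiles(mol), []).append(smi)
    return groups, unparsed


# The seven hand-written SMILES from the list above.
hand_written = [
    "CCC1CC1", "C(C)C1CC1", "C(C1CC1)C", "C1CC(CC)1",
    "C1C(CC)C1", "C1CC1CC", "C(CC)1CC1",
]
groups, unparsed = group_by_canonical(hand_written)
print(groups)
print(unparsed)
```

If all seven strings are equivalent, `groups` has a single key and `unparsed` is empty.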

I use this function for the enumeration:

import numpy as np
from rdkit import Chem

def allsmiles(smil):
    m = Chem.MolFromSmiles(smil) # Construct a molecule from a SMILES string.
    if m is None:
        return smil
    N = m.GetNumAtoms()
    if N==0:
        return smil
    try:
        # Pick a random root atom (as a plain int) and write a non-canonical SMILES from it.
        n = int(np.random.randint(0, high=N))
        t = Chem.MolToSmiles(m, rootedAtAtom=n, canonical=False)
    except Exception:
        return smil
    return t

n = 50
SMILES = ["CCC1CC1"]
SMILES_mult = [allsmiles(S) for S in SMILES for i in range(n)]

Why can't we generate all 7 SMILES?

The RDKit has rules that it uses to decide which atom to branch to when 
generating a SMILES. These are used regardless of whether you are generating 
canonical SMILES or not.
The upshot of this is that it will never generate a SMILES where there's a 
branch before a ring closure.
The other important factor here is that atom rank is determined by the index of 
the atom in the molecule when you aren't using canonicalization. So changing 
the atom order on input can help:
In [12]: set(allsmiles('CCC1CC1') for i in range(50))
Out[12]: {'C(C)C1CC1', 'C1(CC)CC1', 'C1C(CC)C1', 'CCC1CC1'}

In [13]: set(allsmiles('C1CC1CC') for i in range(50))
Out[13]: {'C(C1CC1)C', 'C1(CC)CC1', 'C1CC1CC', 'CCC1CC1'}
You can do this all at once as follows:

```
In [20]: def allsmiles(smil):
    ...:     m = Chem.MolFromSmiles(smil) # Construct a molecule from a SMILES string.
    ...:     if m is None:
    ...:         return smil
    ...:     N = m.GetNumAtoms()
    ...:     if N==0:
    ...:         return smil
    ...:     aids = list(range(N))
    ...:     random.shuffle(aids)
    ...:     m = Chem.RenumberAtoms(m,aids)
    ...:     try:
    ...:         n= random.randint(0,N-1)
    ...:         t= Chem.MolToSmiles(m, rootedAtAtom=n, canonical=False)
    ...:     except :
    ...:         return smil
    ...:     return t
    ...:

In [21]: set(allsmiles('C1CC1CC') for i in range(50))
Out[21]: {'C(C)C1CC1', 'C(C1CC1)C', 'C1(CC)CC1', 'C1C(CC)C1', 'C1CC1CC', 'CCC1CC1'}
```
Note that I switched to using Python's built-in random module instead of 
the one in numpy.
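
For a small molecule, the random shuffling above can be replaced by an exhaustive loop over every atom ordering and every root atom, so the result no longer depends on how many samples are drawn. This is only a sketch under stated assumptions: the function name `enumerate_smiles` and the `max_atoms` cutoff are hypothetical, and the cost grows as N! x N, so it is practical only for a handful of atoms. It still cannot produce forms that RDKit's writer never emits, such as a branch before a ring closure.

```python
import itertools

from rdkit import Chem


def enumerate_smiles(smiles, max_atoms=7):
    """Collect the non-canonical SMILES RDKit writes over all atom orderings and roots."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    n_atoms = mol.GetNumAtoms()
    if n_atoms == 0:
        return set()
    if n_atoms > max_atoms:
        # N! orderings times N roots: refuse anything that would explode.
        raise ValueError(f"{n_atoms} atoms is too many for exhaustive enumeration")
    found = set()
    for order in itertools.permutations(range(n_atoms)):
        # Renumbering changes the atom ranks used when canonical=False.
        renumbered = Chem.RenumberAtoms(mol, list(order))
        for root in range(n_atoms):
            found.add(Chem.MolToSmiles(renumbered, rootedAtAtom=root, canonical=False))
    return found


variants = enumerate_smiles("CCC1CC1")
print(len(variants), sorted(variants))
```

Every string in `variants` should parse back to the same canonical SMILES, which is a cheap sanity check before using the set for data augmentation.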

-greg




Thanks guys,

Best regards,

Guillaume
***********************************************************************************
DISCLAIMER
This email and any files transmitted with it, including replies and forwarded 
copies (which may contain alterations) subsequently transmitted from Firmenich, 
are confidential and solely for the use of the intended recipient. The contents 
do not represent the opinion of Firmenich except to the extent that it relates 
to their official business.
***********************************************************************************
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
