Re: [Rdkit-discuss] enumeration of smiles question

2018-08-10 Thread Esben Jannik Bjerrum via Rdkit-discuss
Hi there, I just saw this interesting thread :-) The code I posted on GitHub 
(https://github.com/EBjerrum/SMILES-enumeration), referenced earlier in this 
thread, also randomizes the atom order, similar to Greg's solution here, 
to generate more enumerated SMILES than the rootedAtAtom approach alone. It's 
not a complete enumeration, though, as interestingly there also seem to be 
other ways to represent the molecules, using dots! Thanks, that could be 
interesting to explore!

Nevertheless, the actual enumerator code is wrapped in a couple of objects, 
which can be used either to generate the SMILES dataset up front in various 
forms, or to generate it on the fly as batch generators. That works nicely 
with the fit_generator function of Keras if you use that framework. This 
avoids memory issues with large datasets and is convenient, at the cost of 
some overhead in the training (a few percent longer training).
In some of my recent applications I use the binary format or the mol objects 
directly, instead of round-tripping the SMILES through an RDKit molecule.
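
To illustrate the on-the-fly idea, here is a minimal sketch of a batch generator. It is not the code from the GitHub repository: the helper names `random_smiles` and `smiles_batches` are made up for this example, and the randomization uses the atom-order shuffling described elsewhere in this thread. The molecules are parsed once and kept as mol objects, so only the SMILES writing happens per batch.

```python
import random
from rdkit import Chem


def random_smiles(mol, rng):
    """Return one non-canonical SMILES by shuffling the atom order of mol."""
    order = list(range(mol.GetNumAtoms()))
    rng.shuffle(order)
    shuffled = Chem.RenumberAtoms(mol, order)
    return Chem.MolToSmiles(shuffled, canonical=False)


def smiles_batches(smiles_list, batch_size, seed=0):
    """Yield endless batches of freshly randomized SMILES (hypothetical helper)."""
    rng = random.Random(seed)
    # Parse once and keep the mol objects; unparsable or empty inputs are skipped.
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    mols = [m for m in mols if m is not None and m.GetNumAtoms() > 0]
    if not mols:
        raise ValueError("no parsable, non-empty molecules")
    while True:
        yield [random_smiles(rng.choice(mols), rng) for _ in range(batch_size)]


batches = smiles_batches(["CCC1CC1", "c1ccccc1O"], batch_size=4)
print(next(batches))
```

A real training generator would also have to tokenize and pad each batch and yield the matching labels; that part depends on the model and is left out here.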

It seems like the enumeration trick is a nice way to break the SMILES 
serialization of the molecular representation and somehow generate an internal 
representation of the molecule closer to the graph we think of molecules as. I 
did some work with autoencoders as heteroencoders, trying to encode between 
different molecular formats and also from enumerated to enumerated SMILES. It 
seems to work, even though I'm presenting a random SMILES and asking the 
network to encode it to a vector and then decode it into another randomly 
chosen SMILES of the same molecule during training, each time with a new pair 
of randomly generated SMILES of the same molecule. The teacher forcing of the 
decoder is probably crucial here, as it lets the decoder correct its later 
guesses based on the actual right answer per character. Doing this seems to 
have a lot of influence on the latent space encoded by the autoencoder, with 
possible implications for de novo molecular generation.
There's a preprint here: https://arxiv.org/abs/1806.09300
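
As a rough sketch of how such enumerated-to-enumerated training examples could be assembled (not the code behind the preprint): the function names and the start/end tokens below are assumptions made for this example. With teacher forcing, the decoder input is the target SMILES shifted right by a start token, so at every position the decoder sees the true previous character rather than its own guess.

```python
import random
from rdkit import Chem

# Hypothetical start/end tokens, chosen here only because they are not SMILES characters.
START, END = "^", "\n"


def random_smiles(mol, rng):
    """Return one non-canonical SMILES by shuffling the atom order of mol."""
    order = list(range(mol.GetNumAtoms()))
    rng.shuffle(order)
    return Chem.MolToSmiles(Chem.RenumberAtoms(mol, order), canonical=False)


def training_example(smiles, rng):
    """Build (encoder input, decoder input, decoder target) for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None or mol.GetNumAtoms() == 0:
        raise ValueError(f"cannot parse {smiles!r}")
    source = random_smiles(mol, rng)   # what the encoder reads
    target = random_smiles(mol, rng)   # an independent SMILES of the same molecule
    decoder_input = START + target     # teacher forcing: true previous characters
    decoder_output = target + END      # what the decoder should predict
    return source, decoder_input, decoder_output


rng = random.Random(0)
print(training_example("CCC1CC1", rng))
```

Calling `training_example` again for the same molecule draws a new pair, which matches the "new pair each time" behaviour described above.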
Some researchers at Bayer have, independently of me, also worked on similar 
approaches and showed improvements from using the latent space representation 
for QSAR modelling.
https://chemrxiv.org/articles/Learning_Continuous_and_Data-Driven_Molecular_Descriptors_by_Translating_Equivalent_Chemical_Representations/6871628
I guess we haven't seen the end of this yet, as there is a lot to explore and 
improve on. It's super fascinating how far a bit of deep learning and data 
augmentation of the SMILES can take us.
Best regards,
Esben
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] enumeration of smiles question

2018-08-06 Thread Markus Sitzmann
O tempora, o mores! Didn't we try for ages to make our SMILES canonical, and
now, all of a sudden, the opposite is hip :-)


Re: [Rdkit-discuss] enumeration of smiles question

2018-08-06 Thread Chris Earnshaw
Hi

The question 'what do you mean by ALL?' springs to mind. None of the
discussion includes dot-disconnected SMILES, which are also perfectly valid
representations. For example C(C1C2)C.C12 is yet another SMILES (of many
possible) for the example structure.
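
A quick way to check that the dot-disconnected form really is the same molecule is to compare canonical SMILES. This is only a sketch; the helper name `canonical` is made up for the example.

```python
from rdkit import Chem


def canonical(smiles):
    """Return RDKit's canonical SMILES, or None if the input does not parse."""
    mol = Chem.MolFromSmiles(smiles)
    return None if mol is None else Chem.MolToSmiles(mol)


dotted = "C(C1C2)C.C12"
plain = "CCC1CC1"

# The ring-closure digits bond across the dot, so this is a single fragment.
print(len(Chem.GetMolFrags(Chem.MolFromSmiles(dotted))))
print(canonical(dotted) == canonical(plain))
```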

I've no idea whether this is of any relevance to you, but you should
probably consider these representations and decide whether they are
important or not.

Best regards,
Chris


Re: [Rdkit-discuss] enumeration of smiles question

2018-08-06 Thread Jan Halborg Jensen
This blogpost links to two other ones that may have done that (haven’t read 
them carefully): 
https://baoilleach.blogspot.com/2018/06/cheminformatics-for-deep-learners.html

Best regards, Jan


Re: [Rdkit-discuss] enumeration of smiles question

2018-08-06 Thread Shojiro Shibayama
Dear Guillaume,

Sorry for the interruption, but you were referring to this paper, weren't you?
"SMILES Enumeration as Data Augmentation for Neural Network Modeling of
Molecules"
https://arxiv.org/pdf/1703.07076.pdf

The author says that RDKit is used in the paper, and the implementation is
published on GitHub: https://github.com/Ebjerrum/SMILES-enumeration
I hope this helps.

Best regards,
Shojiro


Re: [Rdkit-discuss] enumeration of smiles question

2018-08-06 Thread Peter S. Shenkin
Just curious, Guillaume, why do you want to do this?


Re: [Rdkit-discuss] enumeration of smiles question

2018-08-06 Thread Guillaume GODIN
Dear Greg,

Fantastic, thank you for giving both an explanation and a solution to this 
“simple question”. I know it is not so simple, and it's fundamental for data 
augmentation in deep learning.

If I may, I have another related question: do you know if someone has worked 
on a generator of all unique SMILES independently of RDKit?

Thanks again,

Guillaume


Re: [Rdkit-discuss] enumeration of smiles question

2018-08-06 Thread Greg Landrum
On Thu, Aug 2, 2018 at 8:59 AM Guillaume GODIN <
guillaume.go...@firmenich.com> wrote:

>
>
> I have a simple question about generating all possible smiles of a given
> molecule:
>
>
It's a simple question, but the answer is somewhat complicated. :-)


> RDKit provides only 4 different SMILES for my molecule “CCC1CC1”:
> C1C(CC)C1
> CCC1CC1
> C1(CC)CC1
> C(C)C1CC1
>
> While by hand we can write these 7 SMILES:
> CCC1CC1
> C(C)C1CC1
> C(C1CC1)C
> C1CC(CC)1
> C1C(CC)C1
> C1CC1CC
> C(CC)1CC1
>
> I use this function for the enumeration:
>
> def allsmiles(smil):
>     m = Chem.MolFromSmiles(smil)  # Construct a molecule from a SMILES string.
>     if m is None:
>         return smil
>     N = m.GetNumAtoms()
>     if N == 0:
>         return smil
>     try:
>         n = np.random.randint(0, high=N)
>         t = Chem.MolToSmiles(m, rootedAtAtom=n, canonical=False)
>     except:
>         return smil
>     return t
>
> n = 50
> SMILES = ["CCC1CC1"]
> SMILES_mult = [allsmiles(S) for S in SMILES for i in range(n)]
>
> Why can't we generate all 7 SMILES?

The RDKit has rules that it uses to decide which atom to branch to when
generating a SMILES. These are used regardless of whether you are
generating canonical SMILES or not.
The upshot of this is that it will never generate a SMILES where there's a
branch before a ring closure.
The other important factor here is that atom rank is determined by the
index of the atom in the molecule when you aren't using canonicalization.
So changing the atom order on input can help:

In [12]: set(allsmiles('CCC1CC1') for i in range(50))
Out[12]: {'C(C)C1CC1', 'C1(CC)CC1', 'C1C(CC)C1', 'CCC1CC1'}

In [13]: set(allsmiles('C1CC1CC') for i in range(50))
Out[13]: {'C(C1CC1)C', 'C1(CC)CC1', 'C1CC1CC', 'CCC1CC1'}
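
The effect of rooting alone can also be checked without random sampling by looping over every possible root atom. This is a small sketch with a made-up helper name, `rooted_smiles`; because the atom order is fixed by the input string, it can return at most one SMILES per atom, and which strings come out may differ between RDKit versions.

```python
from rdkit import Chem


def rooted_smiles(smiles):
    """Collect the non-canonical SMILES obtained by rooting at each atom in turn."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return set()
    return {
        Chem.MolToSmiles(mol, rootedAtAtom=i, canonical=False)
        for i in range(mol.GetNumAtoms())
    }


for text in ("CCC1CC1", "C1CC1CC"):
    print(text, sorted(rooted_smiles(text)))
```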

You can do this all at once as follows:

```
In [20]: def allsmiles(smil):
    ...:     m = Chem.MolFromSmiles(smil)  # Construct a molecule from a SMILES string.
    ...:     if m is None:
    ...:         return smil
    ...:     N = m.GetNumAtoms()
    ...:     if N==0:
    ...:         return smil
    ...:     aids = list(range(N))
    ...:     random.shuffle(aids)
    ...:     m = Chem.RenumberAtoms(m,aids)
    ...:     try:
    ...:         n= random.randint(0,N-1)
    ...:         t= Chem.MolToSmiles(m, rootedAtAtom=n, canonical=False)
    ...:     except :
    ...:         return smil
    ...:     return t
    ...:

In [21]: set(allsmiles('C1CC1CC') for i in range(50))
Out[21]: {'C(C)C1CC1', 'C(C1CC1)C', 'C1(CC)CC1', 'C1C(CC)C1', 'C1CC1CC', 'CCC1CC1'}
```
Note that I switched to using Python's built-in random module instead of
using the one in numpy.
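
For reuse outside an interactive session, the same shuffle-then-root idea can be wrapped so that it collects the distinct SMILES directly. This is a sketch, not part of RDKit: `enumerate_smiles`, `n_tries` and `seed` are names invented for the example, and the result is a sample, so a small `n_tries` can miss some forms.

```python
import random
from rdkit import Chem


def enumerate_smiles(smiles, n_tries=200, seed=42):
    """Sample distinct non-canonical SMILES via atom shuffling plus a random root."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None or mol.GetNumAtoms() == 0:
        return set()
    rng = random.Random(seed)  # local generator, so results are repeatable
    n_atoms = mol.GetNumAtoms()
    found = set()
    for _ in range(n_tries):
        order = list(range(n_atoms))
        rng.shuffle(order)
        shuffled = Chem.RenumberAtoms(mol, order)
        root = rng.randrange(n_atoms)
        found.add(Chem.MolToSmiles(shuffled, rootedAtAtom=root, canonical=False))
    return found


print(sorted(enumerate_smiles("CCC1CC1")))
```

Returning an empty set for unparsable input, instead of echoing the input string back, keeps failures from being mistaken for valid SMILES.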

-greg






[Rdkit-discuss] enumeration of smiles question

2018-08-02 Thread Guillaume GODIN
Dear All RDKiters,

I have a simple question about generating all possible smiles of a given 
molecule:

RDKit provides only 4 different SMILES for my molecule “CCC1CC1”:
C1C(CC)C1
CCC1CC1
C1(CC)CC1
C(C)C1CC1

While by hand we can write these 7 SMILES:
CCC1CC1
C(C)C1CC1
C(C1CC1)C
C1CC(CC)1
C1C(CC)C1
C1CC1CC
C(CC)1CC1
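
Whether hand-written variants like these all describe the same molecule can be checked by grouping them by canonical SMILES. A minimal sketch follows; the helper name `group_by_canonical` is made up for the example, and any string RDKit cannot parse ends up under the key `None`.

```python
from rdkit import Chem

HAND_WRITTEN = [
    "CCC1CC1", "C(C)C1CC1", "C(C1CC1)C", "C1CC(CC)1",
    "C1C(CC)C1", "C1CC1CC", "C(CC)1CC1",
]


def group_by_canonical(smiles_list):
    """Map each canonical SMILES (or None if unparsable) to the inputs giving it."""
    groups = {}
    for text in smiles_list:
        mol = Chem.MolFromSmiles(text)
        key = None if mol is None else Chem.MolToSmiles(mol)
        groups.setdefault(key, []).append(text)
    return groups


for key, members in group_by_canonical(HAND_WRITTEN).items():
    print(key, members)
```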

I use this function for the enumeration:

import numpy as np
from rdkit import Chem

def allsmiles(smil):
    m = Chem.MolFromSmiles(smil)  # Construct a molecule from a SMILES string.
    if m is None:
        return smil
    N = m.GetNumAtoms()
    if N == 0:
        return smil
    try:
        # Random root atom index in [0, N); cast to a plain int in case the
        # RDKit wrapper does not accept NumPy integers.
        n = int(np.random.randint(0, high=N))
        t = Chem.MolToSmiles(m, rootedAtAtom=n, canonical=False)
    except Exception:
        return smil
    return t

n = 50
SMILES = ["CCC1CC1"]
SMILES_mult = [allsmiles(S) for S in SMILES for i in range(n)]

Why can't we generate all 7 SMILES?

Thanks guys,

Best regards,

Guillaume

***
DISCLAIMER  
This email and any files transmitted with it, including replies and forwarded 
copies (which may contain alterations) subsequently transmitted from Firmenich, 
are confidential and solely for the use of the intended recipient. The contents 
do not represent the opinion of Firmenich except to the extent that it relates 
to their official business.  
***