Hi There,  Just saw this interesting thread :-) The code I posted on GitHub 
(https://github.com/EBjerrum/SMILES-enumeration), as referenced previously in 
this thread, also uses randomization of the atom order, similar to Greg's 
solution here, to generate more enumerated SMILES than the rootedAtom approach 
does. It's not a complete enumeration, as interestingly there also seem to be 
other ways to represent the molecules using dots! Thanks, that could be 
interesting to explore!
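For reference, the core randomization trick can be sketched with RDKit like 
this (a minimal sketch; the function name is mine, and the actual code in the 
repo is organized differently):

```python
from rdkit import Chem
import random

def randomized_smiles(smiles, n=10):
    """Return up to n distinct SMILES strings for the same molecule,
    obtained by shuffling the atom order and then writing a
    non-canonical SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    variants = set()
    for _ in range(n):
        order = list(range(mol.GetNumAtoms()))
        random.shuffle(order)                       # randomize atom order
        shuffled = Chem.RenumberAtoms(mol, order)   # apply the new order
        variants.add(Chem.MolToSmiles(shuffled, canonical=False))
    return variants
```

Each variant parses back to the same molecule, so they all canonicalize to 
the same string.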

Nevertheless, the actual enumerator code is wrapped in a couple of objects, 
which can be used either to generate the SMILES dataset up front in various 
forms, or to do it on the fly as batch generators. The latter works nicely 
with the fit_generator function of Keras, if you use that framework. This 
avoids memory issues with large datasets and is convenient, at the cost of 
some overhead during training (a few percent longer training time).
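The on-the-fly idea can be sketched as an infinite generator (a simplified 
sketch, not the repo's actual classes; in practice each batch would be one-hot 
encoded into arrays before being handed to fit_generator):

```python
from rdkit import Chem
import random

def smiles_batch_generator(smiles_list, batch_size):
    """Endlessly yield batches of freshly randomized SMILES.
    Every epoch thus sees different string representations of
    the same molecules (the data augmentation effect)."""
    while True:
        batch = random.sample(smiles_list, batch_size)
        out = []
        for smi in batch:
            mol = Chem.MolFromSmiles(smi)
            order = list(range(mol.GetNumAtoms()))
            random.shuffle(order)
            out.append(Chem.MolToSmiles(Chem.RenumberAtoms(mol, order),
                                        canonical=False))
        yield out
```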
In some of my recent applications I use the binary format or the mol objects 
directly, instead of round-tripping the SMILES through an RDKit molecule.
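The binary round trip looks like this (a small example of RDKit's ToBinary 
serialization; variable names are mine):

```python
from rdkit import Chem

# Serialize a mol object to RDKit's compact binary format and back,
# skipping the SMILES parsing step on reload.
mol = Chem.MolFromSmiles("c1ccccc1O")
blob = mol.ToBinary()        # bytes, suitable for storing or pickling
restored = Chem.Mol(blob)    # rebuild the mol without re-parsing SMILES
```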

It seems like the enumeration trick is a nice way to break the SMILES 
serialization of the molecular representation and somehow generate an internal 
representation closer to the graph we usually think of molecules as. I did 
some work with autoencoders as heteroencoders, trying to encode and decode 
between different molecular formats, and also from enumerated to enumerated 
SMILES. It seems to work, even though I'm presenting a random SMILES and 
asking the network to encode it to a vector and then decode it into another 
randomly chosen SMILES of the same molecule during training; each time the 
network sees a new pair of two randomly generated SMILES of the same molecule. 
The teacher forcing of the decoder is probably crucial here, as it lets the 
decoder correct its later guesses based on the actual right answer per 
character. Doing this seems to have a lot of influence on the latent space 
encoded by the autoencoder, with possible implications for molecular de novo 
generation.
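The pairing scheme for the enumerated-to-enumerated setup can be sketched 
like this (my own sketch of the data preparation only, not the network or the 
teacher forcing itself):

```python
from rdkit import Chem
import random

def random_smiles(mol):
    """One randomized SMILES for a mol, via atom-order shuffling."""
    order = list(range(mol.GetNumAtoms()))
    random.shuffle(order)
    return Chem.MolToSmiles(Chem.RenumberAtoms(mol, order), canonical=False)

def enum2enum_pairs(smiles_list):
    """Yield (input, target) pairs where both sides are independently
    randomized SMILES of the same molecule, drawn fresh each pass, so
    the network never memorizes a fixed string-to-string mapping."""
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        yield random_smiles(mol), random_smiles(mol)
```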
There's a preprint here: https://arxiv.org/abs/1806.09300
Some researchers at Bayer have, independently of me, also worked on similar 
approaches and showed improvements from using the latent space representation 
for QSAR modelling.
https://chemrxiv.org/articles/Learning_Continuous_and_Data-Driven_Molecular_Descriptors_by_Translating_Equivalent_Chemical_Representations/6871628
I guess we haven't seen the end of this yet, as there is a lot to explore and 
improve on. It's super fascinating how far a bit of deep learning and data 
augmentation of SMILES can go.
Best Regards,
Esben
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss