From: Bennion, Brian
Sent: Monday, August 07, 2017 11:39
To: 'Konrad Koehler' <konrad.koeh...@me.com>
Subject: RE: [Rdkit-discuss] using rdkit to read in chembl23 1.7 million 
compounds

Hello Konrad,
Thank you for your response.
For the handful of compounds I looked at:
multi-ring compounds with ring-closure labels %11 through %14 that were 
coordinated to zinc had issues
aromatic carbocations [c+] had issues

As a side note, I attempted reading in the 2D SDF file that ChEMBL supplies.  I 
was able to reduce the number of failed molecules to 253.
There were still many warnings about ambiguous stereochemistry and 
strange tags like STY at the end of the molecules.

Brian

From: Konrad Koehler [mailto:konrad.koeh...@me.com]
Sent: Monday, August 07, 2017 11:25
To: Bennion, Brian <benni...@llnl.gov>
Subject: Re: [Rdkit-discuss] using rdkit to read in chembl23 1.7 million 
compounds

Hi Brian,

Similar problems here in trying to read, fragment, and canonicalize the ZINC 
“In Stock” database of roughly one million compounds. Most of the problematic 
structures contained aromatic sulfur atoms. (Thiophene itself is no problem; 
most of the crashes came from more complex heteroaromatic systems containing 
sulfur.) Filtering the input file to remove SMILES strings containing a 
lowercase “s” allowed me to process the rest of the file without RDKit crashing.
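
A hedged illustration of the filter described above. A plain substring test 
for a lowercase "s" also discards SMILES with bracket atoms such as [Cs], [Os], 
[as] or [se], none of which are aromatic sulfur. The sketch below uses only the 
Python standard library to flag aromatic sulfur specifically. The names 
has_aromatic_sulfur and drop_aromatic_sulfur and both regular expressions are 
illustrative assumptions; they are not part of RDKit and not the script used in 
this thread. It also assumes the SMILES is the first whitespace-separated 
column of each input line.

```python
import re

# Any bracket atom, e.g. [nH], [Cs+], [13cH]
_BRACKET_ATOM = re.compile(r"\[[^\]]*\]")
# Element symbol at the start of a bracket atom, after optional isotope digits
_BRACKET_SYMBOL = re.compile(r"\[\d*([A-Za-z][a-z]?)")


def has_aromatic_sulfur(smiles):
    """Return True if the SMILES string contains an aromatic sulfur atom."""
    # Inside brackets only the exact symbol "s" is aromatic sulfur;
    # "se", "as", "Cs", "Os" and similar are other elements.
    for match in _BRACKET_SYMBOL.finditer(smiles):
        if match.group(1) == "s":
            return True
    # Outside brackets every lowercase "s" is an aromatic sulfur atom.
    return "s" in _BRACKET_ATOM.sub("", smiles)


def drop_aromatic_sulfur(lines):
    """Yield only the lines whose first column has no aromatic sulfur."""
    for line in lines:
        fields = line.split()
        if fields and not has_aromatic_sulfur(fields[0]):
            yield line
```

Like the workaround itself, this drops every aromatic-sulfur structure, 
including ones such as thiophene that parse without trouble, so it trades 
coverage for a run that completes.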

Cheers,

Konrad

crash dump:

Can't kekulize mol.
    child_node = AllChem.CanonSmiles(child_node)
  File "/Users/konradkoehler/anaconda/lib/python2.7/site-packages/rdkit/Chem/__init__.py", line 43, in CanonSmiles
    return MolToSmiles(m, useChiral)
Boost.Python.ArgumentError: Python argument types in
    rdkit.Chem.rdmolfiles.MolToSmiles(NoneType, int)
did not match C++ signature:
    MolToSmiles(RDKit::ROMol mol, bool isomericSmiles=False, bool kekuleSmiles=False, int rootedAtAtom=-1, bool canonical=True, bool allBondsExplicit=False, bool allHsExplicit=False)
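
The traceback shows MolToSmiles receiving a NoneType argument: the SMILES 
failed to parse (RDKit's MolFromSmiles returns None rather than raising), and 
the None was then handed to the writer. A minimal sketch of guarding against 
that follows. The function name canonicalize_or_skip and its parse and write 
parameters are hypothetical; by default it uses Chem.MolFromSmiles and 
Chem.MolToSmiles, and it assumes skipping unparsable inputs is acceptable.

```python
def canonicalize_or_skip(smiles_list, parse=None, write=None):
    """Return (canonical, failed): canonical SMILES and inputs that did not parse."""
    if parse is None or write is None:
        # Imported lazily so the helper can be exercised with stand-in callables.
        from rdkit import Chem
        parse = parse or Chem.MolFromSmiles
        write = write or Chem.MolToSmiles

    canonical, failed = [], []
    for smi in smiles_list:
        mol = parse(smi)
        if mol is None:
            # Parse failure: record the input instead of passing None onward.
            failed.append(smi)
            continue
        canonical.append(write(mol))
    return canonical, failed
```

Calling canonicalize_or_skip(smiles_list) with RDKit installed returns the 
failed inputs as a list, so the problem structures can be inspected afterwards 
instead of stopping the run.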


On 7 Aug 2017, at 18:36, Bennion, Brian <benni...@llnl.gov> wrote:

Hello,

This might be a nitpicky question.  I am attempting to read in the SMILES 
strings for the 1.7 million non-biological compounds in the latest chembl23 
release.  As it turns out, 382 compounds fail to be read by RDKit.
The errors are either kekulization failures or valence errors.

Has anyone attempted this task before?
Brian

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
