From: Bennion, Brian
Sent: Monday, August 07, 2017 11:39
To: 'Konrad Koehler' <konrad.koeh...@me.com>
Subject: RE: [Rdkit-discuss] using rdkit to read in chembl23 1.7 million compounds
Hello Konrad,
Thank you for your response.
For the handful of compounds I looked at:
multi-ring compounds with ring closures labeled %11 up to %14, coordinated to
zinc, had issues
aromatic carbocations [c+] had issues
As a side note, I attempted reading in the 2D SDF file that ChEMBL supplies. I
was able to reduce the number of failed molecules to 253.
There were still many warnings about ambiguous stereochemistry and about
strange tags like STY at the end of the molecules.
Brian
From: Konrad Koehler [mailto:konrad.koeh...@me.com]
Sent: Monday, August 07, 2017 11:25
To: Bennion, Brian <benni...@llnl.gov>
Subject: Re: [Rdkit-discuss] using rdkit to read in chembl23 1.7 million compounds
Hi Brian,
I ran into similar problems trying to read, fragment, and canonicalize the ZINC
“In Stock” database of roughly one million compounds. Most of the problematic
structures contained aromatic sulfur atoms. (Thiophene itself is no problem;
most of the crashes came from more complex heteroaromatic systems containing
sulfur.) Filtering the input file to remove SMILES strings with a lowercase “s”
allowed me to process the rest of the file without RDKit crashing.
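
A minimal sketch of that kind of pre-filter, for illustration only. It is not the filter used above: the names `has_aromatic_sulfur` and `split_by_aromatic_sulfur` are hypothetical, and it assumes each input line starts with the SMILES string, optionally followed by whitespace and a name. Unlike a plain search for "s", it looks only at the SMILES token and ignores a lowercase "s" that belongs to another element symbol inside brackets, such as [se] or [Cs+].

```python
import re

# A bracket atom such as [s+], [nH] or [Cs+].
BRACKET_ATOM = re.compile(r"\[([^\]]*)\]")
# Optional isotope digits followed by the element symbol inside a bracket atom.
SYMBOL = re.compile(r"^\d*([A-Za-z][a-z]?)")


def has_aromatic_sulfur(smiles):
    """Return True if the SMILES string contains an aromatic sulfur atom."""
    bracket_atoms = BRACKET_ATOM.findall(smiles)
    outside = BRACKET_ATOM.sub("", smiles)
    # Outside brackets, a lowercase "s" can only be aromatic sulfur.
    if "s" in outside:
        return True
    # Inside brackets, check the element symbol itself ("s", not "se" or "Cs").
    for atom in bracket_atoms:
        match = SYMBOL.match(atom)
        if match and match.group(1) == "s":
            return True
    return False


def split_by_aromatic_sulfur(lines):
    """Split 'SMILES [name]' lines into (kept, dropped) lists."""
    kept, dropped = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        target = dropped if has_aromatic_sulfur(fields[0]) else kept
        target.append(line.rstrip("\n"))
    return kept, dropped
```

Keeping the dropped lines in a separate list, rather than discarding them, makes it easy to come back to the sulfur-containing structures later.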
Cheers,
Konrad
crash dump:
Can't kekulize mol.
    child_node = AllChem.CanonSmiles(child_node)
  File "/Users/konradkoehler/anaconda/lib/python2.7/site-packages/rdkit/Chem/__init__.py", line 43, in CanonSmiles
    return MolToSmiles(m, useChiral)
Boost.Python.ArgumentError: Python argument types in
    rdkit.Chem.rdmolfiles.MolToSmiles(NoneType, int)
did not match C++ signature:
    MolToSmiles(RDKit::ROMol mol, bool isomericSmiles=False, bool kekuleSmiles=False, int rootedAtAtom=-1, bool canonical=True, bool allBondsExplicit=False, bool allHsExplicit=False)
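
The NoneType in the traceback above suggests that the SMILES string could not be parsed (here, the kekulization failure), so MolToSmiles was handed None instead of a molecule. A minimal sketch of guarding against that follows. The function names are hypothetical, and the `parse` and `write` parameters are an assumption added so the logic can be exercised without RDKit; when they are omitted, the sketch falls back to `Chem.MolFromSmiles` and `Chem.MolToSmiles`.

```python
def canon_smiles_or_none(smiles, parse=None, write=None):
    """Return a canonical SMILES, or None if the input cannot be parsed."""
    if parse is None or write is None:
        # Lazy import: RDKit is only needed when no replacements are supplied.
        from rdkit import Chem
        parse = parse or Chem.MolFromSmiles
        write = write or Chem.MolToSmiles
    mol = parse(smiles)
    if mol is None:
        # MolFromSmiles returns None when parsing or sanitization fails.
        return None
    return write(mol)


def canonicalize_all(smiles_list, parse=None, write=None):
    """Canonicalize every SMILES, collecting failures instead of raising."""
    results, failures = {}, []
    for smi in smiles_list:
        canon = canon_smiles_or_none(smi, parse, write)
        if canon is None:
            failures.append(smi)
        else:
            results[smi] = canon
    return results, failures
```

With RDKit installed, calling `canonicalize_all(list_of_smiles)` should return the unparsable strings in `failures` rather than stopping the run at the first one.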
On 7 Aug 2017, at 18:36, Bennion, Brian <benni...@llnl.gov> wrote:
Hello,
This might be a nitpicky question. I am attempting to read in the SMILES
strings for the 1.7 million non-biological compounds in the latest chembl23
release. As it turns out, 382 compounds fail to be read by RDKit.
The errors are either kekulization failures or valence errors.
Has anyone attempted this task before?
Brian
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss