Hello. This is a very interesting initiative, and closely related to
what I am working on.

I work on the Crystallography Open Database (COD,
www.crystallography.net), a large collection of openly accessible CIF
files. I have posted to this list a few times before. COD currently
contains 245336 files, so a noticeable proportion of your 44000 are
probably already included there.

My current task is extracting the chemical connectivity from the CIF
files and storing it in SMILES format, so that chemical substructure
searches can be performed on it. At present the SMILES collection holds
around 70000 entries. The conversion is done through OpenBabel.
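
As a rough illustration of this kind of conversion (a sketch only, not
the actual COD pipeline), one could shell out to the Open Babel
command-line converter and keep the first SMILES string it prints. The
function names and the file name in the usage note are hypothetical, and
the executable name is a parameter because older releases may ship the
tool under a different name or with different argument conventions.

```python
import subprocess


def build_command(cif_path, executable="obabel"):
    """Return the argv list for a CIF -> SMILES conversion written to stdout."""
    # -icif / -osmi select the input and output formats by their format IDs.
    return [executable, "-icif", cif_path, "-osmi"]


def first_smiles(converter_output):
    """Extract the SMILES string from the first non-empty output line.

    Each output line is assumed to be '<SMILES><whitespace><title>'.
    """
    for line in converter_output.splitlines():
        fields = line.split()
        if fields:
            return fields[0]
    return None


def cif_to_smiles(cif_path, executable="obabel"):
    """Run the converter on one CIF file; requires Open Babel to be installed."""
    result = subprocess.run(
        build_command(cif_path, executable),
        capture_output=True, text=True, check=True,
    )
    return first_smiles(result.stdout)
```

Calling cif_to_smiles("entry.cif") would return a single string, or None
if the converter printed nothing. Keeping the output parsing separate
from the subprocess call makes it possible to check the parsing without
Open Babel installed.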

Doing this, I have found that the results are rather satisfactory for
organic compounds but generally not for inorganic ones. In most cases
the result is perhaps not a "bug", but simply a representation of the
molecule that does not coincide with the one an inorganic chemist would
usually have in mind (or should I say "in my mind"?). In other cases
the results are wrong, especially through the appearance of spurious H
atoms. Because of this, I could add a lot of material to your list of
unsatisfactory conversions from CIF files.

I need to review and, in most cases, fix the SMILES strings that
OpenBabel produces for inorganic compounds (either manually or
semi-automatically). I am also stuck with version 2.2.3, because newer
versions perform worse for inorganic compounds.

These facts are understandable, since the bond and valence concepts
behind cheminformatics formats, and cheminformatics in general, belong
mostly to the realm of valence bond theory and thus to the formalism of
organic chemistry. Many of these concepts (such as the definition of
"double" and "aromatic" bonds) become more dubious when metal atoms are
present. Also, common "organic" atoms behave differently when they
interact with metals. A clear example is nitrogen: OpenBabel does not
take into account that this element usually binds to metal atoms
through its lone pair (thus forming four bonds), and this introduces a
lot of mistakes when OpenBabel tries to keep it trivalent at all costs.
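
To make the nitrogen case concrete, here is a minimal sketch of a check
that lists the nitrogen atoms bonded to a metal in a connectivity table,
together with how many of their neighbours are metals. It is not part of
OpenBabel; the function name, the input layout (a list of element
symbols plus a list of index pairs) and the short metal list are all
assumptions made for illustration. It counts neighbours only and ignores
bond orders.

```python
# Illustrative subset only; a real check would need the full list of metals.
METALS = frozenset({"Cu", "Zn", "Ni", "Co", "Fe", "Pd", "Pt"})


def coordinated_nitrogens(elements, bonds, metals=METALS):
    """Return (index, non_metal_neighbours, metal_neighbours) for every
    nitrogen atom that has at least one metal neighbour.

    elements: list of element symbols, one per atom
    bonds:    list of (i, j) pairs of atom indices
    """
    neighbours = {i: [] for i in range(len(elements))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    found = []
    for i, symbol in enumerate(elements):
        if symbol != "N":
            continue
        metal_links = sum(1 for j in neighbours[i] if elements[j] in metals)
        if metal_links:
            found.append((i, len(neighbours[i]) - metal_links, metal_links))
    return found
```

For an ammine ligand on copper, elements = ["Cu", "N", "H", "H", "H"]
with bonds = [(0, 1), (1, 2), (1, 3), (1, 4)], the nitrogen is reported
with three non-metal neighbours and one metal neighbour, that is, four
connections in total. Atoms like this are the ones where a strictly
trivalent model of nitrogen would be expected to go wrong.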

By the way, you can use the COD SMILES collection, many entries of
which have been manually revised, for your task if you think it may
help. I do not know whether a SMILES string carries enough information
to build an acceptable MOL file, though.

Links from ChemSpider to COD CIFs are of course welcome.

Best wishes,
Miguel Quirós

On Mon, 09-12-2013 at 08:44 -0800, daya wrote:
> I’ve just supervised a student project in which we used OpenBabel to convert
> over 44,000 Royal Society of Chemistry CIF structures to mol files, then a
> student checked over 4,000 of these conversions so that we could upload the
> successfully processed CIFs to ChemSpider for the corresponding ChemSpider
> compounds. A summary of the results of that project is given here:
> http://www.chemspider.com/blog/adding-rsc-cifs-to-chemspider.html 
> It seemed like a valuable opportunity to identify the most frequent
> OpenBabel bugs when doing a CIF to Mol conversion so these are documented in
> there, along with test cases to identify the problems and with a view to
> fixing them and making OpenBabel more bulletproof.
> We’re taking a bit of a break from this project for now, but in the next
> phase we will see if we can fix at least some of the bugs identified, if
> they haven’t already been fixed.
> We’re sharing these results here for now, though, since we thought you
> would be interested in the project and the performance of OpenBabel when run
> over such a large and varied test set, possibly even enough to look into
> some of them yourselves… 
> Looking forward to working with you on some of them in the future…
> Aileen Day (Informatics Analyst, RSC ChemSpider)
> 
> 
> 
> --
> View this message in context: 
> http://forums.openbabel.org/Converting-CIFs-to-mols-using-OpenBabel-tp4657031.html
> Sent from the General discussion mailing list archive at Nabble.com.
> 
> _______________________________________________
> OpenBabel-discuss mailing list
> OpenBabel-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/openbabel-discuss

-- 
Miguel Quirós Olozábal
Departamento de Química Inorgánica. Facultad de Ciencias.
Universidad de Granada. 18071 Granada. SPAIN.
email: mquiros<at>ugr<dot>es
       mquiros<arroba>ugr<punto>es




