Hello.

Sorry that it has taken me so long (almost one month to continue this thread!) to put together a list of files with representative examples that illustrate OpenBabel's performance on inorganic compounds. I had not found the time to do it until now. As you can see (be warned!), the message is very long even though it only deals with nine examples.

I have selected CIF files from the Crystallography Open Database (COD), trying to focus each example on a single particular problem and avoiding examples with more than one simultaneous problem as well as files with "crystallographic" problems (symmetry, disorder, poor or incomplete data, ill-formatted files, ...). The CIF files used for the tests can be downloaded from COD using URLs of the form http://www.crystallography.net/xxxxxxx.cif (xxxxxxx is the numeric identifier used for each structure in COD).

For each file, I run the command "babel -aB xxxxxxx.cif -osmi" with openbabel 2.2.3 and openbabel 2.3.2. The "-aB" flag is used to include in the molecular model all bonds listed by the authors in the CIF file, since without it openbabel sometimes leaves some bonds out (I think that the maximum number of bonds allowed without the "-aB" flag is too low in many cases). This approach has the disadvantage that authors sometimes list in the CIF distances that are interesting to them but are not "bonds", which introduces spurious bonds in the result; in most cases, however, the "-aB" option gives more satisfactory results. This only applies to version 2.2.3: version 2.3.2 seems to ignore the "-aB" flag.

After quoting the openbabel output, I indicate what I think the correct result should be, bearing in mind that the definition of "bond" or "bond order" in inorganic chemistry is not as well established as in organic chemistry, so some of "my" representations may be open to debate. I also add a short comment on the result for each example.

2008819.cif (a pyridine complex).
Babel 2.2.3 output:
[Os](Cl)(F)(F)([n]1ccccc1)([n]1ccccc1)[n]1ccccc1
Babel 2.3.2 output:
[Os](Cl)(F)(F)(N1C=CCC=C1)(N1C=CCC=C1)N1C=CCC=C1
Expected:
[Os](Cl)(F)(F)([n]1ccccc1)([n]1ccccc1)[n]1ccccc1
* Version 2.2.3 gets it right, whereas version 2.3.2 "dearomatizes" pyridine and inserts a spurious H atom on one of the ring carbon atoms, trying to keep valence 3 for nitrogen at all costs.

2227419.cif (a bipyridine complex).
Babel 2.2.3 output:
[Pt]1(Br)(Br)(Br)(Br)[n]2ccccc2c2[n]1cccc2
Babel 2.3.2 output:
[Pt]1(Br)(Br)(Br)(Br)N2C=CCC=C2[C@@H]2N1C=CC=C2
Expected:
[Pt]1(Br)(Br)(Br)(Br)[n]2ccccc2c2[n]1cccc2
* Similar to the previous example. The spurious H atom introduced by 2.3.2 in one of the rings also implies some imaginary chirality.

2223192.cif (a phenanthroline complex).
Babel 2.2.3 output:
[Mo]1(F)(F)([O])([O])[n]2cccc3c2c2[n]1cccc2cc3
Babel 2.3.2 output:
[Mo]1(F)(F)([O])([O])n2cccc3c2c2n1cccc2cc3
Expected:
[Mo]1(F)(F)(=O)(=O)[n]2cccc3c2c2[n]1cccc2cc3
* Version 2.2.3 gets the phenanthroline right, while 2.3.2 does not put brackets around the nitrogens. I think brackets are necessary since nitrogen is not using its standard valence. Nevertheless, no spurious H atoms are added in this case. Babel does not regard the molybdenum-oxygen bonds as double; this is probably quite difficult to detect.

8100257.cif (a phosphane complex).
Babel output (both versions):
[Ru](Cl)(Cl)(P(c1ccccc1)(c1ccccc1)c1ccccc1)(P(c1ccccc1)(c1ccccc1)c1ccccc1)P(c1ccccc1)(c1ccccc1)c1ccccc1
Expected:
[Ru](Cl)(Cl)([P](c1ccccc1)(c1ccccc1)c1ccccc1)([P](c1ccccc1)(c1ccccc1)c1ccccc1)[P](c1ccccc1)(c1ccccc1)c1ccccc1
* Brackets on phosphorus are required. In the SMILES specification, the standard valences for phosphorus are 3 and 5. Each phosphorus here forms four bonds (three to phenyl groups and one to Ru), so leaving out the brackets means adding a spurious implicit H atom to each phosphorus.

7007515.cif (an acetylacetonato complex).
Babel 2.2.3 output:
[Pb]12(O[C@H](C)C=C(C)O1)O[C@@H](C)C=C(C)O2
Babel 2.3.2 output:
[Pb@]12(O[C](C)C=C(C)O1)O[C](C)C=C(C)O2
Expected:
[Pb]12([O]=C(C)C=C(C)O1)[O]=C(C)C=C(C)O2
* Both versions try to keep valence two for all oxygen atoms: 2.2.3 adds spurious H atoms (with invented chirality) to one of the C atoms attached to oxygen, whereas 2.3.2 regards these C atoms as trivalent radical centres. I think that the best representation is to write one of the two possible resonance forms of acetylacetonate, with C=O on one side and C-O on the other.

2228718.cif (an imino complex).
Babel 2.2.3 output:
[Cu@]12(Cl)Oc3ccccc3[C@H](N1CC[N]12CCOCC1)C
Babel 2.3.2 output:
[Cu@]12(Cl)Oc3ccccc3[C](N1CC[N]12CCOCC1)C
Expected:
[Cu]12(Cl)Oc3ccccc3C(=[N]1CC[N]12CCOCC1)C
* Similar to the previous example. Trying to keep valence 3 for the imino N atom, 2.2.3 adds a spurious H atom and 2.3.2 makes the imino carbon a radical. The chiral mark at Cu is correct for an individual molecule, but the crystal is racemic and hence it is more correct to remove it.

7105215.cif (an ether complex).
Babel 2.2.3 output:
[Th]12([Cl-])(Cl)([Cl-])(Cl)([O@H](C)CC[O@H]1C)[O@H](C)CC[O@H]2C
Babel 2.3.2 output:
[Th](Cl)Cl.[Cl-].[Cl-].O(C)CCOC.O(C)CCOC
Expected:
[Th]12(Cl)(Cl)(Cl)(Cl)([O](C)CC[O]1C)[O](C)CC[O]2C
* Version 2.2.3 binds spurious H atoms to oxygen (why is an oxygen with valence 4 preferred to an oxygen with valence 3??). Also, two chlorides appear as "Cl" and the other two as "[Cl-]". Version 2.3.2 ignores some Th-Cl and Th-O bonds, leaving some disconnected moieties: probably the Th-Cl and Th-O distances are too long to be considered "bonds" by Openbabel (Th is a large atom!!), and it clearly ignores the "-aB" flag. Using 2.2.3 without the "-aB" flag yields the same result as 2.3.2, whether or not "-aB" is given to the latter.

2004668.cif (a closo carborane).
Babel 2.2.3 output:
[C]1234([CH]567B891B1%102B2%113B345B45%11B%11%102B291B168B734B5%1121)c1ccccc1
Babel 2.3.2 output:
[C@]12([C@H]3[B@H]4BB[B@@H]1[B@H]1[B@@H]2BB[B@@H]3[B@@H]41)c1ccccc1
Expected:
[C]1234([CH]567[BH]891[BH]1%102[BH]2%113[BH]345[BH]45%11[BH]%11%102[BH]291[BH]168[BH]734[BH]5%1121)c1ccccc1
* Boron and carbon are not using their standard valences (both form 6 bonds!), so the use of brackets and the explicit indication of the number of hydrogens are compulsory. In version 2.3.2 many edges of the icosahedron are missing, even though the 30 bonds are listed in the CIF file: apparently, babel 2.3.2 limits the bonds of C and B to four and again ignores the "-aB" flag.

1504361.cif (dimethylferrocene).
Babel 2.2.3 output:
[Fe]12345678([CH]9=[CH]1[C]2(=[CH]4[C@@H]59)C)[C]1(=[CH]6[C@H]7[CH]8=[CH]31)C
Babel 2.3.2 output:
[Fe]([C@H]1C=CC(=C1)C)[C@H]1C=C(C=C1)C
Expected:
[Fe]12345678([cH]9[cH]1[c]2([cH]4[cH]59)C)[c]1([cH]6[cH]7[cH]8[cH]31)C
* 2.2.3 displays a sort of "kekulized" version of ferrocene that is not too bad, except perhaps for treating one carbon of each ring as sp3 with a chirality mark (strictly speaking, the carbon atoms not bearing the methyl could be regarded as "asymmetric"). 2.3.2 again ignores the "-aB" flag and links iron to each ring through only one C atom, which is certainly wrong.

Another test I have made is to use the supposedly "correct" SMILES strings (that is, those listed as "expected" above) as input and check whether each string remains unchanged when piped through openbabel (command "babel xxxxxxx.smi -osmi", with the input file containing the SMILES string). The expected result is input = output.

2008819 (a pyridine complex).
Input:
[Os](Cl)(F)(F)([n]1ccccc1)([n]1ccccc1)[n]1ccccc1
Babel 2.2.3 output:
[Os](Cl)(F)(F)([n]1ccccc1)([n]1ccccc1)[n]1ccccc1
Babel 2.3.2 output:
[Os](Cl)(F)(F)(N1CCCCC1)(N1CCCCC1)N1CCCCC1
* 2.3.2 fully dearomatizes pyridine and converts all CH into CH2. 2.2.3 is OK.

2227419 (a bipyridine complex).
Input:
[Pt]1(Br)(Br)(Br)(Br)[n]2ccccc2c2[n]1cccc2
Babel 2.2.3 output:
[Pt]1(Br)(Br)(Br)(Br)[n]2ccccc2c2[n]1cccc2
Babel 2.3.2 output:
[Pt]1(Br)(Br)(Br)(Br)N2CCCCC2C2N1CCCC2
* Same comment as for the previous example.

2223192 (a phenanthroline complex).
Input:
[Mo]1(F)(F)(=O)(=O)[n]2cccc3c2c2[n]1cccc2cc3
Babel 2.2.3 output:
[Mo]1(=O)(=O)(F)(F)[n]2cccc3c2c2[n]1cccc2cc3
Babel 2.3.2 output:
[Mo]1(=O)(=O)(F)(F)n2cccc3c2c2n1cccc2cc3
* 2.2.3 is OK. 2.3.2 insists on removing the brackets.

8100257 (a phosphane complex).
Input:
[Ru](Cl)(Cl)([P](c1ccccc1)(c1ccccc1)c1ccccc1)([P](c1ccccc1)(c1ccccc1)c1ccccc1)[P](c1ccccc1)(c1ccccc1)c1ccccc1
Babel output (both versions):
[Ru](Cl)(Cl)(P(c1ccccc1)(c1ccccc1)c1ccccc1)(P(c1ccccc1)(c1ccccc1)c1ccccc1)P(c1ccccc1)(c1ccccc1)c1ccccc1
* Babel insists on removing the brackets and hence on adding spurious hydrogens.

7007515 (an acetylacetonato complex).
Input:
[Pb]12([O]=C(C)C=C(C)O1)[O]=C(C)C=C(C)O2
* Both versions keep this string unchanged, so this example is OK.

2228718 (an imino complex).
Input:
[Cu]12(Cl)Oc3ccccc3C(=[N]1CC[N]12CCOCC1)C
* Both versions keep this string unchanged, so this example is OK.

7105215 (an ether complex).
Input:
[Th]12(Cl)(Cl)(Cl)(Cl)([O](C)CC[O]1C)[O](C)CC[O]2C
* Both versions keep this string unchanged, so this example is OK.

2004668 (a closo carborane).
Input:
[C]1234([CH]567[BH]891[BH]1%102[BH]2%113[BH]345[BH]45%11[BH]%11%102[BH]291[BH]168[BH]734[BH]5%1121)c1ccccc1
Babel 2.2.3 output:
[C]1234([CH]567B891B1%102B2%113B345B452B21%11B18%10B869B734B5218)c1ccccc1
Babel 2.3.2 output:
[C]1234([CH]567[BH]891[BH]1%102[BH]2%113[BH]345[BH]452[BH]21%11[BH]18%10[BH]869[BH]734[BH]5218)c1ccccc1
* 2.2.3 removes the brackets and H atoms from boron, which is not correct. 2.3.2 gets it right; this is the only test in which 2.3.2 has performed better than 2.2.3.

1504361 (dimethylferrocene).
Input:
[Fe]12345678([cH]9[cH]1[c]2([cH]4[cH]59)C)[c]1([cH]6[cH]7[cH]8[cH]31)C
Babel output (both versions):
[Fe]12345678(C9C1C2(C3C49)C)C1(C5C6C7C81)C
* Babel simply treats the carbons as non-aromatic; since they form four bonds, the output at least keeps the hydrogen count correct. Looking at the SMILES format specification (can ferrocene be considered "informatically non-aromatic" even if chemically it is certainly aromatic??), I am not able to tell whether the output is correct or not.

I think these examples are a fairly representative sample with which to study and try to improve the performance of openbabel with inorganic compounds.

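The SMILES -> SMILES test described above could be scripted roughly as follows. This is only a minimal sketch in Python, not something I have run against either babel version: the function names (first_smiles, babel_roundtrip, failed_roundtrips) are made up for illustration, it assumes that a "babel" executable is on the PATH and writes its result to standard output as in the commands quoted in this message, and it only includes three of the nine input strings.

```python
import subprocess
import tempfile
from pathlib import Path


def first_smiles(babel_stdout):
    """Return the SMILES token (first field) of the first non-empty line."""
    for line in babel_stdout.splitlines():
        fields = line.split()
        if fields:
            return fields[0]
    return ""


def babel_roundtrip(smiles, babel="babel"):
    """Run 'babel input.smi -osmi' on a one-line file and return the output SMILES.

    Assumption: the executable prints the converted SMILES to standard output.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "input.smi"
        path.write_text(smiles + "\n")
        done = subprocess.run([babel, str(path), "-osmi"],
                              capture_output=True, text=True, check=True)
    return first_smiles(done.stdout)


def failed_roundtrips(cases, convert=babel_roundtrip):
    """Return (identifier, input, output) for every string that comes back changed."""
    failures = []
    for identifier, smiles in cases.items():
        output = convert(smiles)
        if output != smiles:  # plain string comparison, no canonicalization
            failures.append((identifier, smiles, output))
    return failures


# Three of the "expected" strings from this message, keyed by COD identifier.
CASES = {
    "2008819": "[Os](Cl)(F)(F)([n]1ccccc1)([n]1ccccc1)[n]1ccccc1",
    "7007515": "[Pb]12([O]=C(C)C=C(C)O1)[O]=C(C)C=C(C)O2",
    "7105215": "[Th]12(Cl)(Cl)(Cl)(Cl)([O](C)CC[O]1C)[O](C)CC[O]2C",
}
```

Calling failed_roundtrips(CASES) with babel installed would list the strings that do not survive the conversion. Since the comparison is a plain string comparison, it would also flag a harmless reordering of atoms (as in the 2223192 output of 2.2.3 above), so every reported failure still needs to be looked at by hand.
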
Thanks a lot for your interest in this subject.

Best wishes,
Miguel Quirós

On Thu, 19-12-2013 at 10:17 +0100, Miguel Quirós Olozábal wrote:
> Thanks a lot for your message.
>
> It probably will take some time to prepare such a set. I would like to
> include files containing a single different problem each one, with
> compounds as simple as possible and without purely crystallographic
> problems (molecules in symmetry elements, disorder, ...). All these to
> avoid the concurrence of several problems of different nature in the
> same file.
> The set should be representative of the problems more frequently found
> and also not be too large.
>
> When I am satisfied with the selection, I will forward it to you.
>
> Best wishes,
> Miguel Quirós
>
>
> On Wed, 18-12-2013 at 10:06 -0500, Geoffrey Hutchison wrote:
> > That's a pretty bad regression, and I will investigate the two examples you
> > sent.
> >
> > Certainly if you can prepare a test set (in whatever format) that would be
> > extremely helpful, since it could be added as a unit test. Not only would
> > this ensure all such examples will be fixed, but future versions will need
> > to ensure they pass.
> >
> > I'd actually be very interested in such a set for other reasons, since the
> > gen3d builder and other parts of the code (UFF) need similar testing on
> > inorganic compounds.
> >
> > You can either send the CIFs to me personally, or provide entries into the
> > COD, since I can script the downloads.
> >
> > As I said before, I really want to know of these types of bugs (inorganics,
> > but also any type of changes from one release to another).
> >
> > Thanks,
> > Geoff
> >
> > > On Dec 14, 2013, at 1:26 AM, mquiros <mqui...@ugr.es> wrote
> > > On 13/12/2013 22:16, Geoffrey Hutchison wrote:
> > >>> I need to review and, in most cases, fix the SMILES chains coming out
> > >>> from OpenBabel for inorganic compounds (either manually or
> > >>> semiautomatically).
> > >>> I am also stuck to version 2.2.3 because versions
> > >>> newer than this perform worse for inorganic compounds.
> > >>
> > >> If you can give some bug reports or somewhat more detailed
> > >> descriptions of the problems we'd obviously really appreciate it. I
> > >> suspect many of these issues can be resolved, but if we're operating
> > >> in a vacuum, it's hard to know what bugs exist. For example, we're
> > >> firming up plans for v2.4, and obviously, I'd prefer to have improved
> > >> inorganic / organometallic support.
> > >>
> > >> As you say, some issues are inevitable, since there is a mismatch
> > >> between inorganic / organometallic bonding and the valence bond model,
> > >> but that doesn't mean we can't aim to represent things well. For
> > >> example, the latest development code has improved support for "zero
> > >> order" bonds, including an extension to the SD file format.
> > >>
> > >> All of these discussions are quite productive, thanks!
> > >> -Geoff
> > >
> > > Hello.
> > >
> > > Thanks a lot for your interest.
> > >
> > > I think I have already provided examples in previous posts, but if you
> > > want just a couple of quick examples, ferrocene and a pyridine complex
> > > (tetrakispyridine copper(II) chloride) with just SMILES -> SMILES
> > > conversion (the "inorganic problems" are not a CIF format problem but a
> > > more general one). I have prepared the following files with a single line:
> > >
> > > cupyrcl.smi:
> > > [Cu]([n]1ccccc1)([n]1ccccc1)([n]1ccccc1)[n]1ccccc1.[Cl-].[Cl-]
> > > Cupyr4Cl2
> > >
> > > ferrocene.smi:
> > > [Fe]12345678([cH]9[cH]1[cH]2[cH]3[cH]49)[cH]1[cH]5[cH]6[cH]7[cH]18
> > > ferrocene
> > >
> > > If I perform "babel cupyrcl.smi -osmi" and "babel ferrocene.smi -osmi", I
> > > expect the output to be equal to the input (or perhaps just changing the
> > > order of atoms).
> > >
> > > In the first example, I got it right with babel 2.2.3 but with babel
> > > 2.3.2, the output is very wrong:
> > > [Cu](N1CCCCC1)(N1CCCCC1)(N1CCCCC1)N1CCCCC1.[Cl-].[Cl-] Cupyr4Cl2
> > > Full conversion of pyridine into piperidinato, all CH changed to CH2.
> > > Babel 2.3.2 wants to keep valence 3 for nitrogen even at the cost of
> > > completely corrupting the whole heterocycle.
> > >
> > > For ferrocene, with any of the two versions, the output is:
> > > [Fe]12345678(C9C1C2C3C49)C1C5C6C7C81 ferrocene
> > > Again dearomatization, the hydrogen count is however correct in this
> > > case. But I want the conversions to perform substructure search and any
> > > inorganic chemist looking for ferrocene derivatives will regard ferrocene
> > > as an aromatic compound and the search will fail.
> > >
> > > I can provide thousands of examples: metallocenes, metal carbonyls,
> > > phosphane complexes, imino complexes, boranes and carboranes,
> > > acetylacetonato complexes and a long etcetera that include the vast
> > > majority of metal-organic and organometallic compounds. Perhaps I can
> > > prepare a bunch of CIF files including at least one belonging to each of
> > > the most outstanding inorganic families to use as a test bench.
> > >
> > > Thanks again. Best wishes,
> > > Miguel Quirós
>

--
Miguel Quirós Olozábal
Departamento de Química Inorgánica. Facultad de Ciencias.
Universidad de Granada. 18071 Granada. SPAIN.
email: mquiros<at>ugr<dot>es mquiros<arroba>ugr<punto>es

_______________________________________________
OpenBabel-discuss mailing list
OpenBabel-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-discuss