Hi,

On Mon, Oct 11, 2010 at 2:42 AM, Craig A. James <cja...@emolecules.com> wrote:
> Here are the results of another 1.2 million SMILES "shuffle" tests. Each
> shuffle is 20 randomized versions of one SMILES (so 24 million
> canonicalizations total).
>
> Most, maybe all of these are repeats of problems we've already encountered,
> but I thought I'd include them for completeness. I think this brings the
> total number of molecules I've tested to around 3 million, well over half of
> our database.
>
> It's hard to believe anybody really cares about some of these, but that's
> what we're here for!

Thanks again for all the testing. I have several optimization commits ready to push, but I'm going to test them for regressions first. I'll probably test 1 million compounds from the eMolecules database (20x shuffle).

In general, performance is good enough for release now. The canonconsistent test completes in 30 seconds here. Converting cansmi-roundtrip.smi to can takes 16 seconds and to smi takes 8, so canonical coding takes about 50% of the total time when doing a file conversion.

I still need to figure out how to deal with metallocene compounds where there are 8 or more neighbors with the same symmetry class. I already have a hack to handle ferrocene, but we might want to extend this. IIRC, this might also help kekulization?

Metallocene: metal atom sandwiched between rings (4 or more atoms per ring)

Normalization: Remove bonds connecting the metal to ring atoms without increasing the number of disconnected fragments. The bonds will have to be sorted using symmetry classes so that the same bonds are always removed. This reduces the number of states for canonicalization dramatically. It also makes the SMILES nicer, since all the closure digits on the metal can be omitted.

C12C3=C4[Fe]5678923(C1=C45)C1C6=C8C9=C71 --> C1=CC(C=C1)[Fe]C1C=CC=C1

Does this sound like a reasonable solution? These compounds were probably the cause of the 26GB RAM memory issue...

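To make the proposed normalization concrete, here is a minimal Python sketch. It is only an illustration of the idea, not the Open Babel code: the molecule is a plain dict of neighbor sets, and the names count_fragments, normalize_metallocene and ferrocene, as well as the symmetry-class input, are made up for this example. It assumes atoms in the same symmetry class are truly equivalent, so breaking ties by atom index is harmless.

```python
from collections import deque


def count_fragments(adj):
    """Count connected components of a graph given as {atom: set(neighbors)}."""
    seen = set()
    fragments = 0
    for start in adj:
        if start in seen:
            continue
        fragments += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for nbr in adj[atom]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
    return fragments


def normalize_metallocene(adj, metal, ring_atoms, sym_class):
    """Return a copy of adj with redundant metal-ring bonds removed.

    Candidate bonds are visited in (symmetry class, atom index) order, and a
    bond is dropped only if that leaves the number of fragments unchanged.
    Tie-breaking by index is an assumption of this sketch.
    """
    adj = {atom: set(nbrs) for atom, nbrs in adj.items()}  # work on a copy
    baseline = count_fragments(adj)
    candidates = sorted(
        (n for n in adj[metal] if n in ring_atoms),
        key=lambda n: (sym_class[n], n),
    )
    for n in candidates:
        adj[metal].discard(n)
        adj[n].discard(metal)
        if count_fragments(adj) > baseline:
            # Removing this bond would split the molecule, so put it back.
            adj[metal].add(n)
            adj[n].add(metal)
    return adj


def ferrocene():
    """Toy ferrocene graph: atom 0 is Fe, atoms 1-5 and 6-10 are the two rings."""
    adj = {i: set() for i in range(11)}
    for ring in ([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]):
        for i, atom in enumerate(ring):
            nxt = ring[(i + 1) % len(ring)]
            adj[atom].add(nxt)
            adj[nxt].add(atom)
            adj[0].add(atom)
            adj[atom].add(0)
    return adj


# All ten ring carbons share one symmetry class in this toy example.
normalized = normalize_metallocene(
    ferrocene(), 0, set(range(1, 11)), {i: 1 for i in range(1, 11)}
)
print(sorted(normalized[0]))  # [5, 10]
```

In the toy ferrocene, Fe ends up with one bond to each ring, which matches the shape of the normalized SMILES above. The sketch recounts fragments after every removal, which is simple but quadratic; a real implementation would likely want something cheaper.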
Tim > Craig > > http://www.emolecules.com/image?db=549&id=4647317&width=500&height=500 > c12c(C3c4c(c1c...@h]3c(=O)N(C1=O)c1ncccc1)cccc4)cccc2 4647317 > c12c(C3c4c(c...@h]1c3c(=O)N(C1=O)c1ncccc1)cccc4)cccc2 4647317 > > http://www.emolecules.com/image?db=549&id=4413024&width=500&height=500 > C12N(c...@h](SC1CCCC2)CCCC3)C(=O)C 4413024 > C12N(C3C(S[C@@H]1CCCC2)CCCC3)C(=O)C 4413024 > > http://www.emolecules.com/image?db=549&id=4417557&width=500&height=500 > c1(c(cc(cc1)N1C(=O)C2[C@@H](C1=O)C1C(=C(C)C)C2C=C1)Cl)Cl 4417557 > c1(c(cc(cc1)N1C(=O)[C@@H]2C(C1=O)C1C(=C(C)C)C2C=C1)Cl)Cl 4417557 > > http://www.emolecules.com/image?db=549&id=4417833&width=500&height=500 > N1(c2ccc(cc2)F)C(=O)C2[C@@H](C1=O)C1C(=C(C)C)C2C=C1 4417833 > N1(c2ccc(cc2)F)C(=O)[C@@H]2C(C1=O)C1C(=C(C)C)C2C=C1 4417833 > > http://www.emolecules.com/image?db=549&id=4419082&width=500&height=500 > C12(c3c...@h](C1C=C3)C(=O)N(C4=O)c1c(cc(cc1C)C)C)CC2 4419082 > C12(c...@h]4c(C1C=C3)C(=O)N(C4=O)c1c(cc(cc1C)C)C)CC2 4419082 > > http://www.emolecules.com/image?db=549&id=4419333&width=500&height=500 > N1(C(=O)C2[C@@H](C1=O)C1[C@@H]3C(C2C=C1)C(=O)N(C3=O)C1CC1)C1CC1 4419333 > N1(C(=O)[C@@H]2C(C1=O)C1C3[C@@H](C2C=C1)C(=O)N(C3=O)C1CC1)C1CC1 4419333 > > http://www.emolecules.com/image?db=549&id=4420018&width=500&height=500 > c12c(C3c4c(c1c...@h]3c(=O)N(C1=O)NC(=O)c1ccc(cc1)Br)cccc4)cccc2 4420018 > c12c(C3c4c(c...@h]1c3c(=O)N(C1=O)NC(=O)c1ccc(cc1)Br)cccc4)cccc2 4420018 > > http://www.emolecules.com/image?db=549&id=4421669&width=500&height=500 > C12(c3c...@h](C1C=C3)C(=O)N(C4=O)NC(=O)c1ccc(cc1)Br)CC2 4421669 > C12(c...@h]4c(C1C=C3)C(=O)N(C4=O)NC(=O)c1ccc(cc1)Br)CC2 4421669 > > http://www.emolecules.com/image?db=549&id=4422415&width=500&height=500 > C1(=O)C2C3C(=C4C5CC6CC4CC(C5)C6)C([C@@H]2C(=O)N1c1cc(ccc1)C)C=C3 > 4422415 > C1(=O)[...@h]2c3c(=C4C5CC6CC4CC(C5)C6)C(C2C(=O)N1c1cc(ccc1)C)C=C3 4422415 > > http://www.emolecules.com/image?db=549&id=5782498&width=500&height=500 > C12(c...@h](O[C@@](O3)(C(O1)C(O2)C)C)C)C 5782498 > 
[...@]12(C3C(OC(O3)(C(O1)[...@h](O2)C)C)C)C 5782498 > > http://www.emolecules.com/image?db=549&id=4842449&width=500&height=500 > c12c(C(=O)C(=C(C1=O)s...@h]1c([...@h]([C@@H](CO1)OC(=O)C)OC(=O)C)OC(=O)C)s...@h]1[c@@H]([...@h]([C@@H](CO1)OC(=O)C)OC(=O)C)OC(=O)C)cccc2 > 4842449 > c12c(C(=O)C(=C(C1=O)s...@h]1[c@@H]([...@h]([C@@H](CO1)OC(=O)C)OC(=O)C)OC(=O)C)s...@h]1c([...@h]([C@@H](CO1)OC(=O)C)OC(=O)C)OC(=O)C)cccc2 > 4842449 > > http://www.emolecules.com/image?db=549&id=4782286&width=500&height=500 > [...@]12(CC(CN(C1)Cc1ccccc1)(CNC2)C)C 4782286 > C12(c...@](CN(C1)Cc1ccccc1)(CNC2)C)C 4782286 > > http://www.emolecules.com/image?db=549&id=4785090&width=500&height=500 > n12c(c3c(n4c(c5c1CCCC5)nc1c(c4=O)cccc1)cccc3)nc1c(c2=O)cccc1 4785090 > n12c(=O)c3c(nc1c1c(n4c(c5c2cccc5)nc2c(c4=O)cccc2)cccc1)cccc3 4785090 > n12c(=O)c3c(nc1c1c(n4c(c5c2cccc5)nc2c(c4=O)CCCC2)cccc1)cccc3 4785090 > > http://www.emolecules.com/image?db=549&id=5860502&width=500&height=500 > c12=c3c(c4c(c5c(c1csc2)csc5)csc4)csc3 5860502 > C12C(C3C(C4C(C5C1CSC5)CSC4)CSC3)CSC2 5860502 > > /tmp/babel29542_1505.smi > http://www.emolecules.com/image?db=549&id=5860663&width=500&height=500 > c12c3c4c5c(ccc4c4c(c3ccc1cccc2)nc1c(n4)cc2c(c1)cccc2)cccc5 5860663 > c12c3c(ccc2c2c(c4c1c1c(cc4)CCCC1)nc1c(n2)cc2c(c1)cccc2)cccc3 5860663 > > http://www.emolecules.com/image?db=549&id=5860665&width=500&height=500 > c12c3c4c(ccc3c3c(c2ccc2c1cccc2)nc1c(n3)c2c(c3c1ccc1c3cccc1)c1c(cc2)cccc1)cccc4 > 5860665 > c12c3c4c(ccc3c3c(c2ccc2c1cccc2)nc1c(n3)c2c(c3c1ccc1c3CCCC1)c1c(cc2)cccc1)cccc4 > 5860665 > c12c3c4c(ccc3c3c(c2ccc2c1cccc2)nc1c(n3)c2c(c3c1ccc1c3CCCC1)c1c(cc2)CCCC1)cccc4 > 5860665 > c12c3c4c(ccc3c3c(c2ccc2c1cccc2)nc1c(n3)c2c(c3c1CCc1c3cccc1)c1c(cc2)cccc1)cccc4 > 5860665 > c12c3c(c4c(c5c3c3c(CC5)cccc3)nc3c(n4)c4c(c5c3ccc3c5cccc3)c3c(CC4)cccc3)ccc1cccc2 > 5860665 > c12c3c(ccc1c1c(c4c2c2c(CC4)cccc2)nc2c(n1)c1c(c4c2ccc2c4CCCC2)c2c(cc1)CCCC2)cccc3 > 5860665 > 
c12c3c(ccc2c2c(c4c1c1c(cc4)CCCC1)nc1c(n2)c2c(c4c1ccc1c4CCCC1)c1c(CC2)cccc1)cccc3 > 5860665 > c12c3c(ccc2c2c(c4c1c1c(cc4)CCCC1)nc1c(n2)c2c(c4c1CCc1c4cccc1)c1c(cc2)cccc1)cccc3 > 5860665 > c12c3c(ccc2c2c(c4c1c1c(cc4)CCCC1)nc1c(n2)c2c(c4c1CCc1c4cccc1)c1c(cc2)CCCC1)cccc3 > 5860665 > > http://www.emolecules.com/image?db=549&id=6137697&width=500&height=500 > c12=c(nn2)ssnc1S 6137697 > c12c(nn2)ssnc1S 6137697 > > http://www.emolecules.com/image?db=549&id=5863122&width=500&height=500 > C1([N+](=O)[O-])C2C[C@@h]3...@h]1c[c@H](C2)C3 5863122 > C1([N+](=O)[O-])[C@@H]2C[C@@h]3cc1...@h](C2)C3 5863122 > > http://www.emolecules.com/image?db=549&id=5865030&width=500&height=500 > c12c3c4c5c6c7c8c(ccc7ccc6ccc5ccc4ccc3ccc1cccc2)cccc8 5865030 > c12c3c(ccc2ccc2c1c1c4c5c6c(ccc5ccc4ccc1cc2)CCCC6)cccc3 5865030 > > http://www.emolecules.com/image?db=549&id=5865292&width=500&height=500 > c12c3c4c5c6c7c8c9c(ccc8ccc7ccc6ccc5ccc4ccc3ccc1cccc2)cccc9 5865292 > c12c3c4c5c6c7c8c9c(CCc8ccc7ccc6ccc5ccc4ccc3ccc1cccc2)cccc9 5865292 > c12c3c4c(ccc3ccc2ccc2c1c1c3c5c6c(ccc5ccc3ccc1cc2)CCCC6)CCCC4 5865292 > c12c3c(ccc2ccc2c1c1c4c5c6c7c(ccc6ccc5ccc4ccc1cc2)CCCC7)cccc3 5865292 > > http://www.emolecules.com/image?db=549&id=5865338&width=500&height=500 > C(C(N(C)C)C)(c1ccccc1)o...@h]([C@@H](N(C)C)C)(c1ccccc1)O 5865338 > [...@h]([C@@H](N(C)C)C)(c1ccccc1)O.C(C(N(C)C)C)(c1ccccc1)O 5865338 > > http://www.emolecules.com/image?db=549&id=5865516&width=500&height=500 > c12c3c4c5c6c(c7c8c9c%10c(CCc9cc(c8ccc7cc6)Br)cccc%10)ccc5ccc4c(cc3ccc1cccc2)Br > 5865516 > c12c3c4c5c(c6c7c8c9c(CCc8cc(c7ccc6cc5)Br)cccc9)ccc4ccc3c(cc2ccc2c1CCCC2)Br > 5865516 > c12c3c(ccc2cc(c2c1c1c4c(c5c6C7c8c(CCC7CC(c6ccc5cc4)Br)cccc8)ccc1cc2)Br)cccc3 > 5865516 > > http://www.emolecules.com/image?db=549&id=6215783&width=500&height=500 > [Mn]12345678(C9(C7(C4(C1(C89C)C)C)C)C)C1(C5(C3(C2(C61C)C)C)C)C 6215783 > [Mn]12345678(C9(C7(C6(C1(C89C)C)C)C)C)C1(C4(C3(C2(C51C)C)C)C)C 6215783 > > http://www.emolecules.com/image?db=549&id=8294208&width=500&height=500 
> C1(c2ccccc2)OC([...@h]2[c@@H](COC(O2)c2ccccc2)O)[C@@H](CO1)O 8294208 > C1(c2ccccc2)O[C@@H](C2[C@@H](COC(O2)c2ccccc2)O)[C@@H](CO1)O 8294208 > > http://www.emolecules.com/image?db=549&id=8622926&width=500&height=500 > c1(C(=O)Nc2nc[nH]n2)cc(ccc1)C.c1(C(=O)Nc2[nH]cnn2)cc(ccc1)C 8622926 > c1(C(=O)Nc2[nH]cnn2)cc(ccc1)C.c1(C(=O)Nc2nc[nH]n2)cc(ccc1)C 8622926 > > http://www.emolecules.com/image?db=549&id=10434721&width=500&height=500 > C([C@@H](C(=O)O)O)(C(=O)O)O 10434721 > [...@h](C(C(=O)O)O)(C(=O)O)O 10434721 > > http://www.emolecules.com/image?db=549&id=11467042&width=500&height=500 > [...@]12([Ce]3456789%10%11%12%13(c1c%13c%1...@h]23)([C@@]1(C6C9C8[C@@H]71)C(C)C)[C@@]1([C@@H]4[C@@H]%11C%10C51)C(C)C)C(C)C > 11467042 > [C@@]12([Ce]3456789%10%11%12%13(c1c%1...@h]%12[c@@H]23)([C@@]1(C6C9C8[C@@H]71)C(C)C)[C@@]1(C4C%11C%10[C@@H]51)C(C)C)C(C)C > 11467042 > [C@@]12([Ce]3456789%10%11%12%13([C@@H]1C%13C%12C23)([C@@]1(C6C9C8[C@@H]71)C(C)C)[C@@]1([C@@H]4[C@@H]%11C%10C51)C(C)C)C(C)C > 11467042 > [...@]12([Ce]3456789%10%11%12%13([...@h]1[c@@H]%13C%12C23)([C@@]1(C6C9C8[C@@H]71)C(C)C)[C@@]1(C4C%11C%10[C@@H]51)C(C)C)C(C)C > 11467042 > > http://www.emolecules.com/image?db=549&id=11467044&width=500&height=500 > [...@]12([Ce]3456789%10%11%12%13(c1c%13c%1...@h]23)([C@@]1(C6C9C8[C@@H]71)CC)[C@@]1([C@@H]4[C@@H]%11C%10C51)CC)CC > 11467044 > [...@]12([Ce]3456789%10%11%12%13([...@h]1[c@@H]%13C%12C23)([C@@]1(C6C9C8[C@@H]71)CC)[C@@]1(C4C%11C%10[C@@H]51)CC)CC > 11467044 > > http://www.emolecules.com/image?db=549&id=11467050&width=500&height=500 > [Hf]12345678([C@@]9(c7c4c...@h]89)C(C)C)([C@@]1([C@@H]5[C@@H]3C2C61)C(C)C)(Cl)Cl > 11467050 > [Hf]12345678([C@@]9([C@@H]7[C@@H]4C1C89)C(C)C)([C@@]1(c5c3c...@h]61)C(C)C)(Cl)Cl > 11467050 > > http://www.emolecules.com/image?db=549&id=13170715&width=500&height=500 > S(=O)(=O)(c1ccc(cc1)C)NN=c1c2c1cccc2 13170715 > S(=O)(=O)(c1ccc(cc1)C)NN=C1C2=C1C=CC=C2 13170715 > > > ------------------------------------------------------------------------------ > Beautiful 
is writing same markup. Internet Explorer 9 supports > standards for HTML5, CSS3, SVG 1.1, ECMAScript5, and DOM L2 & L3. > Spend less time writing and rewriting code and more time creating great > experiences on the web. Be a part of the beta today. > http://p.sf.net/sfu/beautyoftheweb > _______________________________________________ > OpenBabel-Devel mailing list > OpenBabel-Devel@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/openbabel-devel