Hi,

On Mon, Oct 11, 2010 at 2:42 AM, Craig A. James <cja...@emolecules.com> wrote:
> Here are the results of another 1.2 million SMILES "shuffle" tests.  Each 
> shuffle is 20 randomized versions of one SMILES (so 24 million 
> canonicalizations total).
>
> Most, maybe all of these are repeats of problems we've already encountered,
> but I thought I'd include them for completeness.  I think this brings the 
> total number of molecules I've tested to around 3 million, well over half of 
> our database.
>
> It's hard to believe anybody really cares about some of these, but that's 
> what we're here for!

Thanks again for all the testing. I have several optimization commits
ready to push, but I'm going to test them for regressions first. I'll
probably test 1 million compounds from the eMolecules database (20x
shuffle).
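
For reference, the shuffle check boils down to something like the
sketch below. It is only an illustration: find_inconsistent and
toy_canon are made-up names, and toy_canon (sorting the characters) is
a stand-in so the example runs on its own. It is not a real SMILES
canonicalizer; in practice the callable would wrap the actual
conversion to can.

```python
def find_inconsistent(shuffles_by_id, canonicalize):
    """Return {mol_id: set of canonical strings} for every molecule whose
    shuffled SMILES do not all canonicalize to the same string."""
    failures = {}
    for mol_id, variants in shuffles_by_id.items():
        forms = {canonicalize(smiles) for smiles in variants}
        if len(forms) > 1:
            failures[mol_id] = forms
    return failures


def toy_canon(smiles):
    # Stand-in only (assumption): sorting characters is NOT a valid
    # canonicalization, it just gives an order-independent string here.
    return "".join(sorted(smiles))


# Two shuffled spellings of ethanol, keyed by a hypothetical ID.
shuffles = {"ethanol": ["OCC", "CCO"]}

consistent = find_inconsistent(shuffles, toy_canon)  # nothing reported
broken = find_inconsistent(shuffles, str)            # identity "canonicalizer" fails
print(consistent, broken)
```

Any ID that comes back with more than one string is a failure of the
kind listed in the report below.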

In general, performance is good enough for release now. The
canonconsistent test completes in 30 seconds here. Converting
cansmi-roundtrip.smi to can takes 16 seconds and to smi takes 8, so
canonical coding accounts for about 50% of the total time when doing
a file conversion.

I still need to figure out how to deal with metallocene compounds
where there are 8 or more neighbors with the same symmetry class. I
already have a hack to handle ferrocene but we might want to extend
this. IIRC, this might also help kekulization?

Metallocene: metal atom sandwiched between rings (4 or more atoms per ring)
Normalization: Remove bonds connecting metal to ring atoms without
increasing the number of disconnected fragments. Bonds will have to be
sorted using symmetry classes to always remove the same bonds.
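
A minimal sketch of that normalization on a plain bond list, to make
the idea concrete. Nothing here is Open Babel code: the function
names, the integer atom indices and the sym_class mapping are all
assumptions for illustration. Ties within a symmetry class are broken
by atom index, which is only a placeholder since a real implementation
would need a tie-breaker that does not depend on input order.

```python
def count_fragments(n_atoms, bonds):
    """Count connected components using a small union-find."""
    parent = list(range(n_atoms))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in bonds:
        parent[find(a)] = find(b)
    return len({find(i) for i in range(n_atoms)})


def normalize_metallocene(n_atoms, bonds, metal, sym_class):
    """Greedily drop metal-ring bonds as long as the number of
    disconnected fragments does not increase."""
    kept = {tuple(sorted(b)) for b in bonds}
    baseline = count_fragments(n_atoms, kept)

    def sort_key(bond):
        other = bond[0] if bond[1] == metal else bond[1]
        # Symmetry class first; atom index is a placeholder tie-breaker.
        return (sym_class[other], other)

    metal_bonds = [b for b in kept if metal in b]
    for bond in sorted(metal_bonds, key=sort_key):
        trial = kept - {bond}
        if count_fragments(n_atoms, trial) == baseline:
            kept = trial
    return sorted(kept)


# Ferrocene-like graph: atom 0 is the metal, 1-5 and 6-10 are the rings.
ring_a = [(1, 2), (2, 3), (3, 4), (4, 5), (1, 5)]
ring_b = [(6, 7), (7, 8), (8, 9), (9, 10), (6, 10)]
metal_to_ring = [(0, i) for i in range(1, 11)]
sym = {0: 0, **{i: 1 for i in range(1, 11)}}  # all ring atoms equivalent

normalized = normalize_metallocene(11, ring_a + ring_b + metal_to_ring, 0, sym)
print([b for b in normalized if 0 in b])
```

On this input the metal ends up with one bond to each ring, which is
the shape of the simplified ferrocene SMILES shown below.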

This reduces the number of states for canonicalization dramatically.
This also makes the smiles nicer since all the closure digits can be
omitted.
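
As a rough feel for the size of the reduction: if tied neighbors were
tried in every possible order, n neighbors in the same symmetry class
would give n! orderings at that atom. This is a worst-case upper bound
under that assumption, not a description of what the canonicalization
code actually enumerates, and tied_orderings is a made-up helper.

```python
from math import factorial


def tied_orderings(n_tied):
    """Upper bound on orderings of n_tied symmetry-equivalent neighbors."""
    return factorial(n_tied)


for n in (2, 8, 10):
    print(n, tied_orderings(n))
```

Going from 10 tied neighbors on the metal to 2 shrinks that bound from
3628800 to 2.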

C12C3=C4[Fe]5678923(C1=C45)C1C6=C8C9=C71  -->  C1=CC(C=C1)[Fe]C1C=CC=C1

Does this sound like a reasonable solution?

These compounds were probably the cause of the 26 GB memory issue...

Tim

> Craig
>
> http://www.emolecules.com/image?db=549&id=4647317&width=500&height=500
> c12c(C3c4c(c1c...@h]3c(=O)N(C1=O)c1ncccc1)cccc4)cccc2   4647317
> c12c(C3c4c(c...@h]1c3c(=O)N(C1=O)c1ncccc1)cccc4)cccc2   4647317
>
> http://www.emolecules.com/image?db=549&id=4413024&width=500&height=500
> C12N(c...@h](SC1CCCC2)CCCC3)C(=O)C      4413024
> C12N(C3C(S[C@@H]1CCCC2)CCCC3)C(=O)C     4413024
>
> http://www.emolecules.com/image?db=549&id=4417557&width=500&height=500
> c1(c(cc(cc1)N1C(=O)C2[C@@H](C1=O)C1C(=C(C)C)C2C=C1)Cl)Cl        4417557
> c1(c(cc(cc1)N1C(=O)[C@@H]2C(C1=O)C1C(=C(C)C)C2C=C1)Cl)Cl        4417557
>
> http://www.emolecules.com/image?db=549&id=4417833&width=500&height=500
> N1(c2ccc(cc2)F)C(=O)C2[C@@H](C1=O)C1C(=C(C)C)C2C=C1     4417833
> N1(c2ccc(cc2)F)C(=O)[C@@H]2C(C1=O)C1C(=C(C)C)C2C=C1     4417833
>
> http://www.emolecules.com/image?db=549&id=4419082&width=500&height=500
> C12(c3c...@h](C1C=C3)C(=O)N(C4=O)c1c(cc(cc1C)C)C)CC2    4419082
> C12(c...@h]4c(C1C=C3)C(=O)N(C4=O)c1c(cc(cc1C)C)C)CC2    4419082
>
> http://www.emolecules.com/image?db=549&id=4419333&width=500&height=500
> N1(C(=O)C2[C@@H](C1=O)C1[C@@H]3C(C2C=C1)C(=O)N(C3=O)C1CC1)C1CC1 4419333
> N1(C(=O)[C@@H]2C(C1=O)C1C3[C@@H](C2C=C1)C(=O)N(C3=O)C1CC1)C1CC1 4419333
>
> http://www.emolecules.com/image?db=549&id=4420018&width=500&height=500
> c12c(C3c4c(c1c...@h]3c(=O)N(C1=O)NC(=O)c1ccc(cc1)Br)cccc4)cccc2 4420018
> c12c(C3c4c(c...@h]1c3c(=O)N(C1=O)NC(=O)c1ccc(cc1)Br)cccc4)cccc2 4420018
>
> http://www.emolecules.com/image?db=549&id=4421669&width=500&height=500
> C12(c3c...@h](C1C=C3)C(=O)N(C4=O)NC(=O)c1ccc(cc1)Br)CC2 4421669
> C12(c...@h]4c(C1C=C3)C(=O)N(C4=O)NC(=O)c1ccc(cc1)Br)CC2 4421669
>
> http://www.emolecules.com/image?db=549&id=4422415&width=500&height=500
> C1(=O)C2C3C(=C4C5CC6CC4CC(C5)C6)C([C@@H]2C(=O)N1c1cc(ccc1)C)C=C3        
> 4422415
> C1(=O)[...@h]2c3c(=C4C5CC6CC4CC(C5)C6)C(C2C(=O)N1c1cc(ccc1)C)C=C3 4422415
>
> http://www.emolecules.com/image?db=549&id=5782498&width=500&height=500
> C12(c...@h](O[C@@](O3)(C(O1)C(O2)C)C)C)C        5782498
> [...@]12(C3C(OC(O3)(C(O1)[...@h](O2)C)C)C)C 5782498
>
> http://www.emolecules.com/image?db=549&id=4842449&width=500&height=500
> c12c(C(=O)C(=C(C1=O)s...@h]1c([...@h]([C@@H](CO1)OC(=O)C)OC(=O)C)OC(=O)C)s...@h]1[c@@H]([...@h]([C@@H](CO1)OC(=O)C)OC(=O)C)OC(=O)C)cccc2
>       4842449
> c12c(C(=O)C(=C(C1=O)s...@h]1[c@@H]([...@h]([C@@H](CO1)OC(=O)C)OC(=O)C)OC(=O)C)s...@h]1c([...@h]([C@@H](CO1)OC(=O)C)OC(=O)C)OC(=O)C)cccc2
>       4842449
>
> http://www.emolecules.com/image?db=549&id=4782286&width=500&height=500
> [...@]12(CC(CN(C1)Cc1ccccc1)(CNC2)C)C     4782286
> C12(c...@](CN(C1)Cc1ccccc1)(CNC2)C)C     4782286
>
> http://www.emolecules.com/image?db=549&id=4785090&width=500&height=500
> n12c(c3c(n4c(c5c1CCCC5)nc1c(c4=O)cccc1)cccc3)nc1c(c2=O)cccc1    4785090
> n12c(=O)c3c(nc1c1c(n4c(c5c2cccc5)nc2c(c4=O)cccc2)cccc1)cccc3    4785090
> n12c(=O)c3c(nc1c1c(n4c(c5c2cccc5)nc2c(c4=O)CCCC2)cccc1)cccc3    4785090
>
> http://www.emolecules.com/image?db=549&id=5860502&width=500&height=500
> c12=c3c(c4c(c5c(c1csc2)csc5)csc4)csc3   5860502
> C12C(C3C(C4C(C5C1CSC5)CSC4)CSC3)CSC2    5860502
>
> /tmp/babel29542_1505.smi
> http://www.emolecules.com/image?db=549&id=5860663&width=500&height=500
> c12c3c4c5c(ccc4c4c(c3ccc1cccc2)nc1c(n4)cc2c(c1)cccc2)cccc5      5860663
> c12c3c(ccc2c2c(c4c1c1c(cc4)CCCC1)nc1c(n2)cc2c(c1)cccc2)cccc3    5860663
>
> http://www.emolecules.com/image?db=549&id=5860665&width=500&height=500
> c12c3c4c(ccc3c3c(c2ccc2c1cccc2)nc1c(n3)c2c(c3c1ccc1c3cccc1)c1c(cc2)cccc1)cccc4
>   5860665
> c12c3c4c(ccc3c3c(c2ccc2c1cccc2)nc1c(n3)c2c(c3c1ccc1c3CCCC1)c1c(cc2)cccc1)cccc4
>   5860665
> c12c3c4c(ccc3c3c(c2ccc2c1cccc2)nc1c(n3)c2c(c3c1ccc1c3CCCC1)c1c(cc2)CCCC1)cccc4
>   5860665
> c12c3c4c(ccc3c3c(c2ccc2c1cccc2)nc1c(n3)c2c(c3c1CCc1c3cccc1)c1c(cc2)cccc1)cccc4
>   5860665
> c12c3c(c4c(c5c3c3c(CC5)cccc3)nc3c(n4)c4c(c5c3ccc3c5cccc3)c3c(CC4)cccc3)ccc1cccc2
>         5860665
> c12c3c(ccc1c1c(c4c2c2c(CC4)cccc2)nc2c(n1)c1c(c4c2ccc2c4CCCC2)c2c(cc1)CCCC2)cccc3
>         5860665
> c12c3c(ccc2c2c(c4c1c1c(cc4)CCCC1)nc1c(n2)c2c(c4c1ccc1c4CCCC1)c1c(CC2)cccc1)cccc3
>         5860665
> c12c3c(ccc2c2c(c4c1c1c(cc4)CCCC1)nc1c(n2)c2c(c4c1CCc1c4cccc1)c1c(cc2)cccc1)cccc3
>         5860665
> c12c3c(ccc2c2c(c4c1c1c(cc4)CCCC1)nc1c(n2)c2c(c4c1CCc1c4cccc1)c1c(cc2)CCCC1)cccc3
>         5860665
>
> http://www.emolecules.com/image?db=549&id=6137697&width=500&height=500
> c12=c(nn2)ssnc1S        6137697
> c12c(nn2)ssnc1S 6137697
>
> http://www.emolecules.com/image?db=549&id=5863122&width=500&height=500
> C1([N+](=O)[O-])C2C[C@@h]3...@h]1c[c@H](C2)C3   5863122
> C1([N+](=O)[O-])[C@@H]2C[C@@h]3cc1...@h](C2)C3  5863122
>
> http://www.emolecules.com/image?db=549&id=5865030&width=500&height=500
> c12c3c4c5c6c7c8c(ccc7ccc6ccc5ccc4ccc3ccc1cccc2)cccc8    5865030
> c12c3c(ccc2ccc2c1c1c4c5c6c(ccc5ccc4ccc1cc2)CCCC6)cccc3  5865030
>
> http://www.emolecules.com/image?db=549&id=5865292&width=500&height=500
> c12c3c4c5c6c7c8c9c(ccc8ccc7ccc6ccc5ccc4ccc3ccc1cccc2)cccc9      5865292
> c12c3c4c5c6c7c8c9c(CCc8ccc7ccc6ccc5ccc4ccc3ccc1cccc2)cccc9      5865292
> c12c3c4c(ccc3ccc2ccc2c1c1c3c5c6c(ccc5ccc3ccc1cc2)CCCC6)CCCC4    5865292
> c12c3c(ccc2ccc2c1c1c4c5c6c7c(ccc6ccc5ccc4ccc1cc2)CCCC7)cccc3    5865292
>
> http://www.emolecules.com/image?db=549&id=5865338&width=500&height=500
> C(C(N(C)C)C)(c1ccccc1)o...@h]([C@@H](N(C)C)C)(c1ccccc1)O        5865338
> [...@h]([C@@H](N(C)C)C)(c1ccccc1)O.C(C(N(C)C)C)(c1ccccc1)O        5865338
>
> http://www.emolecules.com/image?db=549&id=5865516&width=500&height=500
> c12c3c4c5c6c(c7c8c9c%10c(CCc9cc(c8ccc7cc6)Br)cccc%10)ccc5ccc4c(cc3ccc1cccc2)Br
>   5865516
> c12c3c4c5c(c6c7c8c9c(CCc8cc(c7ccc6cc5)Br)cccc9)ccc4ccc3c(cc2ccc2c1CCCC2)Br    
>   5865516
> c12c3c(ccc2cc(c2c1c1c4c(c5c6C7c8c(CCC7CC(c6ccc5cc4)Br)cccc8)ccc1cc2)Br)cccc3  
>   5865516
>
> http://www.emolecules.com/image?db=549&id=6215783&width=500&height=500
> [Mn]12345678(C9(C7(C4(C1(C89C)C)C)C)C)C1(C5(C3(C2(C61C)C)C)C)C  6215783
> [Mn]12345678(C9(C7(C6(C1(C89C)C)C)C)C)C1(C4(C3(C2(C51C)C)C)C)C  6215783
>
> http://www.emolecules.com/image?db=549&id=8294208&width=500&height=500
> C1(c2ccccc2)OC([...@h]2[c@@H](COC(O2)c2ccccc2)O)[C@@H](CO1)O      8294208
> C1(c2ccccc2)O[C@@H](C2[C@@H](COC(O2)c2ccccc2)O)[C@@H](CO1)O     8294208
>
> http://www.emolecules.com/image?db=549&id=8622926&width=500&height=500
> c1(C(=O)Nc2nc[nH]n2)cc(ccc1)C.c1(C(=O)Nc2[nH]cnn2)cc(ccc1)C     8622926
> c1(C(=O)Nc2[nH]cnn2)cc(ccc1)C.c1(C(=O)Nc2nc[nH]n2)cc(ccc1)C     8622926
>
> http://www.emolecules.com/image?db=549&id=10434721&width=500&height=500
> C([C@@H](C(=O)O)O)(C(=O)O)O     10434721
> [...@h](C(C(=O)O)O)(C(=O)O)O      10434721
>
> http://www.emolecules.com/image?db=549&id=11467042&width=500&height=500
> [...@]12([Ce]3456789%10%11%12%13(c1c%13c%1...@h]23)([C@@]1(C6C9C8[C@@H]71)C(C)C)[C@@]1([C@@H]4[C@@H]%11C%10C51)C(C)C)C(C)C
>         11467042
> [C@@]12([Ce]3456789%10%11%12%13(c1c%1...@h]%12[c@@H]23)([C@@]1(C6C9C8[C@@H]71)C(C)C)[C@@]1(C4C%11C%10[C@@H]51)C(C)C)C(C)C
>        11467042
> [C@@]12([Ce]3456789%10%11%12%13([C@@H]1C%13C%12C23)([C@@]1(C6C9C8[C@@H]71)C(C)C)[C@@]1([C@@H]4[C@@H]%11C%10C51)C(C)C)C(C)C
>       11467042
> [...@]12([Ce]3456789%10%11%12%13([...@h]1[c@@H]%13C%12C23)([C@@]1(C6C9C8[C@@H]71)C(C)C)[C@@]1(C4C%11C%10[C@@H]51)C(C)C)C(C)C
>         11467042
>
> http://www.emolecules.com/image?db=549&id=11467044&width=500&height=500
> [...@]12([Ce]3456789%10%11%12%13(c1c%13c%1...@h]23)([C@@]1(C6C9C8[C@@H]71)CC)[C@@]1([C@@H]4[C@@H]%11C%10C51)CC)CC
>  11467044
> [...@]12([Ce]3456789%10%11%12%13([...@h]1[c@@H]%13C%12C23)([C@@]1(C6C9C8[C@@H]71)CC)[C@@]1(C4C%11C%10[C@@H]51)CC)CC
>  11467044
>
> http://www.emolecules.com/image?db=549&id=11467050&width=500&height=500
> [Hf]12345678([C@@]9(c7c4c...@h]89)C(C)C)([C@@]1([C@@H]5[C@@H]3C2C61)C(C)C)(Cl)Cl
>         11467050
> [Hf]12345678([C@@]9([C@@H]7[C@@H]4C1C89)C(C)C)([C@@]1(c5c3c...@h]61)C(C)C)(Cl)Cl
>         11467050
>
> http://www.emolecules.com/image?db=549&id=13170715&width=500&height=500
> S(=O)(=O)(c1ccc(cc1)C)NN=c1c2c1cccc2    13170715
> S(=O)(=O)(c1ccc(cc1)C)NN=C1C2=C1C=CC=C2 13170715
>
>
> ------------------------------------------------------------------------------
> Beautiful is writing same markup. Internet Explorer 9 supports
> standards for HTML5, CSS3, SVG 1.1,  ECMAScript5, and DOM L2 & L3.
> Spend less time writing and  rewriting code and more time creating great
> experiences on the web. Be a part of the beta today.
> http://p.sf.net/sfu/beautyoftheweb
> _______________________________________________
> OpenBabel-Devel mailing list
> OpenBabel-Devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/openbabel-devel
>
