Hi TJ,

On Fri, Jul 16, 2010 at 4:02 PM, TJ O'Donnell <t...@acm.org> wrote:

>  I'm having a good time playing with your new postgres cartridge.  I've
> run into a few problems I thought you could help with.
>
> First a summary of what I did, then a few questions.  I'm using postgres
> 8.4.4 on linux, your latest rdkit and cartridge from svn, as of last week.
>
>
> Create table rdmol (id integer, smiles text, m mol, mx mol)
>
> id and smiles from drugbank and first 22K of pubchem. see attached smi file
>
>
>
> update rdmol set m=smiles::mol
>
> 25855
>
> 171,693.016 ms
>
> update rdmol set mx=m
>
> 25855
>
> 1,512.227 ms
>
> create index molidx on rdmol using gist(m);
>
>
>  730,077.537 ms
>
> select count(id) from rdmol where mx @> 'c1ccccc1C(=O)NC'
>
> 546
>
> 24,224.787 ms
>
> select count(id) from rdmol where m  @> 'c1ccccc1C(=O)NC'
>
> 399
>
> 570.539 ms
>
>
>
> Is this slow speed to be expected when creating mol from smiles and
> gist(m)?
>
There are two parts here:

1) The RDKit does a lot of work when it reads a molecule, so it's
comparatively slow. I generally expect that it will spend 1-4 seconds per
thousand molecules (depending on cpu speed, obviously). Your set of 25K
molecules takes (on my macbook) around 6 seconds per thousand. If I break
that down by block, most of the time is spent on the first molecules:
[8]>>> for i in range(0,len(s),1000):
   ...:   t1=time.time()
   ...:   ms=[s[x] for x in range(i,min(i+1000,len(s)))]
   ...:   t2=time.time()
   ...:   print i,'%.2f'%(t2-t1)
   ...:
0 10.75
1000 17.78
2000 11.03
3000 11.01
4000 7.73
5000 5.08
6000 4.62
7000 5.14
8000 4.44
9000 4.08
...
Without looking at them, I suspect you have the larger and more complex
molecules at the beginning of the file. I will see if there are any real
outliers in the dataset that I can use to suggest further optimizations to
the molecule processing code.
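
For anyone who wants to repeat the per-block timing on their own file, here is a minimal sketch in current Python. The helper name time_blocks and the stand-in SMILES list are made up for illustration. The only RDKit call used is Chem.MolFromSmiles, and the script falls back to a dummy parser if RDKit is not installed.

```python
import time


def time_blocks(items, parse, block_size=1000):
    """Return a list of (start_index, seconds) for parsing items block by block."""
    timings = []
    for start in range(0, len(items), block_size):
        block = items[start:start + block_size]
        t0 = time.perf_counter()
        for item in block:
            parse(item)
        timings.append((start, time.perf_counter() - t0))
    return timings


try:
    from rdkit import Chem
    parse = Chem.MolFromSmiles
except ImportError:
    parse = len  # stand-in so the sketch still runs without RDKit

# Hypothetical input: replace with the SMILES column read from your own file.
smiles = ["c1ccccc1C(=O)NC", "CCN1C(=O)c2ccccc2C1=O"] * 1500
for start, seconds in time_blocks(smiles, parse):
    print(start, "%.2f" % seconds)
```

Blocks that stand out in the output are the place to look for unusually large or complex molecules.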

2) Molecule indexing speed: this is determined by the speed of the layered
fingerprinting code, which is slow. The fingerprinter enumerates all
(branched and unbranched) molecular paths containing from 1-7 bonds and
hashes them. The inclusion of branched paths makes the process slower, but
(I believe) improves the screenout rate of the fingerprint. There is a good
amount of work left to be done on improving the fingerprinter and its
utility for substructure searching (SSS). We use a different fingerprint at
work for the index, so I haven't spent much time on this stuff.
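
To make the idea of a path-based screening fingerprint concrete, here is a toy sketch in plain Python. It is not the RDKit layered fingerprint: it enumerates only unbranched paths, ignores bond orders and aromaticity, and hashes with zlib.crc32. The function names and the tiny atom/bond lists are made up for illustration. The property it shows is the one the index relies on: every bit set by the query must also be set by any molecule that contains it, so a failed bit test rules a molecule out without running a full substructure match.

```python
import zlib


def linear_paths(atoms, bonds, max_bonds=7):
    """Yield a label string for every simple unbranched path of 1..max_bonds bonds."""
    neighbors = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)

    def extend(path):
        if len(path) > 1:
            labels = [atoms[i] for i in path]
            # Pick one orientation so a path and its reverse hash identically.
            yield "-".join(min(labels, labels[::-1]))
        if len(path) - 1 < max_bonds:
            for nxt in neighbors[path[-1]]:
                if nxt not in path:
                    yield from extend(path + [nxt])

    for start in range(len(atoms)):
        yield from extend([start])


def fingerprint(atoms, bonds, n_bits=2048, max_bonds=7):
    """Hash each path to one bit and return the fingerprint as a Python int."""
    fp = 0
    for path in linear_paths(atoms, bonds, max_bonds):
        fp |= 1 << (zlib.crc32(path.encode()) % n_bits)
    return fp


def may_contain(mol_fp, query_fp):
    """Screen: True if every query bit is also set in the molecule."""
    return (mol_fp & query_fp) == query_fp


# Toy heavy-atom graphs (hypothetical data): C-C(=O)-N-C and the amide core C(=O)N.
mol_atoms, mol_bonds = ["C", "C", "O", "N", "C"], [(0, 1), (1, 2), (1, 3), (3, 4)]
query_atoms, query_bonds = ["C", "O", "N"], [(0, 1), (0, 2)]

mol_fp = fingerprint(mol_atoms, mol_bonds)
query_fp = fingerprint(query_atoms, query_bonds)
print(may_contain(mol_fp, query_fp))  # True: the query is a subgraph of the molecule
```

A True result only means the molecule survives the screen and still needs a real substructure match. A False result for a molecule that does contain the query would mean the fingerprint and the matcher disagree about what counts as a path.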


> More troubling is why are fewer superstructures found when the gist index
> is used?
>
>
>
> select smiles from rdmol where mx @> 'c1ccccc1C(=O)NC'
>
>  except
>
> select smiles from rdmol where m  @> ' c1ccccc1C(=O)NC '
>
>
>
> 147 rows
>
> c1ccc(C(Nc2cc(Cl)ccc2OCC(O)=O)=O)cc1
>
> CCN1C(=O)c2ccccc2C1=O
>
> O=C(Nc1ccc(Br)cc1)c1c(O)c(Br)cc(Br)c1
>
>
Not good! not good at all. I'm willing to live with somewhat slower code for
preprocessing steps, but the results definitely should be correct. There's
clearly a parameter problem somewhere that's giving rise to this. I suspect
I know what it is and will fix it. Thanks for pointing this out!

Best Regards,
-greg
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss