After loading your dataset into a database on my linux machine, I'm starting to wonder about my own answer below:

On Sat, Jul 17, 2010 at 6:02 AM, Greg Landrum <greg.land...@gmail.com> wrote:
>
> There are two parts here:
> 1) The RDKit does a lot of work when it reads a molecule, so it's
> comparatively slow. I generally expect that it will spend 1-4 seconds per
> thousand molecules (depending on cpu speed, obviously). Your set of 25K
> molecules takes (on my macbook) around 6 seconds per thousand. If I break
> that down by block, most of the time is spent on the first molecules:
> [8]>>> for i in range(0,len(s),1000):
>    ...:     t1=time.time()
>    ...:     ms=[s[x] for x in range(i,min(i+1000,len(s)))]
>    ...:     t2=time.time()
>    ...:     print i,'%.2f'%(t2-t1)
>    ...:
> 0 10.75
> 1000 17.78
> 2000 11.03
> 3000 11.01
> 4000 7.73
> 5000 5.08
> 6000 4.62
> 7000 5.14
> 8000 4.44
> 9000 4.08
> ...
> without looking at them, I suspect you have the larger and more complex
> molecules at the beginning of the file? I will see if there are any real
> outliers in the dataset that I can use to suggest further optimizations to
> the molecule processing code.

I just re-ran this experiment on my linux box, which is not exactly
modern (4.5 years old, 2.8GHz Pentium D):

[5]>>> for i in range(0,len(s),1000):
   ...:     t1=time.time()
   ...:     ms=[s[x] for x in range(i,min(i+1000,len(s)))]
   ...:     t2=time.time()
   ...:     print i,'%.2f'%(t2-t1)
   ...:
   ...:
0 2.37
1000 4.10
2000 2.94
3000 2.85
4000 2.17
5000 1.49
6000 1.40
7000 1.52
...

These are numbers much more in line with what I expect.

The resulting database load takes a more reasonable amount of time (in
my eyes):

tjtest=# \timing
Timing is on.
tjtest=# copy mols from '/home/glandrum/t.smi' delimiter ' ';
COPY 25855
Time: 44648.633 ms

And the indexing is also substantially faster than what you saw:

tjtest=# create index midx on mols using gist(m);
CREATE INDEX
Time: 119232.414 ms

Searches are also faster (and, please notice, now they're correct):

tjtest=# select count(id) from mols where m @> 'c1ccccc1C(=O)NC';
 count
-------
   546
(1 row)

Time: 380.455 ms

Could it be that you built the rdkit in debug mode, that your machine
is/was heavily loaded at the time you ran your tests, or that your linux
box is even older than mine?

Meanwhile, I need to go check on my macbook to figure out what happened
there; I guess I was using a debug build, because the macbook is normally
faster than my linux box.

-greg

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
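
For anyone who wants to repeat the per-block timing from the message above on their own file, here is a minimal sketch in Python 3 using only the standard library. The helper name time_blocks and the stand-in list are assumptions made for illustration; the message does not show how s was built, so the sketch works with any object that supports len() and integer indexing (an RDKit molecule supplier is presumably what s was, but that is not confirmed here).

```python
import time


def time_blocks(items, block_size=1000):
    """Time item access in consecutive blocks.

    `items` is any object supporting len() and integer indexing.
    Returns a list of (start_index, elapsed_seconds) pairs, one per block.
    The name and signature are hypothetical, not part of the RDKit.
    """
    timings = []
    for start in range(0, len(items), block_size):
        stop = min(start + block_size, len(items))
        t1 = time.perf_counter()
        # With a lazy supplier, indexing is where the parsing work happens.
        block = [items[x] for x in range(start, stop)]
        t2 = time.perf_counter()
        timings.append((start, t2 - t1))
    return timings


if __name__ == "__main__":
    # Stand-in data; replace with your own indexable molecule source.
    data = list(range(2500))
    for start, seconds in time_blocks(data):
        print(start, "%.2f" % seconds)
```

Returning the timings instead of printing inside the loop makes it easier to compare two machines or two builds side by side. As a rough cross-check against the numbers above, the copy of 25855 rows in 44648.633 ms works out to about 1.7 seconds per thousand molecules, which sits inside the 1.4 to 4.1 second range of the per-block timings on the linux box.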