After loading your dataset into a database on my linux machine, I'm
starting to wonder about my own answer below:

On Sat, Jul 17, 2010 at 6:02 AM, Greg Landrum <greg.land...@gmail.com> wrote:
>
> There are two parts here:
> 1) The RDKit does a lot of work when it reads a molecule, so it's 
> comparatively slow. I generally expect that it will spend 1-4 seconds per 
> thousand molecules (depending on cpu speed, obviously). Your set of 25K 
> molecules takes (on my macbook) around 6 seconds per thousand. If I break 
> that down by block, most of the time is spent on the first molecules:
> [8]>>> for i in range(0,len(s),1000):
>    ...:   t1=time.time()
>    ...:   ms=[s[x] for x in range(i,min(i+1000,len(s)))]
>    ...:   t2=time.time()
>    ...:   print i,'%.2f'%(t2-t1)
>    ...:
> 0 10.75
> 1000 17.78
> 2000 11.03
> 3000 11.01
> 4000 7.73
> 5000 5.08
> 6000 4.62
> 7000 5.14
> 8000 4.44
> 9000 4.08
> ...
> without looking at them, I suspect you have the larger and more complex 
> molecules at the beginning of the file? I will see if there are any real 
> outliers in the dataset that I can use to suggest further optimizations to 
> the molecule processing code.

I just re-ran this experiment on my linux box, which is not exactly
modern (4.5 years old, 2.8GHz Pentium D):
[5]>>> for i in range(0,len(s),1000):
   ...:     t1=time.time()
   ...:     ms=[s[x] for x in range(i,min(i+1000,len(s)))]
   ...:     t2=time.time()
   ...:     print i,'%.2f'%(t2-t1)
   ...:
   ...:
0 2.37
1000 4.10
2000 2.94
3000 2.85
4000 2.17
5000 1.49
6000 1.40
7000 1.52
...

These are numbers much more in line with what I expect. The resulting
database load takes a more reasonable amount of time (in my eyes):
tjtest=# \timing
Timing is on.
tjtest=# copy mols from '/home/glandrum/t.smi' delimiter ' ';
COPY 25855
Time: 44648.633 ms

And the indexing is also substantially faster than what you saw:
tjtest=# create index midx on mols using gist(m);
CREATE INDEX
Time: 119232.414 ms
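
As a rough sanity check on those two timings, dividing each by the row
count gives an average cost per molecule. The sketch below only redoes
that arithmetic with the figures from the psql session; the helper name
ms_per_row is made up for illustration and is not part of the RDKit or
PostgreSQL.

```python
# Figures copied from the psql session above.
N_ROWS = 25855          # rows reported by COPY
COPY_MS = 44648.633     # time for the COPY
INDEX_MS = 119232.414   # time for CREATE INDEX


def ms_per_row(total_ms, n_rows):
    """Average wall-clock milliseconds spent per row (hypothetical helper)."""
    return total_ms / n_rows


copy_per_row = ms_per_row(COPY_MS, N_ROWS)
index_per_row = ms_per_row(INDEX_MS, N_ROWS)

# Milliseconds per row is numerically the same as seconds per thousand rows,
# which makes the result directly comparable to the per-block timings.
print("copy:  %.2f s per thousand molecules" % copy_per_row)
print("index: %.2f s per thousand molecules" % index_per_row)
```

The COPY figure works out to a bit under 2 seconds per thousand
molecules, which sits in the same range as the per-block parse times
from the loop above.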

Searches are also faster (and, please notice, now they're correct):

tjtest=# select count(id) from mols where m  @> 'c1ccccc1C(=O)NC';
 count
-------
   546
(1 row)

Time: 380.455 ms

Could it be that you built the RDKit in debug mode, that your machine
was heavily loaded when you ran your tests, or that your linux box is
even older than mine?
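
If you want to repeat the per-block timing on your side to compare,
here is a minimal sketch of the same loop wrapped in a function. It is
written for Python 3, whereas the session above is Python 2. The name
time_blocks and the stand-in workload are assumptions for illustration;
with the RDKit you would pass something like `lambda i: s[i]`, where s
is the supplier used above. Unlike the loop above, it does not keep the
fetched items in a list.

```python
import time


def time_blocks(fetch, n_items, block_size=1000):
    """Call fetch(i) for every index, timing each consecutive block.

    Returns a list of (block_start, elapsed_seconds) pairs.
    """
    timings = []
    for start in range(0, n_items, block_size):
        stop = min(start + block_size, n_items)
        t1 = time.perf_counter()
        for i in range(start, stop):
            fetch(i)
        timings.append((start, time.perf_counter() - t1))
    return timings


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without the RDKit installed.
    data = list(range(2500))
    for start, secs in time_blocks(lambda i: data[i] ** 2, len(data)):
        print(start, "%.2f" % secs)
```

Running it once on an idle machine and once under load should show
whether machine load alone explains the difference.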

Meanwhile, I need to go check on my macbook to figure out what
happened there; I guess I was using a debug build, because the macbook
is normally faster than my linux box.

-greg

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
