On Wed, May 4, 2011 at 04:26, Greg Landrum <greg.land...@gmail.com> wrote: > Dear JP, > > On Tue, May 3, 2011 at 6:29 PM, JP <jeanpaul.ebe...@inhibox.com> wrote: >> >> Based on an rdkit post I read over the warm weekend I set myself to >> have a look at my rdkit based queries (and ways to speed them up)... >> >> But first some details: >> >> Postgres: >> PostgreSQL 9.0.3 on x86_64-unknown-linux-gnu, compiled by GCC gcc >> (GCC) 4.5.2, 64-bit >> >> RDKit (DB Cartridge): >> v. 0.20.0 >> > > The other important question is how much memory you have and what > filesystem postgres is using for the database (local or network). > >> # of Molecules >> 8,432,896 >> >> Database table (ligand) >> Table "public.ligand" >> Column | Type | Modifiers >> ------------+-----------------------+----------------------------------------------------- >> id | integer | not null default >> nextval('ligand_id_seq'::regclass) >> supplierid | character varying(50) | >> smiles | text | >> rdkitmol | mol | >> pairbv | bfp | >> torsionbv | bfp | >> morganbv | bfp | >> amw | real | >> mollogp | real | >> hba | integer | >> hbd | integer | >> atoms | integer | >> hvyatoms | integer | >> Indexes: >> "ligand_pkey" PRIMARY KEY, btree (id) >> "idx_ligand_morganbv" gist (morganbv) >> "idx_ligand_pairbv" gist (pairbv) >> "idx_ligand_rdkitmol" gist (rdkitmol) >> "idx_ligand_torsionbv" gist (torsionbv) >> >> >> I cannot explain why the following queries: >> >> db=# select count(*) from ligand where rdkitmol@>'c1cccc2c1nncc2' ; >> count >> ------- >> 2942 >> (1 row) >> >> Time: 193973.253 ms >> db=# select count(*) from ligand where >> morganbv%morganbv_fp('c1cccc2c1nncc2',2); >> count >> ------- >> 8 >> (1 row) >> >> Time: 400138.989 ms > > The performance is critically dependent on whether or not the indices > are in memory. If you've just started the database or if there's been > a lot of non-postgres activity on the machine since the last time you > used it, it can take a long time to load the index from disk to > memory. Once it has been loaded, things should go faster. > > My emolecules database requires about 1 GB for each of the indices: > > emolecules=# \di+ molidx; > List of relations > Schema | Name | Type | Owner | Table | Size | Description > --------+--------+-------+----------+-------+---------+------------- > public | molidx | index | glandrum | mols | 1049 MB | > (1 row) > emolecules=# \di+ mfp2idx; > List of relations > Schema | Name | Type | Owner | Table | Size | Description > --------+---------+-------+----------+-------+--------+------------- > public | mfp2idx | index | glandrum | fps | 970 MB | > (1 row) > > >> >> Take so long... these are orders of magnitude larger than timings >> reported in http://code.google.com/p/rdkit/wiki/DatabaseCreation2 >> And my database in "only" roughly 50% larger (8M instead of the puny >> 5M in emolecules). >> >> When I do an "explain" on these queries (to make sure the indices are >> being used), I get: >> >> db=# explain select count(*) from ligand where rdkitmol@>'c1cccc2c1nncc2' ; >> QUERY PLAN >> ---------------------------------------------------------------------------------------------- >> Aggregate (cost=34850.44..34850.45 rows=1 width=0) >> -> Bitmap Heap Scan on ligand (cost=2667.36..34829.36 rows=8433 width=0) >> Recheck Cond: (rdkitmol @> 'c1cc2c(nncc2)cc1'::mol) >> -> Bitmap Index Scan on idx_ligand_rdkitmol >> (cost=0.00..2665.25 rows=8433 width=0) >> Index Cond: (rdkitmol @> 'c1cc2c(nncc2)cc1'::mol) >> (5 rows) >> >> db=# explain select count(*) from ligand where >> morganbv%morganbv_fp('c1cccc2c1nncc2',2); >> QUERY PLAN >> --------------------------------------------------------------------------------------------------------------------- >> Aggregate (cost=33290.88..33290.89 rows=1 width=0) >> -> Bitmap Heap Scan on ligand (cost=918.05..33269.79 rows=8433 width=0) >> Recheck Cond: (morganbv % >> '\\xe0ffffff00040000130000007e00108444d20e3c40042af90238d0080a0c3462c2'::bfp) >> -> Bitmap Index Scan on idx_ligand_morganbv >> (cost=0.00..915.94 rows=8433 width=0) >> Index Cond: (morganbv % >> '\\xe0ffffff00040000130000007e00108444d20e3c40042af90238d0080a0c3462c2'::bfp) >> (5 rows) >> >> Looks good no? >> Am I missing something? Or is this the fastest my search can go at? >> Supposedly the fingerprints search is just doing some ~8M binary >> operations no? Why does this take so long? >> Ideas, anyone? > > This all looks fine. > > The experiments to try are: > 1) do a second query and see if that's faster > 2) try a smaller table (start with 1 or 2 million) and see how that performs
3) Split the table vertically and create a separate table for the fingerprints 4) Use horizontal sharding and partition the fingerprints table Adrian > Best Regards, > -greg > > ------------------------------------------------------------------------------ > WhatsUp Gold - Download Free Network Management Software > The most intuitive, comprehensive, and cost-effective network > management toolset available today. Delivers lowest initial > acquisition cost and overall TCO of any competing solution. > http://p.sf.net/sfu/whatsupgold-sd > _______________________________________________ > Rdkit-discuss mailing list > Rdkit-discuss@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss > ------------------------------------------------------------------------------ WhatsUp Gold - Download Free Network Management Software The most intuitive, comprehensive, and cost-effective network management toolset available today. Delivers lowest initial acquisition cost and overall TCO of any competing solution. http://p.sf.net/sfu/whatsupgold-sd _______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss