Hi Jaan,
On Sun, Jan 1, 2017 at 8:58 AM, jaan gruber <jaan.gru...@yandex.com> wrote:
> Hi.
> I have been experimenting with rdkit and PG cartridge recently. Both are
> really cool.
>
Glad to hear it!
> However I have a question related to similarity functions like this
>
> """create or replace function get_mfp2_neighbors(smiles text)
>
> returns table(molregno integer, m mol, similarity double precision) as
> $$
> select
> molregno,m,tanimoto_sml(morganbv_fp(mol_from_smiles($1::cstring)),mfp2)
> as similarity
> from rdk.fps join rdk.mols using (molregno)
> where morganbv_fp(mol_from_smiles($1::cstring))%mfp2
> -- order by morganbv_fp(mol_from_smiles($1::cstring))<%>mfp2;
> $$ language sql stable ;
> """
>
> Is it possible to modify the function so it doesn't recalculate
> morganbv_fp(mol_from_smiles($1::cstring))?
>
Sure, but I believe that PostgreSQL is smart enough that it doesn't
actually do any recalculation.
Here's a version of the function that only has the explicit fingerprint
calculation once:
create or replace function get_mfp2_neighbors2(smiles text)
returns table(molregno integer, m mol, similarity double precision) as
$$
select molregno,m,tanimoto_sml(qfp,mfp2) as similarity
from (select molregno,mfp2,morganbv_fp(mol_from_smiles($1::cstring)) qfp
from rdk.fps) fps
join rdk.mols using (molregno)
where qfp%mfp2
order by qfp<%>mfp2;
$$ language sql stable ;
I think I like this the readability of this a bit better than the original
form, but the performance is the same:
chembl_21=# select get_mfp2_neighbors('Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)
c1ccc(C#N)cc1');
get_mfp2_neighbors
--------------------------------------------------------------------
(2,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(C#N)cc1",1)
(6,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccccc1",0.775510204081633)
(5,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(Cl)cc1",0.76)
(3 rows)
Time: 573.769 ms
chembl_21=# select get_mfp2_neighbors2('Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)
c1ccc(C#N)cc1');
get_mfp2_neighbors2
--------------------------------------------------------------------
(2,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(C#N)cc1",1)
(6,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccccc1",0.775510204081633)
(5,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(Cl)cc1",0.76)
(3 rows)
Time: 571.064 ms
> Or would it be possible that the function accept mol or even better -
> fingerprint?
>
Also possible:
create or replace function get_mfp2_neighbors3(qfp bfp)
returns table(molregno integer, m mol, similarity double precision) as
$$
select molregno,m,tanimoto_sml(qfp,mfp2) as similarity
from rdk.fps
join rdk.mols using (molregno)
where qfp%mfp2
order by qfp<%>mfp2;
$$ language sql stable ;
But, again, it doesn't make a difference in terms of performance:
chembl_21=# select get_mfp2_neighbors3(morganbv_
fp('Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(C#N)cc1'));
get_mfp2_neighbors3
--------------------------------------------------------------------
(2,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(C#N)cc1",1)
(6,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccccc1",0.775510204081633)
(5,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(Cl)cc1",0.76)
(3 rows)
Time: 564.451 ms
Or maybe ID of your "query fingerprint" in another table? I a plan to run
> large number of comparisons (tens of thousands) with several fingerprints,
> so I am looking for ways how to make it fastest possible.
>
PostgreSQL tends to be pretty smart about not doing un-necessary work, so I
believe that the original form is optimal in terms of performance, but you
can choose others if you find them more readable/useable.
Hopefully there's enough info here to allow you to do some experiments of
your own.
-greg
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss