Hi Jaan,

On Sun, Jan 1, 2017 at 8:58 AM, jaan gruber <jaan.gru...@yandex.com> wrote:

> Hi.
> I have been experimenting with rdkit and PG cartridge recently. Both are
> really cool.
>

Glad to hear it!


> However I have a question related to similarity functions like this
>
> """create or replace function get_mfp2_neighbors(smiles text)
>
>     returns table(molregno integer, m mol, similarity double precision) as
>   $$
>   select 
> molregno,m,tanimoto_sml(morganbv_fp(mol_from_smiles($1::cstring)),mfp2)
> as similarity
>   from rdk.fps join rdk.mols using (molregno)
>   where morganbv_fp(mol_from_smiles($1::cstring))%mfp2
>   -- order by morganbv_fp(mol_from_smiles($1::cstring))<%>mfp2;
>   $$ language sql stable ;
> """
>
> Is it possible to modify the function so it doesn't recalculate
> morganbv_fp(mol_from_smiles($1::cstring))?
>

Sure, but I believe that PostgreSQL is smart enough that it doesn't
actually do any recalculation.

Here's a version of the function that only has the explicit fingerprint
calculation once:

create or replace function get_mfp2_neighbors2(smiles text)
    returns table(molregno integer, m mol, similarity double precision) as
  $$
  select molregno,m,tanimoto_sml(qfp,mfp2) as similarity
  from (select molregno,mfp2,morganbv_fp(mol_from_smiles($1::cstring)) qfp
from rdk.fps) fps
  join rdk.mols using (molregno)
  where qfp%mfp2
  order by qfp<%>mfp2;
  $$ language sql stable ;


I think I like this the readability of this a bit better than the original
form, but the performance is the same:

chembl_21=# select get_mfp2_neighbors('Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)
c1ccc(C#N)cc1');
                         get_mfp2_neighbors
--------------------------------------------------------------------
 (2,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(C#N)cc1",1)
 (6,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccccc1",0.775510204081633)
 (5,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(Cl)cc1",0.76)
(3 rows)

Time: 573.769 ms
chembl_21=# select get_mfp2_neighbors2('Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)
c1ccc(C#N)cc1');
                        get_mfp2_neighbors2
--------------------------------------------------------------------
 (2,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(C#N)cc1",1)
 (6,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccccc1",0.775510204081633)
 (5,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(Cl)cc1",0.76)
(3 rows)

Time: 571.064 ms


> Or would it be possible that the function accept mol or even better -
> fingerprint?
>

Also possible:

create or replace function get_mfp2_neighbors3(qfp bfp)
      returns table(molregno integer, m mol, similarity double precision) as
    $$
    select molregno,m,tanimoto_sml(qfp,mfp2) as similarity
    from rdk.fps
    join rdk.mols using (molregno)
    where qfp%mfp2
    order by qfp<%>mfp2;
    $$ language sql stable ;


 But, again, it doesn't make a difference in terms of performance:

chembl_21=# select get_mfp2_neighbors3(morganbv_
fp('Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(C#N)cc1'));
                        get_mfp2_neighbors3
--------------------------------------------------------------------
 (2,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(C#N)cc1",1)
 (6,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccccc1",0.775510204081633)
 (5,"Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(Cl)cc1",0.76)
(3 rows)

Time: 564.451 ms

Or maybe ID of your "query fingerprint" in another table? I a plan to run
> large number of comparisons (tens of thousands) with several fingerprints,
> so I am looking for ways how to make it fastest possible.
>

PostgreSQL tends to be pretty smart about not doing un-necessary work, so I
believe that the original form is optimal in terms of performance, but you
can choose others if you find them more readable/useable.

Hopefully there's enough info here to allow you to do some experiments of
your own.

-greg
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most 
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to