Dear Greg,

In your note below you mention saving a molecule in a binary format. Do you 
mean a fingerprint? In that case you wouldn't be able to perform SMARTS 
matches, right? At most you could do approximate Tversky similarity 
calculations, and only if your SMARTS is a valid SMILES. 

Thanks,
Gonzalo

-----Original Message-----
From: Greg Landrum [mailto:[email protected]] 
Sent: 24 July 2012 16:56
To: Gonzalo Colmenarejo-Sanchez
Cc: [email protected]
Subject: Re: [Rdkit-discuss] speed of SMARTS matches calculations

On Tue, Jul 24, 2012 at 4:38 PM, Gonzalo Colmenarejo-Sanchez
<[email protected]> wrote:
>
> Sorry I can't share the SMILES and SMARTS, they are proprietary.

yeah, I kind of figured that would be the case. :-)

> If you can send me your structures I can test them with my program.

The scripts and data for the benchmarking are all in $RDBASE/Regress

> I build the molecules and queries inside a double loop; the actual code is 
> this:
>
> for (i = 0; i < numsmi; i++)
> {
>     mol = SmilesToMol(smiles[i].smiles);
>     numsims = 0;
>     fprintf(fpout, "%s,", smiles[i].smiles);
>     fprintf(stdout, "%d\n", i);
>     for (j = 0; j < numsma; j++)
>     {
>         pattern = SmartsToMol(smarts[j].smarts);
>         matchesfound = SubstructMatch(*mol, *pattern, matches, false, false);
>         if (matchesfound == true)
>         {
>             numsims = numsims + 1;
>             if (numsims == 1) fprintf(fpout, "%s\n", smarts[j].smarts);
>             else fprintf(fpout, "%s,%s\n", smiles[i].smiles, smarts[j].smarts);
>         }
>         delete pattern;
>     }
>     if (numsims == 0) fprintf(fpout, "\n");
>     delete mol;
> }
>
>
> The same double loop structure is used in the DL program. I could build all 
> the molecules and queries up front as you suggest, but I'm testing my 
> typical situation, which involves millions of molecules, and I'm not sure 
> that many molecules can be stored in memory.
>

The above is ok w.r.t. the molecules: each molecule is only
constructed once.[1] Your SMARTS queries are, on the other hand, being
constructed over and over again. You would probably see some speedup
by building the query molecules outside the molecule loop and just
using those inside the loop.

-greg
[1] Note: if you have a set of molecules you process over and over
again, there are some time-saving tricks for working with them. One is
to process them once and then save them in binary form, the other is
to process them once, output the RDKit canonical SMILES, and then
rebuild molecules from that using only partial sanitization.



_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss