Hi,

On Thu, Apr 25, 2013 at 4:44 PM, Markus Hartenfeller <
markus.hartenfel...@molecularhealth.com> wrote:

>  Ah OK, I see ;)
>
> This I don't know, sorry.
>
> I have implemented a de novo design tool based on rdkit that made heavy
> use of the smarts matching accessed via Python. From my experience the
> substructure matching will probably not be the performance bottleneck
> (depending on how sophisticated your scoring function is, of course).
>

I think Markus has it right: the substructure matching is unlikely to be
the bottleneck. As long as you construct the query molecules via
MolFromSmarts in advance (outside the loop), you should be fine.
If you do find that you're spending lots of time waiting for the
substructure matcher, you can try moving the whole PAINS bit into C++. This
wouldn't be overly complex and would make taking advantage of the
embarrassingly parallel nature of the problem (mentioned by Simon in
another message) easier.
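
To make the "in advance" point concrete, here is a minimal sketch in Python. The two SMARTS below are hypothetical stand-ins, not actual PAINS definitions, and the names FILTER_SMARTS, COMPILED, and flag_molecule are made up for illustration. The only RDKit calls assumed are Chem.MolFromSmarts, Chem.MolFromSmiles, and Mol.HasSubstructMatch.

```python
from rdkit import Chem

# Hypothetical filters standing in for the PAINS definitions,
# which are not given in this thread.
FILTER_SMARTS = {
    "nitro": "[N+](=O)[O-]",
    "aldehyde": "[CX3H1](=O)[#6]",
}

# Parse each SMARTS exactly once, outside any loop over molecules.
COMPILED = {name: Chem.MolFromSmarts(s) for name, s in FILTER_SMARTS.items()}


def flag_molecule(smiles):
    """Return the names of all filters that match, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # Only the substructure match itself happens per molecule.
    return [name for name, query in COMPILED.items()
            if mol.HasSubstructMatch(query)]


if __name__ == "__main__":
    for smi in ("CC=O", "c1ccccc1[N+](=O)[O-]", "CCO"):
        print(smi, flag_molecule(smi))
```

The thing to avoid is calling MolFromSmarts inside the per-molecule loop, since that reparses every query for every candidate structure.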

> Something you might want to consider regarding the performance of SMARTS
> queries is the SMARTS themselves. This is a quote from the daylight page
> http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html:
>
>
> Efficiency Considerations
> The Daylight 4.x SMARTS Toolkit provides a function, dt_smarts_opt(),
> which automatically optimizes a SMARTS by reordering, expanding, and/or
> consolidating atom and bond expressions. Programs which use this feature
> (e.g. the Merlin program) can be expected to be near optimal in terms of
> the time used to search typical organic structures.
> When this optimization method is not used, there are some things which can
> be done to facilitate efficient (fast) searching operations using SMARTS.
> It is important to recognize that SMARTS target strings are processed in
> strictly left-to-right order. For this reason, substantial gains in speed
> can be achieved by following these guidelines:
>
>
>    - Uncommon atoms or bond arrangements should be placed early in SMARTS
>    targets.
>
>    - In an "and-expression", the less common atom or bond specifications
>    should be placed early.
>
>    - In an "or-expression", the less common atom or bond specifications
>    should be placed last.
>
>
>
> I understand that the SMARTS you want to use have already been designed,
> and of course what is stated by daylight refers to their
> implementation. But (a) I think it applies to the rdkit implementation as
> well (please correct me if I'm wrong here, Greg) and (b) in case
> performance is really critical this could be another point to look at.
>

The form of the SMARTS definitely matters. Neither SmartsToMol nor the
substructure matcher makes any attempt to optimize queries (aside from
recognizing repeated recursive SMARTS expressions), so the advice above is
quite good.
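
If you want to check whether reordering helps for a particular query, one rough way is to write the same pattern both ways and time it. The sketch below is only an illustration: the two SMARTS, the three target molecules, and the helper names count_hits and time_query are made up, and the timings will depend on your molecules and RDKit version, so measure on your own data rather than relying on this.

```python
import timeit

from rdkit import Chem

# The same pattern written two ways: common atoms first vs. the rare atom first.
common_first = Chem.MolFromSmarts("[#6][#6][#6][Br]")
rare_first = Chem.MolFromSmarts("[Br][#6][#6][#6]")

# Made-up targets; only the first one contains bromine.
targets = [Chem.MolFromSmiles(s) for s in (
    "CCCCCCCCCCCCCCCCCCCCBr",
    "c1ccccc1CCCCCCCCCC",
    "CCCCCCCCCCCCCCCCCCCCCCCC",
)]


def count_hits(query):
    """Number of target molecules that contain the query."""
    return sum(1 for m in targets if m.HasSubstructMatch(query))


def time_query(query, number=200):
    """Wall-clock seconds for `number` passes over all targets."""
    return timeit.timeit(lambda: count_hits(query), number=number)


if __name__ == "__main__":
    # Both orderings must find the same molecules; only the speed may differ.
    assert count_hits(common_first) == count_hits(rare_first)
    print("common atoms first:", time_query(common_first))
    print("rare atom first:   ", time_query(rare_first))
```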

-greg
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss