Re: [Rdkit-devel] fmcs in RDKit

Andrew Dalke Sat, 01 Sep 2012 11:43:35 -0700

On Sep 1, 2012, at 6:00 PM, James Swetnam wrote:
> Sorry, hit send prematurely.
> 
> I'd define a separate function FindMCSWithTimeout for the timeout parameter 
> unless a majority of users will be concerned with this method's running time, 
> but I'm not familiar with the algorithm's details.

The MCS algorithm is one of those normally fast but
not infrequently exponential slow algorithms. In my various
tests, including simple pairwise tests of 10,000 randomly
chosen pairs of structures from ChEMBL 13, I found that:

     7 pairs took longer than 30 seconds,
  9977 took less than 2 seconds,
  9958 took less than 1 second,
  9420 took less than 0.1 seconds,
  4820 took less than 0.01 seconds.
  1513 took less than 0.001 seconds

As I recall, sometimes the search can take many hours.

My belief is that the slow runtimes occur often enough
that everyone will use a timeout. I also believe that early
on, people who use the API will test with structures where
the MCS is easy to find. Only later on will they get stuck
in some slow-downs (which is especially a problem on a
web server), and then have to diagnose it.

BTW, unlike, say, the CDK code, timeout failures do NOT
raise an exception. One of the advantages of my algorithm
over some of the other ones is that I can report the best
size to date. The returned MCSResult contains:
 - number of atoms in the MCS
 - number of bonds in the MCS
 - SMARTS pattern
 - a 'completed' flag, to know if MCS search completed
    (True) or stopped due to some other reason,
    like a timeout (False).

Indigo's scaffold finder uses a per-indigo-instance variable
to set the timeout. My testing showed that Indigo's code,
while faster than fmcs, still has timeout issues.

CDK's MCS code implements a timeout.

Small Molecule Subgraph Detector (SMSD) does not have a timeout.
I had to write my own wrapper around it to kill it when it took
too long.

OEChem's MCS code does not seem to have a timeout. They have
two different methods, one approximate and the other exact.
I don't have the experience to judge if there are problems
when using the exact method in practice.

The existence of timeouts in other codes suggests that
it's important, and it's hard to remove the need.

// This is OpenBabel like. It's orthogonal to the method itself.

I don't know what you mean by orthogonal. That is,
why does the run-time allocation any more orthogonal
than the binary options

        ring_matches_ring_only = False,
        complete_rings_only = False,

There are several ways to handle all of the parameters:

 - have a function which takes all of the parameters

 - create a search class, where some parameters can be set in
       the constructor.

 - and/or, which has methods methods like "SetTimeout(double seconds)"

 - create a "search parameters" class, which holds all of the
      values and is passed to the search function

 - create different classes, one for each parameterization

 - like Indigo, support multiple toolkit instance, where
     each instance can have its own settings

 - introduce global (or per-thread) variables

Of these, I thought that a single function call interface
was the easiest to understand.

Also, I lean towards the functional programming ideal of
not changing state, so don't like "SetTimeout" or the
approaches which set some sort of (semi-) global parameter.

> Agree with Greg's assertion to rename to FindMCS.

Yes, I'll be changing the name.

>         // Rename Default to something like MCSParams, and have a factory 
> function returning a DefaultMCSParams  I assume this is the command pattern?

It's not a command pattern. It's minimal data only.

=================
### A place to set global options
# (Is this really useful?)

class Default(object):
   timeout = None
   timeout_string = "none"
   maximize = "bonds"
   atom_compare = "elements"
   bond_compare = "bondtypes"
   match_valences = False
   ring_matches_ring_only = False
   complete_rings_only = False
==================

> // Logging should really be separated from API methods.  You should be
> using the python logging framework such that users
> // can modify these globally and not through the API

Then perhaps "logging" is the wrong name for it. "Progress" is
much better.

It's used in the command-line tool to show progress information.
When something takes 30 seconds to run, it's comforting to know
that the underlying code is doing something. This would also be
true for a GUI display. Ideally the API would support a progress
handler, along with an option to abort.

I think it's best to strip it out of the RDKit version of the code.
What's there is only for the command-line tool, and I don't have
the time/funding to develop a proper API.

BTW, I don't think Python's logging system is appropriate for this.
Actually, I don't like using logs for most things, but that's
a different topic.

>         verbose_delay=1.0,
> // Also remove from this function

Yes. See above.

This is an example of something which is hard to integrate with
the logging API. I could log every event, but that's useless.
Instead, I need to provide some sample. I could do it every
N events, but that's CPU- and load-specific. So I chose to use
a time delay.

I don't know how to configure logging to handle that case.

On Fri, Aug 31, 2012 at 8:37 PM, Greg Landrum <greg.land...@gmail.com> wrote:
> I would have no objection to the command-line driver being there in a
> separate file in the package; it could well be useful to people.

It could well be. However:

1) this for me is more of a contractual obligation, in that
my client asked me nicely (it's not actually in the contract)
to put it into RDKit. I don't have much free time/funding
to do this.

2) The command-line tool was designed for my client, and while
I think it's general enough, I haven't gotten any feedback
about it. The isotope-label-as-atom-classes hack is my biggest
concern here.

3) As you all can see, the API is designed specifically for
the command-line tool. Stripping out the currently named
'logging' code would mean also modifying the command-line
driver. That's more work. Especially since the work I would
like to do is to improve the algorithm.

4) I would also have to get a sense of what RDKit's command-line
parameter style looks like, so that the code fits in.

Hence, rather more work than I would like to take on
when so far neither Greg nor my clients have asked for it.

                                Andrew
                                da...@dalkescientific.com

------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
Rdkit-devel mailing list
Rdkit-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-devel

Re: [Rdkit-devel] fmcs in RDKit

Reply via email to