On 8/6/08, Noel O'Boyle <[email protected]> wrote:
> I'm getting some strange results when I compare the timing for using
> cinfony to access RDKit, compared to writing the equivalent C++
> program. Basically, for iteration over a very large SDF file, it's 50%
> slower if you use cinfony (which simply calls SDMolSupplier and
> 'yields' the results).

That doesn't sound right at all. I expect the Python stuff to be a bit
slower, but not dramatically so unless you are crossing the
Python<->C++ barrier frequently. An example of this slow case would be
similarity searching using X molecules against a collection of Y
molecules where Y (and possibly X) is large. If you were to call the
TanimotoSimilarity function X*Y times (i.e. once for each pair), that
would be pretty slow in Python relative to straight C++. If you use
the BulkTanimotoSimilarity function from Python (so that you only have
to cross over into C++ X times), the difference between Python and C++
shrinks.
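
To make the call-count argument concrete, here is a minimal pure-Python
sketch. The functions tanimoto and bulk_tanimoto are hypothetical
stand-ins that work on sets of on-bit indices; they are not the RDKit
functions, and the counter only mimics how often a wrapped call would
cross into C++.

```python
# Hypothetical stand-ins for wrapped C++ calls (not the RDKit API).
# Each call to either function counts as one Python<->C++ crossing.
crossings = {"pairwise": 0, "bulk": 0}


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on bits."""
    crossings["pairwise"] += 1
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def bulk_tanimoto(query, targets):
    """Similarity of one query against every target, in a single call."""
    crossings["bulk"] += 1
    result = []
    for t in targets:
        union = len(query | t)
        result.append(len(query & t) / union if union else 0.0)
    return result


# Toy data: X = 2 query fingerprints, Y = 3 target fingerprints.
queries = [{1, 2, 3}, {2, 3, 4}]
targets = [{1, 2, 3}, {3, 4, 5}, {7, 8}]

# One call per pair: X*Y crossings.
pairwise = [[tanimoto(q, t) for t in targets] for q in queries]

# One call per query: X crossings.
bulk = [bulk_tanimoto(q, targets) for q in queries]

print(crossings)
```

Both approaches produce the same similarity values; only the number of
calls differs (X*Y versus X), and that count is what drives the
per-call wrapper overhead.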

> I'm using an SDF file from ZINC that contains 25000 or so molecules
> (the first subset of the drug-like molecules) for testing. I have C++
> code like the following:
>
> void ReadFromFile(){
>        ROMol* mol;
>        SDMolSupplier molfile = SDMolSupplier("3_p0.0.sdf");
>        while (!molfile.atEnd()) {
>                mol = molfile.next();
>                if (mol) delete mol;
>        }
> }
>
> Greg, when you have time (no pun intended), I'd appreciate if you
> could compare the run time of this sort of C++ program versus the
> equivalent Python script using SDMolSupplier. I'd just like
> independent verification for the sort of figures I'm getting. The
> thing is, while some overhead is expected, the web says that SWIG
> should be worse than Boost.Python, but the SWIG overhead with
> OpenBabel is of the order of 4%, not 50%.

I wasn't as patient as you, so I only used 2K molecules in my test:

------------------------------------------------------------------------------------
-bash-3.00$ cat sample.py
import Chem
for m in Chem.SDMolSupplier('pubchem_hts.2k.sdf'):
  pass

-bash-3.00$ time python sample.py

real    0m4.097s
user    0m4.018s
sys     0m0.079s

-bash-3.00$ cat sample.cpp
#include <GraphMol/RDKitBase.h>
#include <GraphMol/FileParsers/MolSupplier.h>
#include <RDGeneral/RDLog.h>

using namespace RDKit;

int
main(int argc, char *argv[])
{
  RDLog::InitLogs();
  ROMol* mol;
  SDMolSupplier molfile("pubchem_hts.2k.sdf");
  while (!molfile.atEnd()) {
    mol = molfile.next();
    if (mol) delete mol;
  }
}

-bash-3.00$ time ./sample.exe

real    0m4.034s
user    0m4.003s
sys     0m0.031s
------------------------------------------------------------------------------------

As you can see, there's little difference: 4.10s for Python versus
4.03s for C++ (wall-clock), or under 2%.
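
Note that "time python sample.py" also includes interpreter startup and
the module import. If you want to time only the iteration, something
like the following sketch may help. The helper time_iteration is a
hypothetical name, and range(2000) is just a stand-in for whatever
supplier object you iterate over.

```python
import time


def time_iteration(iterable):
    """Return (item_count, elapsed_seconds) for one full pass over iterable."""
    start = time.perf_counter()
    count = 0
    for _ in iterable:
        count += 1
    return count, time.perf_counter() - start


# Stand-in for a molecule supplier; replace with the real iterable.
count, elapsed = time_iteration(range(2000))
print(f"{count} items in {elapsed:.6f} s")
```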

So, why are you seeing such a huge difference? Are you sure that
you're using the same C++ backend? That is, are the Python wrappers
you're using linked against the same C++ shared libraries as your
sample C++ program? The svn state of the code has some optimizations
relative to the May2008 release that might explain what you're seeing.
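
One quick way to start checking this is to ask Python which file it
would import a module from. The sketch below uses only the standard
library; module_origin is a hypothetical helper name, and "json" is
used purely as a placeholder module.

```python
import importlib.util


def module_origin(name):
    """Return the path a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


# Placeholder module; substitute the wrapper module you actually import.
print(module_origin("json"))
```

Substituting the name of the wrapper module (Chem in the sample script
above) should show which installation Python picks up. On Linux,
running ldd on that extension module and on the C++ executable then
lists the shared libraries each one resolves to.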

-greg
