2008/8/6 Greg Landrum <[email protected]>:
> On 8/6/08, Noel O'Boyle <[email protected]> wrote:
>> I'm getting some strange results when I compare the timing for using
>> cinfony to access RDKit, compared to writing the equivalent C++
>> program. Basically, for iteration over a very large SDF file, it's 50%
>> slower if you use cinfony (which simply calls SDMolSupplier and
>> 'yields' the results).
>
> That doesn't sound right at all. I expect the python stuff to be a bit
> slower, but not dramatically so unless you are crossing the
> python<->c++ barrier frequently. An example of this slow case would be
> similarity searching using X molecules against a collection of Y
> molecules where Y (and possibly X) is large. If you were to call the
> TanimotoSimilarity function X*Y times (e.g. once for each pair), that
> would be pretty slow in python relative to straight C++. If you use
> the BulkTanimotoSimilarity function from Python (so that you only have
> to cross over into C++ X times), the difference between Python and C++
> shrinks.
>
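
To make the per-pair versus bulk point concrete, here is a rough sketch. It uses pure-Python stand-ins rather than the real RDKit calls: the names tanimoto and bulk_tanimoto, the integer bitset fingerprints and the call counter are all assumptions made for illustration. The only thing it shows is how many times each approach would cross into the library for X queries against Y targets.

```python
# Stand-ins for illustration only; these are NOT RDKit's functions.
# Fingerprints are plain Python ints used as bitsets.
calls = {"pairwise": 0, "bulk": 0}


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Per-pair similarity: one simulated library call per (query, target)."""
    calls["pairwise"] += 1
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def bulk_tanimoto(query, targets):
    """Bulk similarity: one simulated library call per query.

    The loop over the targets happens inside the call, which is where a
    compiled implementation would save the repeated crossing overhead.
    """
    calls["bulk"] += 1
    q_bits = popcount(query)
    out = []
    for t in targets:
        common = popcount(query & t)
        union = q_bits + popcount(t) - common
        out.append(common / union if union else 0.0)
    return out


queries = [0b1011, 0b0110]          # X = 2 hypothetical query fingerprints
library = [0b1111, 0b0011, 0b1000]  # Y = 3 hypothetical library fingerprints

pairwise = [[tanimoto(q, t) for t in library] for q in queries]
bulk = [bulk_tanimoto(q, library) for q in queries]

print(calls)
```

Both approaches give the same similarity values; the per-pair version makes X*Y calls and the bulk version makes X. In this pure-Python stand-in there is no speed difference to measure; the saving only appears when the inner loop runs in compiled code.
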
>> I'm using an SDF file from ZINC that contains 25000 or so molecules
>> (the first subset of the drug-like molecules) for testing. I have C++
>> code like the following:
>>
>> void ReadFromFile(){
>>        ROMol* mol;
>>        SDMolSupplier molfile = SDMolSupplier("3_p0.0.sdf");
>>        while (!molfile.atEnd()) {
>>                mol = molfile.next();
>>                if (mol) delete mol;
>>        }
>> }
>>
>> Greg, when you have time (no pun intended), I'd appreciate if you
>> could compare the run time of this sort of C++ program versus the
>> equivalent Python script using SDMolSupplier. I'd just like
>> independent verification for the sort of figures I'm getting. The
>> thing is, while some overhead is expected, the web says that SWIG
>> should be worse than Boost.Python, but the SWIG overhead with
>> OpenBabel is of the order of 4%, not 50%.
>
> I wasn't as patient as you, so I only used 2K molecules in my test:
>
> ------------------------------------------------------------------------------------
> -bash-3.00$ cat sample.py
> import Chem
> for m in Chem.SDMolSupplier('pubchem_hts.2k.sdf'):
>  pass
>
> -bash-3.00$ time python sample.py
>
> real    0m4.097s
> user    0m4.018s
> sys     0m0.079s
>
> -bash-3.00$ cat sample.cpp
> #include <GraphMol/RDKitBase.h>
> #include <GraphMol/FileParsers/MolSupplier.h>
> #include <RDGeneral/RDLog.h>
>
> using namespace RDKit;
>
> int
> main(int argc, char *argv[])
> {
>  RDLog::InitLogs();
>  ROMol* mol;
>  SDMolSupplier molfile("pubchem_hts.2k.sdf");
>  while (!molfile.atEnd()) {
>    mol = molfile.next();
>    if (mol) delete mol;
>  }
> }
>
> -bash-3.00$ time ./sample.exe
>
> real    0m4.034s
> user    0m4.003s
> sys     0m0.031s
> ------------------------------------------------------------------------------------
>
> As you can see, there's little difference.
>
> So, why are you seeing such a huge difference? Are you sure that
> you're using the same C++ backend? i.e. Are the python wrappers you're
> using linked against the same C++ shared libraries as your sample C++
> program? The svn state of the code has some optimizations relative to
> the May2008 release that might explain what you're seeing.
>
> -greg
>


OK - there's some problem at my end. I'll track down the usual
suspects. Thanks for checking this...
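
A first check along the lines Greg suggests is to confirm which installation Python is actually importing. The sketch below is only a starting point: module_origin is a helper name I made up, and "json" is used purely so that the example runs anywhere. For the real check I'd pass the name of the wrapper module instead and then inspect the reported file by hand (e.g. with ldd on Linux) to see which shared libraries it links against.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None.

    For a compiled extension module the origin is the path of the shared
    object itself, which can then be inspected with platform tools.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


# "json" is a stand-in so the example is runnable without any extra packages.
print(module_origin("json"))
# A module that is not installed is reported as None rather than raising.
print(module_origin("no_such_module_xyz"))
```
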
