Re: [Rdkit-discuss] Timing of C++ versus Boost.Python

Noel O'Boyle Wed, 06 Aug 2008 07:15:36 -0700

Looks like I forgot to correct the PYTHONPATH. My figures are now in
agreement with yours...
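
A quick way to catch this kind of mix-up is to ask Python which file it would actually load for a module before timing anything. The sketch below uses only the standard library. The helper name `module_origin` is hypothetical, and the `json` lookup is just a stand-in so the snippet runs anywhere; in practice you would pass the name of the wrapper module you import (e.g. `Chem`, as in the sample script quoted below).

```python
import importlib.util
import sys


def module_origin(name):
    """Return the file Python would load for top-level module `name`,
    or None if it is not found on the current search path."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


# Stand-in lookup using a standard-library module.
print(module_origin("json"))

# The search path that PYTHONPATH contributes to, in lookup order.
for entry in sys.path:
    print("  ", entry)
```

If the reported path points at a different installation than the one the C++ program links against, the two timings are not measuring the same backend.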


2008/8/6 Noel O'Boyle <[email protected]>:
> 2008/8/6 Greg Landrum <[email protected]>:
>> On 8/6/08, Noel O'Boyle <[email protected]> wrote:
>>> I'm getting some strange results when I compare the timing for using
>>> cinfony to access RDKit, compared to writing the equivalent C++
>>> program. Basically, for iteration over a very large SDF file, it's 50%
>>> slower if you use cinfony (which simply calls SDMolSupplier and
>>> 'yields' the results).
>>
>> That doesn't sound right at all. I expect the python stuff to be a bit
>> slower, but not dramatically so unless you are crossing the
>> python<->c++ barrier frequently. An example of this slow case would be
>> similarity searching using X molecules against a collection of Y
>> molecules where Y (and possibly X) is large. If you were to call the
>> TanimotoSimilarity function X*Y times (e.g. once for each pair), that
>> would be pretty slow in python relative to straight C++. If you use
>> the BulkTanimotoSimilarity function from Python (so that you only have
>> to cross over into C++ X times), the difference between Python and C++
>> shrinks.
>>
>>> I'm using an SDF file from ZINC that contains 25000 or so molecules
>>> (the first subset of the drug-like molecules) for testing. I have C++
>>> code like the following:
>>>
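>>> void ReadFromFile(){
>>>        ROMol* mol;
>>>        // Construct the supplier in place rather than copying a temporary.
>>>        SDMolSupplier molfile("3_p0.0.sdf");
>>>        while (!molfile.atEnd()) {
>>>                // The caller owns each returned molecule.
>>>                mol = molfile.next();
>>>                if (mol) delete mol;
>>>        }
>>> }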
>>>
>>> Greg, when you have time (no pun intended), I'd appreciate it if you
>>> could compare the run time of this sort of C++ program versus the
>>> equivalent Python script using SDMolSupplier. I'd just like
>>> independent verification for the sort of figures I'm getting. The
>>> thing is, while some overhead is expected, the web says that SWIG
>>> should be worse than Boost.Python, but the SWIG overhead with
>>> OpenBabel is of the order of 4%, not 50%.
>>
>> I wasn't as patient as you, so I only used 2K molecules in my test:
>>
>> ------------------------------------------------------------------------------------
>> -bash-3.00$ cat sample.py
>> import Chem
>> for m in Chem.SDMolSupplier('pubchem_hts.2k.sdf'):
>>  pass
>>
>> -bash-3.00$ time python sample.py
>>
>> real    0m4.097s
>> user    0m4.018s
>> sys     0m0.079s
>>
>> -bash-3.00$ cat sample.cpp
>> #include <GraphMol/RDKitBase.h>
>> #include <GraphMol/FileParsers/MolSupplier.h>
>> #include <RDGeneral/RDLog.h>
>>
>> using namespace RDKit;
>>
>> int
>> main(int argc, char *argv[])
>> {
>>  RDLog::InitLogs();
>>  ROMol* mol;
>>  SDMolSupplier molfile("pubchem_hts.2k.sdf");
>>  while (!molfile.atEnd()) {
>>    mol = molfile.next();
>>    if (mol) delete mol;
>>  }
>> }
>>
>> -bash-3.00$ time ./sample.exe
>>
>> real    0m4.034s
>> user    0m4.003s
>> sys     0m0.031s
>> ------------------------------------------------------------------------------------
>>
>> As you can see, there's little difference.
>>
>> So, why are you seeing such a huge difference? Are you sure that
>> you're using the same C++ backend? i.e. Are the python wrappers you're
>> using linked against the same C++ shared libraries as your sample C++
>> program? The svn state of the code has some optimizations relative to
>> the May2008 release that might explain what you're seeing.
>>
>> -greg
>>
>
>
> OK - there's some problem at my end. I'll track down the usual
> suspects. Thanks for checking this...
>
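
Greg's point about calling TanimotoSimilarity once per pair versus BulkTanimotoSimilarity once per query comes down to the shape of the calls. The sketch below is a pure-Python stand-in, not RDKit code: fingerprints are plain ints used as bit sets, and `tanimoto` and `bulk_tanimoto` are hypothetical helpers. In pure Python both forms cost about the same; the saving only appears when the bulk function's inner loop runs in C++, so that the language boundary is crossed X times instead of X*Y times.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two bit sets stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def bulk_tanimoto(query, targets):
    """Score one query against every target in a single call."""
    return [tanimoto(query, t) for t in targets]


queries = [0b1011, 0b0110]           # X = 2 hypothetical fingerprints
library = [0b1011, 0b0011, 0b1100]   # Y = 3 hypothetical fingerprints

# Pairwise: X*Y separate calls (6 here), one per (query, target) pair.
pairwise = [[tanimoto(q, t) for t in library] for q in queries]

# Bulk: X calls (2 here), each returning Y scores.
bulk = [bulk_tanimoto(q, library) for q in queries]

# Both shapes give the same numbers; only the number of calls differs.
assert pairwise == bulk
```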
