2008/8/6 Greg Landrum <[email protected]>:
> On 8/6/08, Noel O'Boyle <[email protected]> wrote:
>> I'm getting some strange results when I compare the timing for using
>> cinfony to access RDKit, compared to writing the equivalent C++
>> program. Basically, for iteration over a very large SDF file, it's 50%
>> slower if you use cinfony (which simply calls SDMolSupplier and
>> 'yields' the results).
>
> That doesn't sound right at all. I expect the python stuff to be a bit
> slower, but not dramatically so unless you are crossing the
> python<->c++ barrier frequently. An example of this slow case would be
> similarity searching using X molecules against a collection of Y
> molecules where Y (and possible X) is large. If you were to call the
> TanimotoSimilarity function X*Y times (e.g. once for each pair), that
> would be pretty slow in python relative to straight C++. If you use
> the BulkTanimotoSimilarity function from Python (so that you only have
> to cross over into C++ X times), the difference between Python and C++
> shrinks.
>
>> I'm using an SDF file from ZINC that contains 25000 or so molecules
>> (the first subset of the drug-like molecules) for testing. I have C++
>> code like the following:
>>
>> void ReadFromFile(){
>>   ROMol* mol;
>>   SDMolSupplier molfile = SDMolSupplier("3_p0.0.sdf");
>>   while (!molfile.atEnd()) {
>>     mol = molfile.next();
>>     if (mol) delete mol;
>>   }
>> }
>>
>> Greg, when you have time (no pun intended), I'd appreciate if you
>> could compare the run time of this sort of C++ program versus the
>> equivalent Python script using SDMolSupplier. I'd just like
>> independent verification for the sort of figures I'm getting. The
>> thing is, while some overhead is expected, the web says that SWIG
>> should be worse than Boost.Python, but the SWIG overhead with
>> OpenBabel is of the order of 4%, not 50%.
>
> I wasn't as patient as you, so I only used 2K molecules in my test:
>
> ------------------------------------------------------------------------------------
> -bash-3.00$ cat sample.py
> import Chem
> for m in Chem.SDMolSupplier('pubchem_hts.2k.sdf'):
>   pass
>
> -bash-3.00$ time python sample.py
>
> real 0m4.097s
> user 0m4.018s
> sys 0m0.079s
>
> -bash-3.00$ cat sample.cpp
> #include <GraphMol/RDKitBase.h>
> #include <GraphMol/FileParsers/MolSupplier.h>
> #include <RDGeneral/RDLog.h>
>
> using namespace RDKit;
>
> int
> main(int argc, char *argv[])
> {
>   RDLog::InitLogs();
>   ROMol* mol;
>   SDMolSupplier molfile("pubchem_hts.2k.sdf");
>   while (!molfile.atEnd()) {
>     mol = molfile.next();
>     if (mol) delete mol;
>   }
> }
>
> -bash-3.00$ time ./sample.exe
>
> real 0m4.034s
> user 0m4.003s
> sys 0m0.031s
> ------------------------------------------------------------------------------------
>
> As you can see, there's little difference.
>
> So, why are you seeing such a huge difference? Are you sure that
> you're using the same C++ backend? i.e. Are the python wrappers you're
> using linked against the same C++ shared libraries as your sample C++
> program? The svn state of the code has some optimizations relative to
> the May2008 release that might explain what you're seeing.
>
> -greg
>
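For anyone following the thread, the call-count point in the quoted message (X*Y calls versus X calls) can be sketched in plain Python. This only illustrates the calling pattern and is not RDKit's implementation: fingerprints are modelled as Python integers used as bit sets, and `tanimoto` and `bulk_tanimoto` are hypothetical stand-ins for TanimotoSimilarity and BulkTanimotoSimilarity.

```python
def popcount(bits):
    """Number of set bits in a non-negative integer."""
    return bin(bits).count("1")


def tanimoto(a, b):
    """Tanimoto similarity of two bit sets: |a & b| / |a | b|."""
    union = popcount(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return popcount(a & b) / union


def bulk_tanimoto(query, targets):
    """Compare one query against every target in a single call."""
    return [tanimoto(query, t) for t in targets]


# Toy fingerprints (assumed data, for illustration only).
queries = [0b1011, 0b0110]
targets = [0b1011, 0b0011, 0b1100]

# Pairwise pattern: len(queries) * len(targets) calls made by the caller.
pairwise = [[tanimoto(q, t) for t in targets] for q in queries]

# Bulk pattern: only len(queries) calls made by the caller.
bulk = [bulk_tanimoto(q, targets) for q in queries]

# Both patterns give identical numbers; only the number of calls differs.
assert pairwise == bulk
```

With a compiled library behind these functions the results are the same either way; what changes is how many times the interpreter has to cross into C++.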
OK - there's some problem at my end. I'll track down the usual suspects. Thanks for checking this...
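
One of the usual suspects is easy to check from the Python side: which file a module is actually loaded from. A minimal sketch using only the standard library follows. `module_origin` is a hypothetical helper name, and `json` is used purely as a stand-in for whichever wrapper module is in question.

```python
import importlib.util


def module_origin(name):
    """Return the path a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


# Stand-in module name; substitute the wrapper module being investigated.
print(module_origin("json"))

# A name that cannot be found yields None rather than raising.
print(module_origin("no_such_module_xyz"))
```

The path shows which installation is being imported. It does not by itself show which C++ shared libraries an extension module was linked against; that needs a platform tool such as ldd on Linux.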

