Looks like I forgot to correct the PYTHONPATH. My figures are now in agreement with yours...
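
For anyone else chasing a similar discrepancy, here is a minimal sketch of how one might check which copy of a module Python actually picks up from PYTHONPATH. The helper name module_origin is made up for illustration and is not part of RDKit or cinfony; it uses only the standard library, and "Chem" is just the module name from the sample script quoted below.

```python
import importlib.util
import sys


def module_origin(name):
    """Return the file a module would be imported from, or None if it is not found.

    Hypothetical helper for illustration only; not part of RDKit or cinfony.
    """
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # A missing parent package or a malformed entry in sys.modules ends up here.
        return None
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    # Default to "Chem", the module imported in the quoted sample.py (an assumption).
    for name in sys.argv[1:] or ["Chem"]:
        print(name, "->", module_origin(name))
```

If the path printed is not inside the build you expect, the Python wrappers are probably coming from a different install than the one the C++ program links against.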

2008/8/6 Noel O'Boyle <[email protected]>:
> 2008/8/6 Greg Landrum <[email protected]>:
>> On 8/6/08, Noel O'Boyle <[email protected]> wrote:
>>> I'm getting some strange results when I compare the timing for using
>>> cinfony to access RDKit, compared to writing the equivalent C++
>>> program. Basically, for iteration over a very large SDF file, it's 50%
>>> slower if you use cinfony (which simply calls SDMolSupplier and
>>> 'yields' the results).
>>
>> That doesn't sound right at all. I expect the python stuff to be a bit
>> slower, but not dramatically so unless you are crossing the
>> python<->c++ barrier frequently. An example of this slow case would be
>> similarity searching using X molecules against a collection of Y
>> molecules where Y (and possible X) is large. If you were to call the
>> TanimotoSimilarity function X*Y times (e.g. once for each pair), that
>> would be pretty slow in python relative to straight C++. If you use
>> the BulkTanimotoSimilarity function from Python (so that you only have
>> to cross over into C++ X times), the difference between Python and C++
>> shrinks.
>>
>>> I'm using an SDF file from ZINC that contains 25000 or so molecules
>>> (the first subset of the drug-like molecules) for testing. I have C++
>>> code like the following:
>>>
>>> void ReadFromFile(){
>>>   ROMol* mol;
>>>   SDMolSupplier molfile = SDMolSupplier("3_p0.0.sdf");
>>>   while (!molfile.atEnd()) {
>>>     mol = molfile.next();
>>>     if (mol) delete mol;
>>>   }
>>> }
>>>
>>> Greg, when you have time (no pun intended), I'd appreciate if you
>>> could compare the run time of this sort of C++ program versus the
>>> equivalent Python script using SDMolSupplier. I'd just like
>>> independent verification for the sort of figures I'm getting. The
>>> thing is, while some overhead is expected, the web says that SWIG
>>> should be worse than Boost.Python, but the SWIG overhead with
>>> OpenBabel is of the order of 4%, not 50%.
>>
>> I wasn't as patient as you, so I only used 2K molecules in my test:
>>
>> ------------------------------------------------------------------------------------
>> -bash-3.00$ cat sample.py
>> import Chem
>> for m in Chem.SDMolSupplier('pubchem_hts.2k.sdf'):
>>     pass
>>
>> -bash-3.00$ time python sample.py
>>
>> real    0m4.097s
>> user    0m4.018s
>> sys     0m0.079s
>>
>> -bash-3.00$ cat sample.cpp
>> #include <GraphMol/RDKitBase.h>
>> #include <GraphMol/FileParsers/MolSupplier.h>
>> #include <RDGeneral/RDLog.h>
>>
>> using namespace RDKit;
>>
>> int
>> main(int argc, char *argv[])
>> {
>>   RDLog::InitLogs();
>>   ROMol* mol;
>>   SDMolSupplier molfile("pubchem_hts.2k.sdf");
>>   while (!molfile.atEnd()) {
>>     mol = molfile.next();
>>     if (mol) delete mol;
>>   }
>> }
>>
>> -bash-3.00$ time ./sample.exe
>>
>> real    0m4.034s
>> user    0m4.003s
>> sys     0m0.031s
>> ------------------------------------------------------------------------------------
>>
>> As you can see, there's little difference.
>>
>> So, why are you seeing such a huge difference? Are you sure that
>> you're using the same C++ backend? i.e. Are the python wrappers you're
>> using linked against the same C++ shared libraries as your sample C++
>> program? The svn state of the code has some optimizations relative to
>> the May2008 release that might explain what you're seeing.
>>
>> -greg
>>
>
>
> OK - there's some problem at my end. I'll track down the usual
> suspects. Thanks for checking this...
>

