I wonder if there is a way to make use of PyTorch or TensorFlow to do this on a GPU. That’s where some big speed-ups might be found.
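To make the GPU idea a little more concrete: the all-pairs Tanimoto calculation can be written as a matrix product, which is the kind of operation GPU libraries are built for. With fingerprints as 0/1 rows, the dot product of two rows is the number of shared bits c, and the similarity is c / (a + b - c), where a and b are the bit counts of each row. The sketch below uses plain Python lists so that it runs without dependencies. The function name `tanimoto_matrix` is made up for illustration, and the mapping onto a single PyTorch or TensorFlow matmul is an assumption, not something tested here.

```python
def tanimoto_matrix(X, Y):
    """All-pairs Tanimoto between two sets of fingerprints.

    X and Y are lists of equal-length rows of 0/1 values (hypothetical
    layout for this sketch; RDKit bit vectors would need converting).
    """
    x_counts = [sum(row) for row in X]  # bits set in each row of X
    y_counts = [sum(row) for row in Y]  # bits set in each row of Y
    result = []
    for x_row, a in zip(X, x_counts):
        out_row = []
        for y_row, b in zip(Y, y_counts):
            # Dot product of two 0/1 rows = number of bits in common.
            c = sum(p * q for p, q in zip(x_row, y_row))
            union = a + b - c
            # Convention for this sketch: two empty fingerprints score 0.0.
            out_row.append(c / union if union else 0.0)
        result.append(out_row)
    return result


if __name__ == "__main__":
    X = [[1, 1, 0, 0], [1, 0, 1, 0]]
    Y = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 1, 1]]
    for row in tanimoto_matrix(X, Y):
        print(row)
```

On a GPU the two nested loops would presumably collapse into one matrix multiplication plus an element-wise division, at the cost of unpacking the bit vectors into a dense array first.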
Also, consider using these bounds. They do make a big difference in many cases.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2527184/

On Tue, Oct 25, 2022 at 8:57 PM Francois Berenger <mli...@ligand.eu> wrote:

> On 24/10/2022 19:47, David Cosgrove wrote:
> > For the record, I have attempted this, but got only a marginal
> > speed-up (130% of CPU used, with any number of threads above 2). The
> > procedure I used was to extract the fingerprint pointers into a
> > std::vector, create a std::vector for the results, unlock the GIL to
> > do the bulk tanimoto calculation, then re-lock the GIL to copy the
> > results from the std::vector into the python::list for output. I guess
> > the extra overhead to create and populate the additional std::vectors
> > destroyed any potential speedup. This was on a vector of 200K
> > fingerprints, which suggests that the Tanimoto calculation is a small
> > part of the overall time. It doesn't seem worth pursuing further.
>
> There is probably code on github doing this in parallel already.
> Think about it: any clustering algorithm using a distance matrix.
> I guess many people want to initialize the Gram matrix in parallel.
>
> I wouldn't be surprised if, for example, chemfp has such code.
>
> > Dave
> >
> > On Sat, Oct 22, 2022 at 11:28 AM David Cosgrove
> > <davidacosgrov...@gmail.com> wrote:
> >
> >> Hi Greg,
> >> Thanks for the pointer. I’ll take a look. If it could go in the
> >> next patch release that would be really useful.
> >> Dave
> >>
> >> On Sat, 22 Oct 2022 at 10:52, Greg Landrum <greg.land...@gmail.com>
> >> wrote:
> >>
> >> Hi Dave,
> >>
> >> We have multiple examples of this in the code, here’s one:
> >>
> >> https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/ForceFieldHelpers/Wrap/rdForceFields.cpp#L40
> >>
> >> I’m not sure how this would interact with the call to
> >> Python::extract that’s in the bulk functions though
> >>
> >> It might be better to handle the multithreading on the C++ side by
> >> adding an optional nThreads argument to the bulk similarity
> >> functions. (Though this would have to wait for the next release
> >> since it’s a feature addition… we can declare releasing the GIL
> >> as a bug fix)
> >>
> >> -greg
> >>
> >> On Sat, 22 Oct 2022 at 09:48, David Cosgrove
> >> <davidacosgrov...@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> I'm doing a lot of tanimoto similarity calculations on large
> >> datasets using BulkTanimotoSimilarity. It is an obvious candidate
> >> for parallelisation, so I am using concurrent.futures to do so. If
> >> I use ProcessPoolExecutor, I get good speed-up but each process
> >> needs a copy of the fingerprint set and for the sizes I'm dealing
> >> with that uses too much memory. With ThreadPoolExecutor I only need
> >> 1 copy of the fingerprints, but the GIL means it only runs on 1
> >> thread at a time so there's no gain. Would it be possible to amend
> >> the C++ BulkTanimotoSimilarity to free the GIL whilst it's doing the
> >> calculation, and recapture it afterwards? I understand things like
> >> numpy do this for some of their functions. I'm happy to attempt it
> >> myself if someone who knows about these things can advise that it
> >> could be done, it would help, and they could provide a few pointers.
> >>
> >> Thanks,
> >> Dave
> >>
> >> --
> >> David Cosgrove
> >> Freelance computational chemistry and chemoinformatics developer
> >> http://cozchemix.co.uk
> >>
> >> _______________________________________________
> >> Rdkit-discuss mailing list
> >> Rdkit-discuss@lists.sourceforge.net
> >> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
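On the bounds mentioned at the top of this message, here is a minimal sketch of how a bound of that kind can be used. It assumes the simplest bit-count bound: for fingerprints with a and b bits set, the Tanimoto similarity cannot exceed min(a, b) / max(a, b), so any fingerprint whose bound falls below the search threshold can be skipped without computing the intersection. Fingerprints are represented as Python ints purely for illustration, and all function names are hypothetical rather than RDKit API.

```python
def popcount(x):
    """Number of set bits in a non-negative int."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Exact Tanimoto similarity of two fingerprints stored as ints."""
    union = popcount(a | b)
    # Convention for this sketch: two empty fingerprints score 0.0.
    return popcount(a & b) / union if union else 0.0


def upper_bound(na, nb):
    """Upper bound on Tanimoto given only the two bit counts."""
    if na == 0 or nb == 0:
        return 0.0
    return min(na, nb) / max(na, nb)


def search_with_bound(query, fps, threshold):
    """Return (hits, skipped): hits are (index, similarity) pairs with
    similarity >= threshold; skipped counts fingerprints pruned by the bound."""
    nq = popcount(query)
    hits = []
    skipped = 0
    for idx, fp in enumerate(fps):
        # In a real search the bit counts would be precomputed once.
        if upper_bound(nq, popcount(fp)) < threshold:
            skipped += 1
            continue
        sim = tanimoto(query, fp)
        if sim >= threshold:
            hits.append((idx, sim))
    return hits, skipped


if __name__ == "__main__":
    query = 0b11110000
    fps = [0b11110000, 0b11100000, 0b00000001, 0b11111111, 0xFFFF]
    print(search_with_bound(query, fps, 0.7))
```

If the fingerprints are sorted by bit count, the same bound lets a search restrict itself to a contiguous range of the list instead of testing each entry. How much this saves depends on the threshold and on the spread of bit counts in the dataset.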