Re: [Rdkit-discuss] GIL Lock in BulkTanimotoSimilarity

Francois Berenger Tue, 25 Oct 2022 18:57:01 -0700

On 24/10/2022 19:47, David Cosgrove wrote:

For the record, I have attempted this, but got only a marginal
speed-up (130% of CPU used, with any number of threads above 2).  The
procedure I used was to extract the fingerprint pointers into a
std::vector, create a std::vector for the results, unlock the GIL to
do the bulk Tanimoto calculation, then re-lock the GIL to copy the
results from the std::vector into the python::list for output.  I guess
the extra overhead to create and populate the additional std::vectors
destroyed any potential speedup.  This was on a vector of 200K
fingerprints, which suggests that the Tanimoto calculation is a small
part of the overall time.  It doesn't seem worth pursuing further.


There is probably code on github doing this in parallel already.
Think about it: any clustering algorithm using a distance matrix.
I guess many people want to initialize the Gram matrix in parallel.

I wouldn't be surprised if, for example, chemfp has such code.
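
The parallel initialisation of a similarity (Gram) matrix mentioned above could look roughly like the sketch below. It is an illustration under assumptions, not code taken from chemfp or RDKit: fingerprints are modelled as plain Python integers, and `tanimoto` and `similarity_matrix` are hypothetical names. Because the kernel here is pure Python, the threads will not actually run concurrently; a real speed-up would need a kernel that releases the GIL, or worker processes.

```python
from concurrent.futures import ThreadPoolExecutor


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def similarity_matrix(fps, n_workers=4):
    """Fill a symmetric similarity matrix, one task per row (hypothetical helper)."""
    n = len(fps)
    matrix = [[0.0] * n for _ in range(n)]

    def fill_row(i):
        # Compute only the upper triangle (j >= i) of row i.
        return i, [tanimoto(fps[i], fps[j]) for j in range(i, n)]

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for i, row in pool.map(fill_row, range(n)):
            for offset, value in enumerate(row):
                j = i + offset
                matrix[i][j] = value
                matrix[j][i] = value  # mirror into the lower triangle
    return matrix
```

Only the main thread writes into the matrix, so no locking is needed; the workers just return their row segments.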

Dave

On Sat, Oct 22, 2022 at 11:28 AM David Cosgrove
<davidacosgrov...@gmail.com> wrote:

Hi Greg,
Thanks for the pointer. I’ll take a look. If it could go in the
next patch release that would be really useful.
Dave

On Sat, 22 Oct 2022 at 10:52, Greg Landrum <greg.land...@gmail.com>
wrote:

Hi Dave,

We have multiple examples of this in the code, here’s one:

https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/ForceFieldHelpers/Wrap/rdForceFields.cpp#L40


I’m not sure how this would interact with the call to
python::extract that’s in the bulk functions, though.

It might be better to handle the multithreading on the C++ side by
adding an optional nThreads argument to the bulk similarity
functions. (Though this would have to wait for the next release
since it’s a feature addition… we can declare releasing the GIL
as a bug fix)

-greg

On Sat, 22 Oct 2022 at 09:48, David Cosgrove
<davidacosgrov...@gmail.com> wrote:

Hi,

I'm doing a lot of Tanimoto similarity calculations on large
datasets using BulkTanimotoSimilarity.  It is an obvious candidate
for parallelisation, so I am using concurrent.futures to do so.  If
I use ProcessPoolExecutor, I get good speed-up but each process
needs a copy of the fingerprint set and for the sizes I'm dealing
with that uses too much memory.  With ThreadPoolExecutor I only need
1 copy of the fingerprints, but the GIL means it only runs on 1
thread at a time so there's no gain.  Would it be possible to amend
the C++ BulkTanimotoSimilarity to free the GIL whilst it's doing the
calculation, and recapture it afterwards?  I understand things like
numpy do this for some of their functions.  I'm happy to attempt it
myself if someone who knows about these things can advise that it
could be done, it would help, and they could provide a few pointers.
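
The thread-pool setup described above can be sketched roughly as follows. This is only an illustration under assumptions: fingerprints are modelled as plain Python integers, and `bulk_tanimoto` and `threaded_bulk_tanimoto` are hypothetical stand-ins. With RDKit, each chunk would instead be passed to `DataStructs.BulkTanimotoSimilarity`. Since the stand-in is pure Python it holds the GIL throughout, so the threads share one copy of the fingerprints but gain no speed, which is the situation the message describes.

```python
from concurrent.futures import ThreadPoolExecutor


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def bulk_tanimoto(query, fps):
    """Hypothetical stand-in for a bulk similarity call (one query vs. many)."""
    out = []
    for fp in fps:
        union = popcount(query | fp)
        out.append(popcount(query & fp) / union if union else 0.0)
    return out


def threaded_bulk_tanimoto(query, fps, n_threads=4, chunk_size=1000):
    """Score fps against query in chunks, one chunk per worker task.

    Every task reads from the same fps list, so the fingerprints
    themselves are held in memory only once.
    """
    def work(start):
        return bulk_tanimoto(query, fps[start:start + chunk_size])

    results = []
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # pool.map yields results in submission order, so chunk order is kept.
        for part in pool.map(work, range(0, len(fps), chunk_size)):
            results.extend(part)
    return results
```

Swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` would sidestep the GIL, but each worker process would then need its own copy of the fingerprints, which is the memory cost the message is trying to avoid.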

Thanks,
Dave

--

David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
