Hi,

My name is Josh Ring.

I am interested in raw compute performance from C modules called from
Python, most of which are single-threaded (e.g. NumPy, SciPy, etc.).

Some things that make sense with many threads:
1. Read-only global state can avoid locks entirely in multithreaded code,
preventing the cache line invalidations that kill scaling beyond 2-4
threads.
2. Reference count incr/decr can be paused when entering the parallel
region, to avoid invalidating the cache lines holding those objects;
read-only access makes this safe.
3. Locality of memory: use thread-local stack storage by default, with heap
allocations bound per thread. This is essential to scale beyond 4 threads
and on NUMA server systems.
4. Leave the GIL intact for single-threaded code to do the "cleanup stage"
of temporaries after the parallel computation has finished.
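To illustrate the structure points 1, 3 and 4 describe, here is a minimal sketch using today's standard library (the names `SHARED`, `worker` and `parallel_sum` are my own for illustration; under the current GIL this pattern does not actually scale for pure-Python compute, it only shows the intended shape: read-only shared state, per-thread local work, single-threaded merge):

```python
from concurrent.futures import ThreadPoolExecutor

# Read-only shared state: every thread reads it, nothing mutates it,
# so no locks are needed (point 1).
SHARED = list(range(1_000_000))

def worker(bounds):
    lo, hi = bounds
    # Each thread computes using purely local variables (point 3).
    return sum(SHARED[lo:hi])

def parallel_sum(n_threads=4):
    step = len(SHARED) // n_threads
    chunks = [(i * step, (i + 1) * step) for i in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = list(pool.map(worker, chunks))
    # "Cleanup stage": merging the per-thread temporaries happens
    # back on a single thread (point 4).
    return sum(partials)
```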

   - I liked the approach of a "parallel region", where data does not need
   to be pickled and threads can directly access shared memory read-only.


   - If global state is unchangeable from a threaded region, we can avoid
   many gotchas and races, leaving the GIL (almost) alone.


   - If reference counting can be "paused" during the parallel region, we
   can avoid the cache invalidation caused by multiple threads doing
   incr/decr on shared objects, which limits scaling with more threads;
   this is evident even with 2 threads.


   - Thread-local storage is the default and only option, avoiding clunky
   "threading.local()" storage classes. "Thread-bound" heap allocations
   would also increase efficiency and reduce "false sharing":
   https://en.wikipedia.org/wiki/False_sharing


   - Implement the parallel region using a function with a decorator, akin
   to OpenMP? The function then defines the scope of the local variables,
   the start and end of the parallel region, when to return, etc., in a
   straightforward manner.


   - By default, the objects returned from the parallel region are returned
   as separate per-thread objects (avoiding GIL contention); these
   temporary objects are then merged into a list once control returns to a
   single thread.


   - Objects marked @thread_shared have their state merged from the
   thread-local copies once execution of all the threads has finished. This
   could be made intelligent if an index is provided to put the entries in
   the right place in a list/dict, etc.
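To make the decorator idea concrete, here is a hypothetical sketch of what such a parallel-region decorator could look like, built on plain threading only to show the intended semantics (the name `parallel_region` and the merge behaviour are my invention, not an existing API; a real implementation would enforce read-only access and pause refcounting underneath):

```python
import threading

def parallel_region(num_threads):
    """Hypothetical decorator: run the function once per thread, then
    merge the per-thread return values into a list single-threaded."""
    def decorate(func):
        def run(*args, **kwargs):
            results = [None] * num_threads  # one slot per thread
            def call(tid):
                # Each thread writes only to its own slot: no sharing,
                # no false sharing on the hot path in the real design.
                results[tid] = func(tid, *args, **kwargs)
            threads = [threading.Thread(target=call, args=(tid,))
                       for tid in range(num_threads)]
            for t in threads:
                t.start()
            for t in threads:
                t.join()
            # Control is back on a single thread: merge the temporaries.
            return results
        return run
    return decorate

@parallel_region(num_threads=4)
def squares(tid, base):
    # Thread id doubles as the index into the merged result list.
    return (base + tid) ** 2

# squares(10) -> [100, 121, 144, 169]
```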

Thoughts?

This proposal borrows several ideas from PyParallel, but without the
Windows-only implementation and the focus on I/O. It is more focused on
accelerating raw compute performance and is tailored to high-performance
computing, for instance at CERN.
_______________________________________________
Python-ideas mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
