On Sat, 21 Jan 2023, John Ference wrote:
> I may have a different mental model for threadpools than is correct. I am
> assuming (for J) that we will use no more than one thread per core.
The threads are OS threads, and the OS will multiplex threads over cores
however it sees fit. If the number of active threads is close to the number
of cores, they should mostly stay put, but there's no guarantee of that.
There are some affinity controls we could expose and take advantage of, but
we don't yet.
> In one simple model, we have a straightforward hierarchy where each
> threadpool contains threads, and tasks are constrained to a threadpool.
> In this stylized 'hierarchy model':
> - Tasks can only use threads in their pool.
> - Parallelizable primitives use only threads within the pool in which
>   they originate.
> - Tasks block sequentially within a pool if they contain parallelized
>   primitives.
> In the actual model, unless I misunderstand it: Pool 0 appears to have a
> special quality allowing it to run primitives from other pools. No
> primitive parallelization occurs except in pool 0.
A task runs within a threadpool, but it can spawn new tasks in other
threadpools, if it wants, using u t. n. And parallelisable primitives will
always be run by threads in pool 0 (for now); never by threads in the pool
where they were kicked off (unless that happens to be pool 0). (Actually, I
think that right now, the thread that kicked off the computation will also
help it along. Not sure whether that's a good idea.)
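Concretely, something like this (a sketch; I'm taking 0 T. n to create a
worker thread in pool n, and the matrix product is just throwaway work):

   0 T. 0                       NB. one worker in pool 0 (runs primitives)
   0 T. 1 [ 0 T. 1              NB. two workers in pool 1
   p =. {{ +/ .*~ ?. 500 500 $ 0 }} t. 1 ''  NB. task in pool 1; result is a pyx
   $ > p                        NB. opening the pyx blocks until the task is done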
> Creating two tasks in a two-thread, non-0 pool is wholly equivalent to
> creating one task in each of two non-0 pools.
Equivalent in what sense? It's equivalent in that, in both cases, each task
will get a thread to itself straightaway. But obviously the subsequent state
will be different (as will be manifest if, e.g., you create more tasks in
those pools).
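In code, the two set-ups look like this (a sketch; f stands for any task
verb, and (a) and (b) are meant as separate sessions):

   0 T. 1 [ 0 T. 1              NB. (a) two threads in pool 1...
   f t. 1 '' [ f t. 1 ''        NB. ...and two tasks in pool 1
   0 T. 2 [ 0 T. 1              NB. (b) one thread each in pools 1 and 2...
   f t. 2 '' [ f t. 1 ''        NB. ...and one task in each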
> If there is a worker thread in pool 0, then two multi-threaded primitives
> called separately from non-0 threadpool tasks will require the pool 0
> worker to dedicate cycles to both tasks.
Yes.
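For instance (a sketch; one worker in each pool, with both matrix products
leaning on the lone pool-0 worker):

   0 T. 2 [ 0 T. 1 [ 0 T. 0    NB. one worker each in pools 0, 1, and 2
   m =. ?. 800 800 $ 0
   a =. {{ +/ .*~ m }} t. 1 '' NB. the product parallelises via pool 0...
   b =. {{ +/ .*~ m }} t. 2 '' NB. ...and this one competes for the same worker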
> Cool. If I understand approximately, this is the pricey-equipment numa
> version. Is this equivalent to setting node-threadpools rather than
> simply threadpools, or different?
The way I was imagining it, it would be entirely an application concern. You
would set the thread affinity of some threadpool to cores which all happen
to be in one numa node, and set one threadpool to punt its compute work to
another threadpool which just happens to be on the same numa node. But I
dunno; it might be better to do that automatically.
> I am assuming compute-bound tasks. Per the 'actual model' interpretation
> above, if the primitives across all non-zero pools are using threads in
> pool 0, pool 0 may receive multiple concurrent requests for
> parallelization of primitives. What happens in this case? Are the
> primitives parallelized serially, with one perhaps blocking? Is the later
> request not parallelized? (That seems not to happen, based on the timing
> discussed.)
Parallelisable primitives are always queued, and run (in parallel) as soon
as there are threads available to work on them. User tasks will be run in
parallel if there are free threads; if there are no free threads, they will
be queued to run in parallel if you specify t. 'worker', and serialised if
you don't.
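A sketch of the user-task case (placeholder verb; pool 1 has two workers):

   0 T. 1 [ 0 T. 1             NB. two worker threads in pool 1
   f =. {{ 6!:3 ] 1 }}         NB. placeholder job: sleep for one second
   a =. f t. 1 ''              NB. free thread: runs immediately, in parallel
   b =. f t. 1 ''              NB. second free thread: likewise
   c =. f t. (1;'worker') ''   NB. no free thread: queued until one frees up
   NB. without 'worker', the third call would be serialised in the caller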
I didn't look closely enough, and didn't notice that your benchmark also ran
multiple matrix multiplications in parallel; I think that's probably
serialising, and is what's causing the variance. Does the problem go away if
you replace the definition of uf with the following?
uf =. {{u t. (y;'worker') ''}}"0
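For what it's worth, I'd expect it to be used like this (hypothetical names;
as before, each y is a threadpool number, and u ignores its argument):

   0 T. 2 [ 0 T. 1                    NB. ensure pools 1 and 2 each have a worker
   job =. {{ +/ .*~ ?. 300 300 $ 0 }} NB. placeholder unit of work
   pyxes =. job uf 1 2                NB. one queued task in each of pools 1 and 2
   $ > pyxes                          NB. opening the pyxes blocks until both finish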
> Similarly, if the primitives from all pools are partly using pool 0: why
> would I create pool 1 and pool 2 and send tasks to each, rather than
> simply creating 2 tasks in pool 1? What would be the purpose of adding
> more than one pool (aside from the numa case above)?
The most obvious reason is temporal cache locality. If you've got a lot of
tasks that are working on the same data, then it's a good idea to run those
tasks on the same cores, so that the data remain in those cores' L2. (In that
respect, all CPUs are 'numa', in that they can more quickly access data that
are in their L2 than data that aren't.)
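For example (a sketch; we can't pin threads to cores yet, so this only tends
to keep related work on the same cores):

   0 T. 2 [ 0 T. 1 [ 0 T. 1         NB. pool 1: two workers; pool 2: one
   shared =. ?. 1e6 $ 0             NB. data that several tasks will reuse
   r1 =. {{ +/ shared }} t. 1 ''    NB. related tasks go to the same pool...
   r2 =. {{ +/ *: shared }} t. 1 '' NB. ...so they tend to share warm caches
   other =. {{ +/ ?. 1e6 $ 0 }} t. 2 ''  NB. unrelated work goes elsewhere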
I'll also drop a link to
https://pvk.ca/Blog/2019/02/25/the-unscalable-thread-pool/, although I'm as
yet unsure of the implications.
-E
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm