On Sat, 21 Jan 2023, John Ference wrote:
> I may have a different mental model for threadpools than is correct. I am
> assuming (for J) that we will use no more than one thread per core.
The threads are OS threads, and the OS will multiplex threads over cores
however it sees fit. If the number of active threads is close to the number
of cores, they should mostly stay put, but there's no guarantee of that.
There are some affinity controls we could expose and take advantage of, but
we don't yet.
> In one simple model, we have a straightforward hierarchy where each
> threadpool contains threads, and tasks are constrained to a threadpool.
> In this stylized 'hierarchy model':
> - Tasks can only use threads in their pool.
> - Parallelizable primitives use only threads within the pool in which
>   they originate.
> - Tasks block sequentially within a pool if they contain parallelized
>   primitives.
> In the actual model, unless I misunderstand it: Pool 0 appears to have a
> special quality allowing it to run primitives from other pools. No
> primitive parallelization occurs except in pool 0.
A task runs within a threadpool, but it can spawn new tasks in other
threadpools, if it wants, using u t. n. And parallelisable primitives will
always be run by threads in pool 0 (for now); never by threads in the pool
where they were kicked off (unless that happens to be pool 0). (Actually, I
think that right now, the thread that kicked off the computation will also
help it along. Not sure whether that's a good idea.)
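Concretely, something like this (a sketch; I'm taking 0 T. n to create a
worker thread in pool n, and the matrix product is just throwaway work):

   0 T. 0                       NB. one worker in pool 0 (runs primitives)
   0 T. 1 [ 0 T. 1              NB. two workers in pool 1
   p =. {{ +/ .*~ ?. 500 500 $ 0 }} t. 1 ''  NB. task in pool 1; result is a pyx
   $ > p                        NB. opening the pyx blocks until the task is done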
> Creating two tasks in a two-thread, non-0 pool is wholly equivalent to
> creating one task in each of two non-0 pools.
Equivalent in what sense? It's equivalent in that, in both cases, each task
will get a thread to itself straightaway. But obviously the subsequent state
will be different (as will be manifest if, e.g., you create more tasks in
those pools).
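In code, the two set-ups look like this (a sketch; f stands for any task
verb, and (a) and (b) are meant as separate sessions):

   0 T. 1 [ 0 T. 1              NB. (a) two threads in pool 1...
   f t. 1 '' [ f t. 1 ''        NB. ...and two tasks in pool 1
   0 T. 2 [ 0 T. 1              NB. (b) one thread each in pools 1 and 2...
   f t. 2 '' [ f t. 1 ''        NB. ...and one task in each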
> If there is a worker thread in pool 0, then two multi-threaded primitives
> called separately from non-0 threadpool tasks will require the pool 0
> worker to dedicate cycles to both tasks.
Yes.
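For instance (a sketch; one worker in each pool, with both matrix products
leaning on the lone pool-0 worker):

   0 T. 2 [ 0 T. 1 [ 0 T. 0    NB. one worker each in pools 0, 1, and 2
   m =. ?. 800 800 $ 0
   a =. {{ +/ .*~ m }} t. 1 '' NB. the product parallelises via pool 0...
   b =. {{ +/ .*~ m }} t. 2 '' NB. ...and this one competes for the same worker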
> Cool. If I understand approximately, this is the pricey-equipment numa
> version. Is this equivalent to setting node-threadpools rather than
> simply threadpools, or different?
The way I was imagining it, it would be entirely an application concern. You
would set the thread affinity of some threadpool to cores which all happen
to be in one numa node, and set one threadpool to punt its compute work to
another threadpool which just happens to be on the same numa node. But I
dunno; it might be better to do that automatically.
> I am assuming compute-bound tasks. Per the 'actual model' interpretation
> above, if the primitives across all non-zero pools are using threads in
> pool 0, pool 0 may receive multiple concurrent requests for
> parallelization of primitives. What happens in this case? Are the
> primitives parallelized serially, with one perhaps blocking? Is the later
> request not parallelized? (That seems not to happen, based on the timing
> discussed.)
Parallelisable primitives are always queued, and run (in parallel) as soon
as there are threads available to work on them. User tasks will be run in
parallel if there are free threads; if there are no free threads, they will
be queued to run in parallel if you specify t. 'worker', and serialised if
you don't.
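A sketch of the user-task case (placeholder verb; pool 1 has two workers):

   0 T. 1 [ 0 T. 1             NB. two worker threads in pool 1
   f =. {{ 6!:3 ] 1 }}         NB. placeholder job: sleep for one second
   a =. f t. 1 ''              NB. free thread: runs immediately, in parallel
   b =. f t. 1 ''              NB. second free thread: likewise
   c =. f t. (1;'worker') ''   NB. no free thread: queued until one frees up
   NB. without 'worker', the third call would be serialised in the caller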
I didn't look closely enough, and didn't notice that your benchmark also ran
multiple matrix multiplications in parallel; I think that's probably
serialising, and is what's causing the variance. Does the problem go away if
you replace the definition of uf with the following?
uf =. {{u t. (y;'worker') ''}}"0
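For what it's worth, I'd expect it to be used like this (hypothetical names;
as before, each y is a threadpool number, and u ignores its argument):

   0 T. 2 [ 0 T. 1                    NB. ensure pools 1 and 2 each have a worker
   job =. {{ +/ .*~ ?. 300 300 $ 0 }} NB. placeholder unit of work
   pyxes =. job uf 1 2                NB. one queued task in each of pools 1 and 2
   $ > pyxes                          NB. opening the pyxes blocks until both finish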
> Similarly, if the primitives from all pools are partly using pool 0: why
> would I create pool 1 and pool 2 and send tasks to each, rather than
> simply creating 2 tasks in pool 1? What would be the purpose of adding
> more than one pool (aside from the numa case above)?
The most obvious reason is temporal cache locality. If you've got a lot of
tasks that are working on the same data, then it's a good idea to run those
tasks on the same cores, so that the data remain in those cores' L2. (In that
respect, all CPUs are 'numa', in that they can more quickly access data that
are in their L2 than data that aren't.)
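For example (a sketch; we can't pin threads to cores yet, so this only tends
to keep related work on the same cores):

   0 T. 2 [ 0 T. 1 [ 0 T. 1         NB. pool 1: two workers; pool 2: one
   shared =. ?. 1e6 $ 0             NB. data that several tasks will reuse
   r1 =. {{ +/ shared }} t. 1 ''    NB. related tasks go to the same pool...
   r2 =. {{ +/ *: shared }} t. 1 '' NB. ...so they tend to share warm caches
   other =. {{ +/ ?. 1e6 $ 0 }} t. 2 ''  NB. unrelated work goes elsewhere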
I'll also drop a link to
https://pvk.ca/Blog/2019/02/25/the-unscalable-thread-pool/, although I'm as
yet unsure of the implications.
-E
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm