On Sat, 21 Jan 2023, John Ference wrote:

> I may have an incorrect mental model of threadpools. I am assuming (for J) that we will use no more than one thread per core.

The threads are OS threads, and the OS will multiplex threads over cores however it sees fit. If the number of active threads is close to the number of cores, they should mostly stay put, but there's no guarantee of that. There are some affinity controls we could expose and take advantage of, but we don't yet.
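
For concreteness, here's a sketch of thread creation, assuming the 9.4 interface where 0 T. n adds a worker thread to threadpool n (pools start with no workers):

   0 T. 0   NB. add a worker thread to threadpool 0
   0 T. 0   NB. a second worker in pool 0
   0 T. 1   NB. a worker in threadpool 1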


> In one simple model, we have a straightforward hierarchy where each threadpool contains threads, and tasks are constrained to a threadpool.

> In this stylized 'hierarchy model':
> - Tasks can only use threads in their own pool.
> - Parallelizable primitives use only the threads of the pool in which they originate.
> - Tasks block sequentially within a pool if they contain parallelized primitives.

> In the actual model, unless I misunderstand it:
> - Pool 0 appears to have a special quality allowing it to run primitives from other pools.
> - No primitive parallelization occurs except in pool 0.

A task runs within a threadpool, but it can spawn new tasks in other threadpools if it wants, using u t. n. And parallelisable primitives will always be run by threads in pool 0 (for now), never by threads in the pool where they were kicked off (unless that happens to be pool 0). (Actually, I think that right now the thread that kicked off the computation will also help it along. I'm not sure whether that's a good idea.)
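
Schematically, with a made-up verb f:

   f =. {{ +/ y * y }}        NB. hypothetical compute verb
   r =. f t. 1 ] 1e6 ?@$ 0    NB. the task itself runs on a pool-1 thread
   NB. any parallelisable primitive inside f is worked on by pool 0 threads;
   NB. r is a pyx, and using its value waits for the task to finish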

> Creating two tasks in a two-thread, non-0 pool is wholly equivalent to creating one task in each of two non-0 pools.

Equivalent in what sense? It's equivalent in that, in both cases, each task will get a thread to itself straightaway. But obviously the subsequent state will be different (as becomes manifest if, e.g., you create more tasks in those pools).
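
Concretely, reusing the made-up f from above:

   NB. setup A: two workers in one pool, two tasks in that pool
   0 T. 1 [ 0 T. 1
   a =. f t. 1 ] 10 [ b =. f t. 1 ] 20
   NB. setup B (fresh session): one worker in each of pools 1 and 2
   0 T. 2 [ 0 T. 1
   a =. f t. 1 ] 10 [ b =. f t. 2 ] 20
   NB. in both, each task gets a thread to itself straightaway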

> If there is a worker thread in pool 0, then two multi-threaded primitives called separately from tasks in non-0 threadpools will require the pool 0 worker to dedicate cycles to both tasks.

Yes.

> Cool. If I understand approximately, this is the pricey-equipment numa version. Is this equivalent to setting per-numa-node threadpools rather than simply threadpools, or is it different?

The way I was imagining it, it would be entirely an application concern: you would set the thread affinity of some threadpool to cores which happen to all be in one numa node, and set one threadpool to punt its compute work to another threadpool which just happens to be on the same numa node. But I dunno; it might be better to do that automatically.


> I am assuming compute-bound tasks. Per the 'actual model' interpretation above, if the primitives across all non-zero pools are using threads in pool 0, pool 0 may receive multiple concurrent requests to parallelize primitives. What happens in this case? Are the primitives parallelized serially, perhaps with one blocking? Or is the later request simply not parallelized? (That seems not to happen, based on the timings discussed.)

Parallelisable primitives are always queued, and run (in parallel) as soon as there are threads available to work on them. User tasks will be run in parallel if there are free threads; if there are no free threads, they will be queued to run in parallel if you specify t. 'worker', and serialised if you don't.
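
In code, the difference for user tasks, if pool 1 has no free thread at the moment (again with the made-up f):

   a =. f t. 1 ] 10            NB. no free thread, so the task is serialised
   b =. f t. (1;'worker') 10   NB. queued; runs in parallel once a pool-1 thread frees up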

I didn't look closely enough, and didn't notice that your benchmark also ran multiple matrix multiplications in parallel--I think that's probably serialising, and is what's causing the variance. Does the problem go away if you replace the definition of uf with the following?

uf =. {{u t. (y;'worker') ''}}"0
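
(uf here is an adverb: u uf takes a list of pool numbers, rank 0, and starts u on '' as a queued 'worker' task in each pool. A hypothetical usage, with mm standing in for your benchmark's multiplication, and assuming each pool already has worker threads:)

   mm =. {{ a +/ .* a [ a =. ? 500 500 $ 0 [ y }}   NB. stand-in benchmark verb; ignores y
   r =. mm uf 1 2 3                                 NB. one queued task in each of pools 1 2 3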


> Similarly, if the primitives from all pools are partly using pool 0: why would I create pool 1 and pool 2 and send tasks to each, rather than simply creating 2 tasks in pool 1? What would be the purpose of adding more than one pool (aside from the numa case above)?

The most obvious reason is temporal cache locality. If you've got a lot of tasks that are working on the same data, then it's a good idea to run those tasks on the same cores, so that the data remain in those cores' L2. (In that respect, all CPUs are 'numa', in that they can more quickly access data that are in their L2 than data that aren't.)
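
That is, separate pools are a placement tool: send the tasks that share data to the same pool. A sketch, where stage1, stage2, and data are hypothetical:

   NB. both stages read the same big array; running them in one pool
   NB. keeps the array warm in that pool's cores' caches
   r1 =. stage1 t. 1 ] data
   r2 =. stage2 t. 1 ] data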

I'll also drop a link to https://pvk.ca/Blog/2019/02/25/the-unscalable-thread-pool/, although I'm as yet unsure of the implications.

 -E
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
