Responses inline below. Thank you! Cheers, John
On Sat, Jan 21, 2023 at 5:56 PM Elijah Stone <[email protected]> wrote:

> On Sat, 21 Jan 2023, John Ference wrote:
>
> > NB. Questions: What's the underlying behavior? Is this intended?
>
> I can't reproduce this.  I get:
>
>    ,.10{.res
> 0.117785
> 0.113945
> 0.116619
> 0.0783475
> 0.114737
> 0.0741682

Perhaps a bit unusual to see the consistency of the 0.11s runs, and then the 0.11s run in call 5 right after the 0.08s run in call 4. The timing seems peculiar, but I agree there is no pattern.

> That the first few timings are slower is to be expected, as the CPU is
> warming up.  But there doesn't seem to be any periodic bias such as you
> see.
>
> One idea (and I am just spitballing here): you have an amd cpu (I am
> assuming, given that you have 6 threads), and amd cpus have numa cache,
> which might somehow bias the work distribution.  Whereas I have an intel
> cpu, and those have uma cache.  (Or, if it's a newer intel cpu, could be
> big.little screwing things up.)

The original test was performed on an older Intel Xeon. Following your note, I duplicated the test on a newer Intel Xeon with additional cores (call this test #3). With 4 worker threads in pool 0, there is no observable cycle. With 10 worker threads in pool 0, there is a cyclical pattern with a period of 12, which persists across 500 instances of the test (warmup discarded):

0.048 0.049 0.049 0.049 0.050 0.050 0.049 0.035 0.035 0.035 0.036 0.036

This perhaps suggests that the performance of a call depends on its position in the sequence of calls relative to (1 + total number of threads in pool 0). In this case (test #3), however, the difference is much less severe and would not in itself suggest a change in program design.

I am not sure yet what this ultimately means for using the 6-core chip, except that continuing to observe the timings is appropriate. I may look for more productive diagnostics and compare with another intermediate Xeon chip.

> > NB. Why wouldn't each pool use its own threads for multi-threaded
> > primitives?
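An aside on method: a repeating pattern with a fixed period, like the one described above, can be checked mechanically rather than by eye. A minimal sketch in Python (not J; `best_period` and the synthetic timings are illustrative assumptions, not the measured data): fold the series at each candidate period and keep the period whose per-phase means leave the least residual variance.

```python
# Illustrative sketch (Python, not J): detect the period of a timing series
# by folding it at each candidate period and measuring residual variance.
from statistics import mean, pvariance

def best_period(timings, max_p=20):
    """Return the candidate period whose phase-folded groups are most
    internally consistent. All names here are hypothetical."""
    best, best_resid = None, float("inf")
    for p in range(2, max_p + 1):
        phases = [timings[i::p] for i in range(p)]
        # residual variance after subtracting each phase's own mean
        resid = mean(pvariance(ph) if len(ph) > 1 else 0.0 for ph in phases)
        if resid < best_resid:
            best, best_resid = p, resid
    return best

# Synthetic stand-in resembling the pattern reported above:
# seven ~0.05s calls followed by five ~0.035s calls, repeating.
cycle = [0.049] * 7 + [0.035] * 5
series = cycle * 40          # 480 samples, warmup already discarded

print(best_period(series))   # prints 12
```

On the real `res` series one would pass the measured timings (warmup discarded) instead of the synthetic cycle; real data would show a variance minimum rather than an exact zero at the true period.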
> Different workloads can be bottlenecked by different things.  If you are
> bottlenecked by i/o, you might want to have more threads than cores.  But
> if you are bottlenecked by compute or memory, having more threads than
> cores just adds overhead.  So you might have, e.g., threadpool 1 for i/o
> and threadpool 0 for compute, with appropriate numbers of threads in each.

I may have a different mental model of threadpools than is correct. I am assuming (for J) that we will use no more than one thread per core.

In one simple model, we have a straightforward hierarchy: each threadpool contains threads, and tasks are confined to a threadpool. In this stylized 'hierarchy model':

- Tasks can only use threads in their pool.
- Parallelizable primitives use only the threads of the pool in which they originate.
- Tasks block sequentially within a pool if they contain parallelized primitives.

In the actual model, unless I misunderstand it:

- Pool 0 appears to have a special quality allowing it to run primitives from other pools.
- No primitive parallelization occurs except in pool 0.
- Creating two tasks in a two-thread, non-0 pool is wholly equivalent to creating one task in each of two non-0 pools.
- If there is a worker thread in pool 0, then two multi-threaded primitives called separately from tasks in non-0 threadpools will both require the pool 0 worker to dedicate cycles to them.

> Something I want to do (but haven't implemented yet) is to let each pool
> pick which threadpool it punts its work to.  So e.g. you could have 4
> threadpools: i/o for numa node 0, compute for numa node 0, i/o for numa
> node 1, compute for numa node 1.

Cool. If I understand approximately, this is the pricey-equipment numa version. Is this equivalent to setting node-threadpools rather than simply threadpools, or is it different?

> > NB. What's the point of creating a second pool with multiple threads,
> > rather than separate pools?
>
> What do you mean by this?
>
> > NB. How is resource contention between pools adjudicated?
>
> Which resources?

I am assuming compute-bound tasks. Per the 'actual model' interpretation above, if the primitives across all non-0 pools are using threads in pool 0, then pool 0 may receive multiple concurrent requests to parallelize primitives. What happens in this case? Are the primitives parallelized serially, with one blocking? Or is the later request simply not parallelized? (That seems not to happen, based on the timings discussed.)

Similarly, if the primitives from all pools are partly using pool 0: why would I create pool 1 and pool 2 and send tasks to each, rather than simply creating 2 tasks in pool 1? What would be the purpose of adding more than one pool (aside from the numa case above)?

I expect that my assumptions about threadpools and multi-threaded primitives are off somewhere.

> -E
>
> P.S. very sorry for not responding yet to your earlier email--I have been
> tied up, but haven't forgotten it!

+1 no worries

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
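P.P.S. To make the contention question concrete, here is a sketch in Python (emphatically not J's actual implementation; `pool0_worker`, `parallel_primitive`, and the thread-per-task modeling are all illustrative assumptions) of the scenario where two tasks from different non-0 pools each need pool 0's single worker. If that worker is a shared, lockable resource, the two parallel sections serialize rather than overlap:

```python
# Illustrative sketch (Python, not J): a single pool-0 worker treated as a
# shared resource that primitives from other pools must borrow.
import threading, time

pool0_worker = threading.Lock()   # stand-in for pool 0's one worker thread
order = []                        # observed sequence of borrow/release events

def parallel_primitive(task_name):
    """A 'primitive' that needs pool 0's worker for its parallel section."""
    with pool0_worker:            # concurrent requests serialize here
        order.append((task_name, "borrow"))
        time.sleep(0.01)          # pretend to do the parallel work
        order.append((task_name, "release"))

# Two tasks launched from two different non-0 pools, modeled as threads.
tasks = [threading.Thread(target=parallel_primitive, args=(name,))
         for name in ("pool1-task", "pool2-task")]
for t in tasks:
    t.start()
for t in tasks:
    t.join()

# Each borrow is followed by its own release: the parallel sections never
# overlap, i.e. the second primitive waits on the first.
print(order)
```

Whether J actually serializes, queues, or declines to parallelize the later request is exactly the open question above; the sketch only shows what strict serialization would look like.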
