Hi Luis,

To massively oversimplify Amdahl’s law (https://en.wikipedia.org/wiki/Amdahl%27s_law): making things go faster always has a limiting factor. In parallel computation in a shared-memory model (as POSIX threads use), a major limiting factor will be the system’s memory bandwidth.
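
For reference, the bound from Amdahl’s law mentioned above is usually written as follows, with p the fraction of the run time that can be parallelised and n the number of threads. The symbols and the value p = 0.5 are illustrative assumptions only, not measurements from this thread; roughly speaking, time spent waiting on shared memory bandwidth behaves like the part that does not parallelise.

```latex
S(n) = \frac{1}{(1 - p) + \frac{p}{n}},
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p}
% Illustrative only, with an assumed p = 0.5:
%   S(2)  = 1 / (0.5 + 0.25)     \approx 1.33
%   S(48) = 1 / (0.5 + 0.5 / 48) \approx 1.96
%   limit = 1 / (1 - 0.5)        = 2
```

Under that assumption, going from 2 to 48 threads could at best move the speedup from about 1.33 to about 1.96, before any per-thread overhead is counted.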

An additional complication to achieving huge parallelism for the task in your test code is that while we as humans know that here, each thread is doing the same thing (load data, add stuff, save data), to the computer each thread looks like a different bit of code, so there is significant overhead. GPUs are much better at this sort of truly massive parallelism and SIMD (single instruction, multiple data).

A wrinkle to your specified task is that, as mentioned above, simple numerical addition is a very memory-bound activity. If the task were a bit more complex (using the registers / CPU caches more), the upper bound of beneficial thread use might well be higher than the 2 your results show.

A very contrived example that I’ve used to explore this, while investigating the floating point benchmark stuff (https://github.com/Fourmilab/floating_point_benchmarks/pull/1/files) (if you hack it up with your additional set_autopthread_targ that would be very valuable):

    use strict; use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);
    use PDL;
    use Inline Pdlpp => 'DATA';#, clean_after_build => 0;

    # Run the given block and print the elapsed wall-clock time in ms.
    sub with_time (&) {
      my @t = gettimeofday();
      &{$_[0]}();
      printf "%g ms\n", tv_interval(\@t) * 1000;
    }

    # Number of elements, taken from the command line.
    my $N = $ARGV[0];
    my ($a, $b, $c, $d) = (ones($N), sequence($N), sequence($N), sequence($N));

    # Plain PDL expression: each operator allocates an intermediate ndarray.
    print "with intermediates\n";
    with_time { print +($a + $b * $c + $d)->info, "\n" } for 1..5;

    # Single PP function: one pass over the data, no intermediates.
    print "manual loop-fusion\n";
    with_time { print PDL::a_plus_b_c_plus_d($a, $b, $c, $d)->info, "\n" } for 1..5;

    __DATA__
    __Pdlpp__
    pp_def('a_plus_b_c_plus_d',
      Pars => 'a(); b(); c(); d(); [o]o()',
      GenericTypes => ['D'],
      Code => '$o() = $a() + $b() * $c() + $d();',
    );

It might be worth investigating whether making PDL pthreads have affinity to their “own” core helps (due to guaranteeing the same registers and per-core caches); it would only take adding code similar to that in the top answer to https://stackoverflow.com/questions/1407786/how-to-set-cpu-affinity-of-a-particular-pthread to pthread_perform (https://github.com/PDLPorters/pdl/blob/1349803ea2aa67e7c854b1249338303b0e0ec351/Basic/Core/pdlmagic.c#L259-L266), probably just after the call to pthread_setspecific on line 262.

Finally, ATLAS (https://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Algebra_Software) uses per-installation automatic tuning to detect the optimum settings for parallelism (and other settings) on each system. That feels to me like it would be excessively hard to make work with PDL, but pull requests (with evidence that they make improvements) are always welcome!

Best regards,
Ed

From: Luis Mochán<mailto:moc...@icf.unam.mx>
Sent: 07 July 2022 00:45
To: Ed .<mailto:ej...@hotmail.com>
Cc: Eric Wheeler<mailto:p...@lists.ewheeler.net>; pdl-general@lists.sourceforge.net<mailto:pdl-general@lists.sourceforge.net>
Subject: Re: [Pdl-general] How do you create a set of cdouble matrices from (real, imag) values?

By the way, I understand that the current PDL uses the total number of processors as a default for set_autopthread_targ. Is this a good default? The documentation says that the gain after 4 pthreads is not much, and using more could be counterproductive. For example:

No threads:

    $ perl -MTime::HiRes=time -MPDL -E '
        set_autopthread_targ(shift) if @ARGV;
        $x = zeroes(5000,5000); $y =5;
        $t=time; $z=$x+$y;
        say "Time: ", time-$t, "\nThreads: ", get_autopthread_actual();
      ' 0
    Time: 0.0815010070800781
    Threads: 0

Two threads:

    $ perl -MTime::HiRes=time -MPDL -E '
        set_autopthread_targ(shift) if @ARGV;
        $x = zeroes(5000,5000); $y =5;
        $t=time; $z=$x+$y;
        say "Time: ", time-$t, "\nThreads: ", get_autopthread_actual();
      ' 2
    Time: 0.0741269588470459
    Threads: 2

There is a gain, but small.

Four threads:

    mochan@tlahuilli:~$ perl -MTime::HiRes=time -MPDL -E '
        set_autopthread_targ(shift) if @ARGV;
        $x = zeroes(5000,5000); $y =5;
        $t=time; $z=$x+$y;
        say "Time: ", time-$t, "\nThreads: ", get_autopthread_actual();
      ' 4
    Time: 0.0751421451568604
    Threads: 4

Slightly worse.

Default:

    mochan@tlahuilli:~$ perl -MTime::HiRes=time -MPDL -E '
        set_autopthread_targ(shift) if @ARGV;
        $x = zeroes(5000,5000); $y =5;
        $t=time; $z=$x+$y;
        say "Time: ", time-$t, "\nThreads: ", get_autopthread_actual();
      '
    Time: 0.0870430469512939
    Threads: 48

Even worse with 48 threads. Maybe this is a bad example to show the benefits of pthreads. Is there a simple, more convincing case?

Regards,
Luis

--
                                                                  o
W. Luis Mochán,                      | tel:(52)(777)329-1734     /<(*)
Instituto de Ciencias Físicas, UNAM  | fax:(52)(777)317-5388     `>/   /\
Av. Universidad s/n CP 62210         |                           (*)/\/  \
Cuernavaca, Morelos, México          | moc...@fis.unam.mx        /\_/\__/
GPG: 791EB9EB, C949 3F81 6D9B 1191 9A16 C2DF 5F0A C52B 791E B9EB

_______________________________________________
pdl-general mailing list
pdl-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pdl-general