Hi Luis,

To massively oversimplify Amdahl's law
(https://en.wikipedia.org/wiki/Amdahl%27s_law): the speed-up you can get from
parallelising is always capped by some limiting factor. For parallel
computation in a shared-memory model (which is what POSIX threads use), a major
limiting factor is the system’s memory bandwidth.
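
For a rough feel of that ceiling, here is a small C sketch of the usual
Amdahl's-law formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction
of the run time that parallelises and n is the number of threads. The function
names and the thread counts are illustrative assumptions, not measurements
from PDL:

```c
#include <stdio.h>

/* Amdahl's law: ideal speed-up with n workers when a fraction p
 * (0 <= p <= 1) of the run time can be parallelised.
 * Hypothetical helper, for illustration only. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Print the ideal speed-up for a few thread counts. */
void print_amdahl_table(double p)
{
    const int counts[] = {1, 2, 4, 48};
    for (size_t i = 0; i < sizeof counts / sizeof counts[0]; i++)
        printf("p=%.2f threads=%2d speed-up=%.3f\n",
               p, counts[i], amdahl_speedup(p, counts[i]));
}
```

If only half the run time parallelises (p = 0.5), the ideal speed-up stays
below 2 however many threads are used, and that is before any per-thread
overhead is counted.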

An additional complication to achieving huge parallelism for the task in your
test code: we humans know that each thread here is doing the same thing (load
data, add stuff, save data), but to the computer each thread looks like a
separate bit of code, so there is significant per-thread overhead. GPUs are
much better at this sort of truly massive parallelism and at SIMD (single
instruction, multiple data).

A wrinkle with your specific task is that, as mentioned above, simple numerical
addition is a very memory-bound activity. If the task were a bit more complex
(making more use of the registers / CPU caches), the upper bound of beneficial
thread use might well be higher than the 2 your results show. Here is a very
contrived example that I’ve used to explore this while investigating the
floating point benchmark stuff
(https://github.com/Fourmilab/floating_point_benchmarks/pull/1/files); if you
hack it up with your additional set_autopthread_targ, that would be very
valuable:

use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);
use PDL;
use Inline Pdlpp => 'DATA';#, clean_after_build => 0;

# Run the given block once and print how long it took, in milliseconds.
sub with_time (&) {
  my @t = gettimeofday();
  $_[0]->();
  printf "%g ms\n", tv_interval(\@t) * 1000;
}

my $N = $ARGV[0];
my ($a, $b, $c, $d) = (ones($N), sequence($N), sequence($N), sequence($N));

print "with intermediates\n";
with_time { print +($a + $b * $c + $d)->info, "\n" } for 1..5;
print "manual loop-fusion\n";
with_time { print PDL::a_plus_b_c_plus_d($a, $b, $c, $d)->info, "\n" } for 1..5;

__DATA__
__Pdlpp__
pp_def('a_plus_b_c_plus_d',
  Pars => 'a(); b(); c(); d(); [o]o()',
  GenericTypes => ['D'],
  Code => '$o() = $a() + $b() * $c() + $d();',
);
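
For anyone unfamiliar with what the two timings compare, here is a rough
sketch in plain C. This is not PDL's actual generated code, and the function
names are made up for illustration. The "with intermediates" form makes a full
pass over memory for each operator and writes temporaries, while the fused
form reads each input once and writes the output once:

```c
#include <stdlib.h>

/* Roughly what a + b * c + d costs when each operator is a separate
 * pass: three loops over n elements plus two temporary arrays.
 * Returns 0 on success, -1 if a temporary cannot be allocated. */
int with_intermediates(const double *a, const double *b, const double *c,
                       const double *d, double *out, size_t n)
{
    double *t1 = malloc(n * sizeof *t1);   /* holds b * c  */
    double *t2 = malloc(n * sizeof *t2);   /* holds a + t1 */
    if (!t1 || !t2) { free(t1); free(t2); return -1; }
    for (size_t i = 0; i < n; i++) t1[i] = b[i] * c[i];
    for (size_t i = 0; i < n; i++) t2[i] = a[i] + t1[i];
    for (size_t i = 0; i < n; i++) out[i] = t2[i] + d[i];
    free(t1);
    free(t2);
    return 0;
}

/* The fused equivalent: one pass over the data, no temporaries. */
void fused(const double *a, const double *b, const double *c,
           const double *d, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i] * c[i] + d[i];
}
```

Both compute the same values; the fused version simply moves far less data
through memory per arithmetic operation, which is why it leaves more room for
extra threads to help.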

It might be worth investigating whether giving each PDL pthread affinity to its
“own” core helps (by keeping it on the same per-core caches). It would only
take adding code similar to that in the top answer to
https://stackoverflow.com/questions/1407786/how-to-set-cpu-affinity-of-a-particular-pthread
to pthread_perform
(https://github.com/PDLPorters/pdl/blob/1349803ea2aa67e7c854b1249338303b0e0ec351/Basic/Core/pdlmagic.c#L259-L266),
probably just after the call to pthread_setspecific on line 262.
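
As a minimal sketch of what that might look like: the code below is Linux/glibc
only, since pthread_setaffinity_np is a non-portable GNU extension. The names
pin_self_to_core and worker are assumptions for illustration and do not exist
in PDL:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* needed for pthread_setaffinity_np, CPU_SET */
#endif
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single core (Linux/glibc only).
 * Returns 0 on success, or an errno-style code on failure. */
int pin_self_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

/* Hypothetical worker entry point: pin first, then do the work.
 * arg points at an int holding the core number to use. */
void *worker(void *arg)
{
    int core = *(int *)arg;
    (void)pin_self_to_core(core);  /* failure ignored: pinning is optional */
    /* ... the per-thread work would go here ... */
    return NULL;
}
```

Inside pthread_perform the core number would presumably be derived from the
thread's index, for example modulo the number of online cores. Whether pinning
actually helps would need measuring on each machine.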

Finally, ATLAS 
(https://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Algebra_Software) 
uses per-installation automatic tuning to detect the optimum settings for 
parallelism (and other settings) on each system. That feels to me like it would 
be excessively hard to make work with PDL, but pull requests (with evidence 
that they make improvements) are always welcome!

Best regards,
Ed

From: Luis Mochán<mailto:moc...@icf.unam.mx>
Sent: 07 July 2022 00:45
To: Ed .<mailto:ej...@hotmail.com>
Cc: Eric Wheeler<mailto:p...@lists.ewheeler.net>; 
pdl-general@lists.sourceforge.net<mailto:pdl-general@lists.sourceforge.net>
Subject: Re: [Pdl-general] How do you create a set of cdouble matrices from 
(real, imag) values?

By the way, I understand that the current PDL uses the total number of
processors as a default for set_autopthread_targ. Is this a good
default? The documentation says that the gain after 4 pthreads is not
much, and using more could be counterproductive. For example:

No threads:

$ perl -MTime::HiRes=time -MPDL -E '
set_autopthread_targ(shift) if @ARGV; $x = zeroes(5000,5000);
$y =5; $t=time; $z=$x+$y; say "Time: ", time-$t, "\nThreads: ",
get_autopthread_actual();
' 0
Time: 0.0815010070800781
Threads: 0

Two threads:

$ perl -MTime::HiRes=time -MPDL -E '
set_autopthread_targ(shift) if @ARGV; $x = zeroes(5000,5000);
$y =5; $t=time; $z=$x+$y; say "Time: ", time-$t, "\nThreads: ",
get_autopthread_actual();
' 2
Time: 0.0741269588470459
Threads: 2

There is a gain, but small.

Four threads:

mochan@tlahuilli:~$ perl -MTime::HiRes=time -MPDL -E '
set_autopthread_targ(shift) if @ARGV; $x = zeroes(5000,5000);
$y =5; $t=time; $z=$x+$y; say "Time: ", time-$t, "\nThreads: ",
get_autopthread_actual();
' 4
Time: 0.0751421451568604
Threads: 4

Slightly worse.

Default:

mochan@tlahuilli:~$ perl -MTime::HiRes=time -MPDL -E '
set_autopthread_targ(shift) if @ARGV; $x = zeroes(5000,5000);
$y =5; $t=time; $z=$x+$y; say "Time: ", time-$t, "\nThreads: ",
get_autopthread_actual();
'
Time: 0.0870430469512939
Threads: 48

Even worse with 48 threads.

Maybe this is a bad example to show the benefits of pthreads. Is there
a simple, more convincing case?

Regards,
Luis


--

                                                                  o
W. Luis Mochán,                      | tel:(52)(777)329-1734     /<(*)
Instituto de Ciencias Físicas, UNAM  | fax:(52)(777)317-5388     `>/   /\
Av. Universidad s/n CP 62210         |                           (*)/\/  \
Cuernavaca, Morelos, México          | moc...@fis.unam.mx   /\_/\__/
GPG: 791EB9EB, C949 3F81 6D9B 1191 9A16  C2DF 5F0A C52B 791E B9EB

