Hi John --
Yes, the much larger number of GETs with your custom distribution does
seem likely to be meta-data traffic. You'll probably want to go back to
consulting Brad about that. I know something about the runtime but very
little about distributions, especially at the level of detail you'll need.
greg
On 2/19/2015 3:34 PM, John MacFrenz wrote:
Hi,
It's hard to say without further information, particularly about the
ratio of remote references vs. local ones, but a 15-20x slowdown
wouldn't be unusual in our experience. What is your timing technique?
Are you using the built-in Chapel timing support, or are you timing
the entire execution from outside the program, or something else? (You
may have mentioned this earlier and I didn't catch it.)
With two locales, each locale gets assigned half of the array. The comm
diagnostics in my previous post were for this case. I'm using Chapel's
Timer object for timing, with separate timings for initializing domains,
arrays, etc., and another for the actual computation loops.
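Roughly, the timing is structured like the following sketch using
Chapel's Time module (the commented regions stand in for my actual init
and compute code):

    use Time;

    var initTimer, computeTimer: Timer;

    initTimer.start();
    // ... declare and initialize the domains and arrays here ...
    initTimer.stop();

    computeTimer.start();
    // ... the actual computation loops go here ...
    computeTimer.stop();

    writeln("init time:    ", initTimer.elapsed(), " s");
    writeln("compute time: ", computeTimer.elapsed(), " s");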
As to specifics ... you can probably skip collecting the init comm
stats, since most of that is built-in module initialization that is
outside your control. In the run comm diags, 1600 of the 1604 remote
forks done by locale 1 are presumably its half of the 3200 writeln()
calls that print the results. The number of get_nb (nonblocking GET)
operations seems high for a program that is supposed to do only 1000
remote loads on each locale. I take it this was run with the remote
cache enabled? If it were disabled I would expect all of the get_nb
values to be reported as plain gets instead, since a side effect of
enabling the remote cache is to turn blocking remote refs into
nonblocking ones. You might try running with the remote cache
disabled to see how performance and comm diags results compare.
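If it helps, one way to keep the module init and the result-printing
traffic out of the counts is to bracket just the compute loops with the
CommDiagnostics calls, roughly like this sketch (the commented region is
a placeholder for the actual kernel):

    use CommDiagnostics;

    // ... module init and domain/array setup happen before this point ...

    resetCommDiagnostics();          // clear anything already counted
    startCommDiagnostics();
    // ... only the compute loops go here ...
    stopCommDiagnostics();
    writeln(getCommDiagnostics());   // prints one counter set per locale

    // ... printing the result arrays afterward is not counted ...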
Actually, 3200 values should be printed per locale (though due to a bug
the actual number is 3400), totalling 6400 (6800). I'm calling writeln
on an array of reals, not on individual values. And yes, that was with
the remote cache enabled.
I tried running comm diagnostics with the standard block dist (I notice
I've been rather unclear about whether I was using the block dist or my
own dist; sorry for that) and got results which seem to make much more
sense.
Standard block dist, with --fast --cache-remote --no-local:
Locale 0: (get = 0, get_nb = 4222, put = 0, put_nb = 0, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 0, fork_fast = 0, fork_nb = 1602)
Locale 1: (get = 0, get_nb = 810, put = 0, put_nb = 0, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 1604, fork_fast = 0, fork_nb = 0)
My custom dist, no flags passed to the compiler:
Locale 0: (get = 4232, get_nb = 0, put = 0, put_nb = 0, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 0, fork_fast = 0, fork_nb = 1602)
Locale 1: (get = 14435, get_nb = 0, put = 800, put_nb = 0, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 1604, fork_fast = 0, fork_nb = 0)
My custom dist, with --no-local --fast:
Locale 0: (get = 4232, get_nb = 0, put = 0, put_nb = 0, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 0, fork_fast = 0, fork_nb = 1602)
Locale 1: (get = 9629, get_nb = 0, put = 800, put_nb = 0, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 1604, fork_fast = 0, fork_nb = 0)
So that extra communication overhead seems to be associated with my
custom distribution. I really can't think what could be causing this,
since all data needed for indexing operations should be privatized.
Any hints on how I could find out what's causing this? I can also post
the code next week (it needs commenting before I'd like to post it...)
if someone has time to take a look at it.
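A possible way to narrow this down: besides the counters, the
CommDiagnostics module can also report each remote operation as it
happens, typically with the source location that issued it, which
should show whether the extra GETs come from the distribution's
indexing path. A minimal sketch (the commented region is a placeholder
for a small, representative piece of the computation):

    use CommDiagnostics;

    startVerboseComm();   // report each remote get/put/fork as it happens
    // ... one small, representative compute step goes here (placeholder) ...
    stopVerboseComm();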
greg
On 2/19/2015 12:06 PM, John MacFrenz wrote:
Hi,
So... I did fiddle with Chapel settings, rebooted the machines, and
since another node was running under VirtualBox I tried different
settings for it. Somehow the problem got fixed, and I'm unable to
reproduce it any more even though I set the exact same settings I had
before. Well, if I face this issue again I'll try to gather more
information about it... It might also be that rebooting helped, though
I had already tried that once. Now the execution is still about 15-20
times slower, though I've got the impression that this is what could be
expected from Chapel?
The nodes had 4 and 2 cores (8 and 4 with HT) and 6 GB and 4 GB of
memory, respectively. The program's memory usage was very small, 3
megabytes per node. In that setup I tried one locale per node, but I
also tried using a single system with 2 to 4 locales.
I also got comm diagnostics, though I'm afraid I don't have enough
knowledge to interpret the results. Here are the results for two locales:
Init comm diagnostics:
Locale 0: (get = 0, get_nb = 0, put = 0, put_nb = 0, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 12, fork_fast = 0, fork_nb = 13)
Locale 1: (get = 0, get_nb = 175, put = 0, put_nb = 8, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 17, fork_fast = 0, fork_nb = 0)
Run comm diagnostics:
Locale 0: (get = 0, get_nb = 4222, put = 0, put_nb = 0, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 0, fork_fast = 0, fork_nb = 1602)
Locale 1: (get = 0, get_nb = 7218, put = 0, put_nb = 800, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 1604, fork_fast = 0, fork_nb = 0)
During the run (excluding initialization), each locale should get
about 1000 real(64) values from the other locale. On top of that, the
results are printed with writeln (with GASNET_ROUTE_OUTPUT=0), with a
total of about 3200 reals printed per locale, which is included in the
run comm diagnostics. Do those comm statistics seem reasonable for that?
19.02.2015, 18:58, "Michael Ferguson" <[email protected]
<mailto:[email protected]>>:
Hi John -
How many cores does your system have? How much memory? How much
memory is the program using? How many locales are you launching on
the single system? How many threads are you assigning to each locale?
I’d bet that your problem is either from different threads
contending over the same processor resources (which you can limit,
as Greg pointed out, with CHPL_RT_NUM_THREADS_PER_LOCALE) or from using
too much memory since you are running many locales on a single
machine – and Chapel doesn’t currently try to reduce its per-locale
resources when oversubscribed in this manner.
Of course, it could be the communication as well – you can check
that. You can also instrument your program to print out
communication counts (as I described in an earlier email – try
mirroring the use of CommDiagnostics in
chapel-lang-github/test/performance/ferguson/remote-class-read.chpl).
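A generic version of that pattern looks roughly like this (a sketch,
not a copy of that test; the commented region stands in for the code
being measured):

    use CommDiagnostics;

    startCommDiagnostics();
    // ... the code being measured goes here (placeholder) ...
    stopCommDiagnostics();

    const counts = getCommDiagnostics();
    for loc in Locales do
      writeln("locale ", loc.id, ": ", counts[loc.id]);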
You can also try running on a real cluster…
Cheers,
-michael
From: Greg Titus <[email protected]>
Date: Thursday, February 19, 2015 at 11:41 AM
To: John MacFrenz <[email protected]>
Cc: Michael Ferguson <[email protected]>,
"[email protected]" <[email protected]>
Subject: Re: Variable Block Distributions
Folks here pointed out a mistake I made in the discussion below:
the environment variable that sets the number of threads and thus
processor cores to use is CHPL_RT_NUM_THREADS_PER_LOCALE. I left
off the '_RT' part of that variable name below. It won't work if
it's not spelled right! :-)
greg
On 2/19/2015 8:45 AM, Greg Titus wrote:
Hi John --
(You already know the following as general knowledge, but I
thought I'd include it for people newer to multi-locale
programming who may be following this conversation.)
For context, with current Chapel it's normal for programs to
suffer some performance loss when moving from single-locale to
multi-locale execution. Using multiple locales offers more
opportunity for parallelism, but at the cost of reduced intra-task
performance due to network communication required for inter-locale
variable references. The effect varies across programs depending
on how much remote communication they do, but since a remote
reference can easily take 1000 times as long as a local one, it
doesn't take much to have a big effect. For example, with Chapel
1.10 on a Cray XC we don't see any drop-off in performance from 1
locale to 2 on the Stream benchmark, but our Stream doesn't do
many inter-locale memory references. But for the RA benchmark,
which does a lot of inter-locale references (in fact that's what
it's measuring), our multi-locale performance doesn't match that
on a single locale until we get up to 8-32 locales, depending on
circumstances. And the Cray XC has a very high-performance network
compared to UDP or MPI over ethernet.
That said, the >100x slowdown you're seeing seems a little high,
unless your test case is really doing a lot of remote references.
If it isn't, or at least shouldn't be, perhaps you're seeing a lot
of remote communication for internal references to meta-data,
within your distribution code? If this is the case, then turning
on remote caching could well improve matters. In fact that might
be a good test to rule this hypothesis in or out.
A secondary effect with GASNET_SPAWNFN=L could be oversubscription
of the processor cores due to running more than one Chapel locale
per compute node. To reduce the level of oversubscription you
could set CHPL_NUM_THREADS_PER_LOCALE to the number of
compute-node cores divided by the number of locales you're running
on the compute node, but don't set it to less than 2 or you could
deadlock/livelock due to internal starvation. However, if you're
seeing the same slowdown with GASNET_SPAWNFN=S and one Chapel
locale per compute node then I don't think this is something that
is afflicting you right now.
hope this helps,
greg
On 2/19/2015 4:42 AM, John MacFrenz wrote:
Hi,
I'll give --cache-remote a try later. However, for now I'm facing
some problems which definitely should be solved first...
The problem I'm having is that using GASNET_SPAWNFN=L with the
UDP conduit and more than one locale causes the program to run
_very_ slowly. For example, with one locale my test program took
0.20 sec to run. With two locales it took 65 seconds. The same
can be observed when running with GASNET_SPAWNFN=S with the UDP
conduit on two separate machines. Using the MPI conduit didn't make
a difference. Here are the environment variables I'm using:
CHPL_HOME: /home/share/chapel/chapel-git
script location: /home/share/chapel/chapel-git/util
CHPL_HOST_PLATFORM: linux32
CHPL_HOST_COMPILER: gnu
CHPL_TARGET_PLATFORM: linux32
CHPL_TARGET_COMPILER: gnu
CHPL_TARGET_ARCH: unknown
CHPL_LOCALE_MODEL: flat
CHPL_COMM: gasnet
CHPL_COMM_SUBSTRATE: udp
CHPL_GASNET_SEGMENT: everything
CHPL_TASKS: fifo
CHPL_LAUNCHER: amudprun
CHPL_TIMERS: generic
CHPL_MEM: cstdlib
CHPL_MAKE: gmake
CHPL_ATOMICS: intrinsics
CHPL_NETWORK_ATOMICS: none
CHPL_GMP: none
CHPL_HWLOC: none
CHPL_REGEXP: none
CHPL_WIDE_POINTERS: struct
CHPL_LLVM: none
CHPL_AUX_FILESYS: none
Any idea what could be causing this? As I said in a previous
post, my target environment is a heterogeneous (all x86, though)
commodity cluster with Ethernet connections, so either the UDP or the
MPI conduit is the one I'd use.
18.02.2015, 23:41, "Greg Titus" <[email protected]> <mailto:[email protected]>:
Hi John --
A little bit of follow-up to what Michael says here ...
The "nemesis" he refers to is the internal name of the
particular Qthreads scheduler we use when
CHPL_LOCALE_MODEL=flat. Our understanding is that the nemesis
scheduler currently doesn't move qthreads (and by extension,
Chapel tasks) from pthread to pthread, which would break the use
of pthread local storage inside the remote caching
implementation. But there are significant caveats here:
* We use a different Qthreads scheduler when
CHPL_LOCALE_MODEL=numa, and that one definitely does move
qthreads (thus Chapel tasks) from pthread to pthread.
* We can't guarantee that we'll always use "nemesis" with the
flat locale model.
* We can't guarantee that, even if we do keep using it,
"nemesis" will continue to not move qthreads (thus Chapel
tasks) from pthread to pthread.
Taken together, this basically says that although we haven't
observed remote caching failures with qthreads, that shouldn't
be taken as evidence that it definitely does work now or will
work in the future.
greg
On 2/18/2015 2:31 PM, Michael Ferguson wrote:
Hi -
One more thing about the --cache-remote feature, just to be
clear and for future reference:
The remote caching depends on pthread local storage, and Chapel
task movement among worker pthreads in Qthreads-based tasking
could break it. So far we haven't seen this happen, but we
cannot guarantee it won't. Symptoms of a failure could include
silent wrong answers or segfaults, either of which could be
solid or intermittent/sporadic.
I *think* that this problem won't come up with the nemesis
qthreads scheduler, but we need to do some careful analysis
before we can declare the --cache-remote feature safe to use
with qthreads.
Cheers,
-michael