Re: Variable Block Distributions

Greg Titus Thu, 19 Feb 2015 13:56:32 -0800

Hi John --

It's hard to say without further information, particularly about the ratio of remote references vs. local ones, but a 15-20x slowdown wouldn't be unusual in our experience. What is your timing technique? Are you using the built-in Chapel timing support, or are you timing the entire execution from outside the program, or something else? (You may have mentioned this earlier and I didn't catch it.)

As to specifics ... you can probably skip collecting the init comm stats, since most of that is built-in module initialization that is outside your control. In the run comm diags, 1600 of the 1604 remote forks done by locale 1 are presumably its half of 3200 writeln() calls that print the results. The number of get_nb (nonblocking GET) operations seems high for a program that is supposed to do only 1000 remote loads on each locale. I take it this was run with the remote cache enabled? If it were disabled I would expect all of the get_nb values to be reported as plain gets instead, since a side effect of enabling the remote cache is to turn blocking remote refs into nonblocking ones. You might try running with the remote cache disabled to see how performance and comm diags results compare.


greg
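
The accounting done by eye in the reply above can be reproduced with a few lines of Python. This is only a minimal sketch, not part of any Chapel tooling: parse_comm_diags, EXPECTED_REMOTE_LOADS and WRITELN_FORKS are hypothetical names introduced here, and the input string is the run comm diagnostics quoted in John's message below. The two constants are the estimates given in the thread, not measured values.

```python
import re


def parse_comm_diags(line):
    """Parse '(get = 0, get_nb = 4222, ...) (get = 0, ...)' into one dict per locale."""
    locales = []
    # Each parenthesized group holds the counters for one locale, in locale order.
    for group in re.findall(r"\(([^()]*)\)", line):
        counts = {}
        for item in group.split(","):
            name, value = item.split("=")
            counts[name.strip()] = int(value.strip())
        locales.append(counts)
    return locales


# Run comm diagnostics for two locales, copied from the thread.
RUN_DIAGS = (
    "(get = 0, get_nb = 4222, put = 0, put_nb = 0, test_nb = 0, wait_nb = 0, "
    "try_nb = 0, fork = 0, fork_fast = 0, fork_nb = 1602) "
    "(get = 0, get_nb = 7218, put = 0, put_nb = 800, test_nb = 0, wait_nb = 0, "
    "try_nb = 0, fork = 1604, fork_fast = 0, fork_nb = 0)"
)

EXPECTED_REMOTE_LOADS = 1000  # assumption: John's estimate of remote loads per locale
WRITELN_FORKS = 1600          # assumption: Greg's estimate, half of 3200 writeln() calls

run = parse_comm_diags(RUN_DIAGS)
for locale_id, counts in enumerate(run):
    excess = counts["get_nb"] - EXPECTED_REMOTE_LOADS
    print(f"locale {locale_id}: get_nb = {counts['get_nb']}, "
          f"excess over expected loads = {excess}")

# Forks on locale 1 that the writeln() estimate does not account for.
print("locale 1 unexplained forks:", run[1]["fork"] - WRITELN_FORKS)
```

Running the same parser on diagnostics collected with the remote cache disabled would make the get vs. get_nb comparison suggested above straightforward.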


On 2/19/2015 12:06 PM, John MacFrenz wrote:

Hi,
So... I did fiddle with Chapel settings, booted the machines and, as another node was running under VirtualBox, I tried different settings for it. Somehow the problem got fixed, and I'm unable to reproduce it any more even though I set the exact same settings I had before. Well, if I face this issue again I'll try to gather more information about it... It might also be that booting helped, though I had already tried that once. Now the execution is still about 15-20 times slower, though I've got the impression that this is what could be expected from Chapel?
Nodes had 4 and 2 cores (with HT 8 and 4) and 6GB and 4GB memory. Memory usage of the program was very small, 3 megabytes per node. In that setup I tried one locale per node, but I also tried using a single system with 2 to 4 locales.
I also got comm diagnostics, though I'm afraid I don't have enough knowledge to interpret the results. Here are the results for two locales:
Init comm diagnostics
(get = 0, get_nb = 0, put = 0, put_nb = 0, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 12, fork_fast = 0, fork_nb = 13)
(get = 0, get_nb = 175, put = 0, put_nb = 8, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 17, fork_fast = 0, fork_nb = 0)
Run comm diagnostics
(get = 0, get_nb = 4222, put = 0, put_nb = 0, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 0, fork_fast = 0, fork_nb = 1602)
(get = 0, get_nb = 7218, put = 0, put_nb = 800, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 1604, fork_fast = 0, fork_nb = 0)
During the run (excluding initialization), each locale should get about 1000 real(64) values from another locale. On top of that, results are printed with writeln with GASNET_ROUTE_OUTPUT=0, with a total of about 3200 reals printed per locale, which is included in the run comm diagnostics. Do those comm statistics seem reasonable for that?
19.02.2015, 18:58, "Michael Ferguson" <[email protected]>:
Hi John -
How many cores does your system have? How much memory? How much memory is the program using? How many locales are you launching on the single system? How many threads are you assigning to each locale?
I’d bet that your problem is either from different threads contending over the same processor resources (which you can limit, as Greg pointed out, with CHPL_RT_NUM_THREADS_PER_LOCALE) or from using too much memory, since you are running many locales on a single machine – and Chapel doesn’t currently try to reduce its per-locale resources when oversubscribed in this manner.
Of course, it could be the communication as well – you can check that. You can also instrument your program to print out communication counts (as I described in an earlier email – try mirroring the use of CommDiagnostics in chapel-lang-github/test/performance/ferguson/remote-class-read.chpl).
You can also try running on a real cluster…
Cheers,
-michael
From: Greg Titus <[email protected]>
Date: Thursday, February 19, 2015 at 11:41 AM
To: John MacFrenz <[email protected]>
Cc: Michael Ferguson <[email protected]>, "[email protected]" <[email protected]>
Subject: Re: Variable Block Distributions
Folks here pointed out a mistake I made in the discussion below: the environment variable that sets the number of threads and thus processor cores to use is CHPL_RT_NUM_THREADS_PER_LOCALE. I left off the '_RT' part of that variable name below. It won't work if it's not spelled right! :-)
greg


On 2/19/2015 8:45 AM, Greg Titus wrote:
Hi John --
(You already know the following as general knowledge, but I thought I'd include it for people newer to multi-locale programming who may be following this conversation.)
For context, with current Chapel it's normal for programs to suffer some performance loss when moving from single-locale to multi-locale execution. Using multiple locales offers more opportunity for parallelism, but at the cost of reduced intra-task performance due to network communication required for inter-locale variable references. The effect varies across programs depending on how much remote communication they do, but since a remote reference can easily take 1000 times as long as a local one, it doesn't take much to have a big effect. For example, with Chapel 1.10 on a Cray XC we don't see any drop-off in performance from 1 locale to 2 on the Stream benchmark, but our Stream doesn't do many inter-locale memory references. But for the RA benchmark, which does a lot of inter-locale references (in fact that's what it's measuring), our multi-locale performance doesn't match that on a single locale until we get up to 8-32 locales, depending on circumstances. And the Cray XC has a very high-performance network compared to UDP or MPI over ethernet.
That said, the >100x slowdown you're seeing seems a little high, unless your test case is really doing a lot of remote references. If it isn't, or at least shouldn't be, perhaps you're seeing a lot of remote communication for internal references to meta-data, within your distribution code? If this is the case, then turning on remote caching could well improve matters. In fact that might be a good test to rule this hypothesis in or out.
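To put rough numbers on "it doesn't take much", here is a back-of-the-envelope sketch in Python. It assumes a deliberately crude model that is not from the thread: every reference is issued serially, a local reference costs 1 unit, a remote one costs remote_cost units (1000 is the rough figure quoted above), and there is no caching or overlap of communication. The function names are hypothetical.

```python
def expected_slowdown(remote_fraction, remote_cost=1000.0):
    """Slowdown relative to an all-local run when a fraction of references is remote."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    # Local references cost 1 unit each, remote references cost remote_cost units.
    return (1.0 - remote_fraction) + remote_fraction * remote_cost


def remote_fraction_for(slowdown, remote_cost=1000.0):
    """Invert the model: which remote fraction would explain an observed slowdown?"""
    if slowdown < 1.0 or remote_cost <= 1.0:
        raise ValueError("slowdown must be >= 1 and remote_cost must be > 1")
    return (slowdown - 1.0) / (remote_cost - 1.0)


# Slowdowns mentioned in the thread: 15-20x, and 65 s vs. 0.20 s.
for slowdown in (15.0, 20.0, 65.0 / 0.20):
    fraction = remote_fraction_for(slowdown)
    print(f"{slowdown:6.0f}x slowdown <-> {100.0 * fraction:.1f}% of references remote")
```

Under this model a 15-20x slowdown corresponds to only about 1.4% to 1.9% of references being remote, which is why a small amount of unintended remote traffic (for example, to distribution meta-data) can dominate the run time. Real programs overlap communication and may cache remote data, so treat the result as an order-of-magnitude guide only.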
A secondary effect with GASNET_SPAWNFN=L could be oversubscription of the processor cores due to running more than one Chapel locale per compute node. To reduce the level of oversubscription you could set CHPL_NUM_THREADS_PER_LOCALE to the number of compute-node cores divided by the number of locales you're running on the compute node, but don't set it to less than 2 or you could deadlock/livelock due to internal starvation. However, if you're seeing the same slowdown with GASNET_SPAWNFN=S and one Chapel locale per compute node then I don't think this is something that is afflicting you right now.
hope this helps,
greg
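The rule of thumb in the message above (compute-node cores divided by locales on that node, never less than 2) can be written down as a small Python helper. This is a sketch under assumptions: threads_per_locale and oversubscription_env are hypothetical names, rounding down with integer division is a choice made here rather than something the thread specifies, and the variable name uses the spelling CHPL_RT_NUM_THREADS_PER_LOCALE given in Greg's follow-up correction.

```python
def threads_per_locale(cores_on_node, locales_on_node):
    """Cores divided by locales on the node, rounded down, but never below 2."""
    if cores_on_node < 1 or locales_on_node < 1:
        raise ValueError("cores_on_node and locales_on_node must be positive")
    # The floor of 2 follows the warning about deadlock/livelock from starvation.
    return max(2, cores_on_node // locales_on_node)


def oversubscription_env(cores_on_node, locales_on_node):
    """Environment setting to export before launching the Chapel program."""
    threads = threads_per_locale(cores_on_node, locales_on_node)
    return {"CHPL_RT_NUM_THREADS_PER_LOCALE": str(threads)}


# Example: a 4-core node running 2 locales, and a 2-core node running 4 locales.
print(oversubscription_env(4, 2))
print(oversubscription_env(2, 4))
```

In the second case the floor of 2 wins, so the node stays oversubscribed; the helper only limits oversubscription, it cannot remove it when there are more locales than half the cores.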


On 2/19/2015 4:42 AM, John MacFrenz wrote:
Hi,
I'll give --cache-remote a try later. However, for now I'm facing some problems which definitely should be solved first...
The problem I'm having is that using GASNET_SPAWNFN=L with the UDP conduit with more than one locale causes the program to run _very_ slowly. For example, with one locale my test program took 0.20 sec to run. With two locales it took 65 seconds. The same can be observed when running with GASNET_SPAWNFN=S with the UDP conduit on two separate machines. Using the MPI conduit didn't make a difference. Here are the environment variables I'm using:
CHPL_HOME: /home/share/chapel/chapel-git
script location: /home/share/chapel/chapel-git/util
CHPL_HOST_PLATFORM: linux32
CHPL_HOST_COMPILER: gnu
CHPL_TARGET_PLATFORM: linux32
CHPL_TARGET_COMPILER: gnu
CHPL_TARGET_ARCH: unknown
CHPL_LOCALE_MODEL: flat
CHPL_COMM: gasnet
  CHPL_COMM_SUBSTRATE: udp
  CHPL_GASNET_SEGMENT: everything
CHPL_TASKS: fifo
CHPL_LAUNCHER: amudprun
CHPL_TIMERS: generic
CHPL_MEM: cstdlib
CHPL_MAKE: gmake
CHPL_ATOMICS: intrinsics
  CHPL_NETWORK_ATOMICS: none
CHPL_GMP: none
CHPL_HWLOC: none
CHPL_REGEXP: none
CHPL_WIDE_POINTERS: struct
CHPL_LLVM: none
CHPL_AUX_FILESYS: none
Any idea what could be causing this? As I said in a previous post, my target environment is a heterogeneous (all x86, though) commodity cluster with ethernet connections, so either the UDP or the MPI conduit is the one I'd use.
18.02.2015, 23:41, "Greg Titus" <[email protected]>:
Hi John --

A little bit of follow-up to what Michael says here ...
The "nemesis" he refers to is the internal name of the particular Qthreads scheduler we use when CHPL_LOCALE_MODEL=flat. Our understanding is that the nemesis scheduler currently doesn't move qthreads (and by extension, Chapel tasks) from pthread to pthread, which would break the use of pthread local storage inside the remote caching implementation. But there are significant caveats here:
  * We use a different Qthreads scheduler when
    CHPL_LOCALE_MODEL=numa, and that one definitely does move
    qthreads (thus Chapel tasks) from pthread to pthread.
  * We can't guarantee that we'll always use "nemesis" with the
    flat locale model.
  * We can't guarantee that, even if we do keep using it,
    "nemesis" will continue to not move qthreads (thus Chapel
    tasks) from pthread to pthread.
Taken together, this basically says that although we haven't observed remote caching failures with qthreads, that shouldn't be taken as evidence that it definitely does work now or will work in the future.
greg


On 2/18/2015 2:31 PM, Michael Ferguson wrote:
Hi -
One more thing about the --cache-remote feature, just to be clear and for future reference:
The remote caching depends on pthread local storage, and Chapel task movement among worker pthreads in Qthreads-based tasking could break it. So far we haven't seen this happen, but we cannot guarantee it won't. Symptoms of a failure could include silent wrong answers or segfaults, either of which could be solid or intermittent/sporadic.
I *think* that this problem won't come up with the nemesis qthreads scheduler, but we need to do some careful analysis before we can declare the --cache-remote feature safe to use with qthreads.
Cheers,

-michael


_______________________________________________
Chapel-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-users