Hi John --
Yes, the much larger number of GETs with your custom distribution does
seem likely to be meta-data traffic. You'll probably want to go back to
consulting Brad about that. I know something about the runtime but very
little about distributions, especially at the level of detail you'll need.
greg
On 2/19/2015 3:34 PM, John MacFrenz wrote:
Hi,
It's hard to say without further information, particularly about the
ratio of remote references vs. local ones, but a 15-20x slowdown
wouldn't be unusual in our experience. What is your timing technique?
Are you using the built-in Chapel timing support, or are you timing
the entire execution from outside the program, or something else? (You
may have mentioned this earlier and I didn't catch it.)
With two locales, each locale gets assigned half of the array. The comm
diagnostics in my previous post were for this case. I'm using Chapel's
Timer object for timing, with separate timings for initializing domains,
arrays, etc., and another for the actual computation loops.
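Roughly, the timing is structured like the following sketch using
Chapel's Time module (the commented regions stand in for my actual init
and compute code):

    use Time;

    var initTimer, computeTimer: Timer;

    initTimer.start();
    // ... declare and initialize the domains and arrays here ...
    initTimer.stop();

    computeTimer.start();
    // ... the actual computation loops go here ...
    computeTimer.stop();

    writeln("init time:    ", initTimer.elapsed(), " s");
    writeln("compute time: ", computeTimer.elapsed(), " s");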
As to specifics ... you can probably skip collecting the init comm
stats, since most of that is built-in module initialization that is
outside your control. In the run comm diags, 1600 of the 1604 remote
forks done by locale 1 are presumably its half of the 3200 writeln()
calls that print the results. The number of get_nb (nonblocking GET)
operations seems high for a program that is supposed to do only 1000
remote loads on each locale. I take it this was run with the remote
cache enabled? If it were disabled I would expect all of the get_nb
values to be reported as plain gets instead, since a side effect of
enabling the remote cache is to turn blocking remote refs into
nonblocking ones. You might try running with the remote cache
disabled to see how performance and comm diags results compare.
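If it helps, one way to keep the module init and the result-printing
traffic out of the counts is to bracket just the compute loops with the
CommDiagnostics calls, roughly like this sketch (the commented region is
a placeholder for the actual kernel):

    use CommDiagnostics;

    // ... module init and domain/array setup happen before this point ...

    resetCommDiagnostics();          // clear anything already counted
    startCommDiagnostics();
    // ... only the compute loops go here ...
    stopCommDiagnostics();
    writeln(getCommDiagnostics());   // prints one counter set per locale

    // ... printing the result arrays afterward is not counted ...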
Actually, 3200 values should be printed per locale (though due to a bug
the actual number is 3400), totalling 6400 (6800). I'm calling writeln
on an array of reals, not on individual values. And yes, that was with
the remote cache enabled.
I tried running comm diagnostics with the standard block dist (I notice
I've been rather unclear about whether I was using the block dist or my
own dist; sorry for that) and got results which seem to make much more
sense.
Standard block dist, with --fast --cache-remote --no-local:
Locale 0: (get = 0, get_nb = 4222, put = 0, put_nb = 0, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 0, fork_fast = 0, fork_nb = 1602)
Locale 1: (get = 0, get_nb = 810, put = 0, put_nb = 0, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 1604, fork_fast = 0, fork_nb = 0)
My custom dist, no flags passed to the compiler:
Locale 0: (get = 4232, get_nb = 0, put = 0, put_nb = 0, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 0, fork_fast = 0, fork_nb = 1602)
Locale 1: (get = 14435, get_nb = 0, put = 800, put_nb = 0, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 1604, fork_fast = 0, fork_nb = 0)
My custom dist, with --no-local --fast:
Locale 0: (get = 4232, get_nb = 0, put = 0, put_nb = 0, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 0, fork_fast = 0, fork_nb = 1602)
Locale 1: (get = 9629, get_nb = 0, put = 800, put_nb = 0, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 1604, fork_fast = 0, fork_nb = 0)
So that extra communication overhead seems to be associated with my
custom distribution. I really can't think what could be causing this,
since all data needed for indexing operations should be privatized.
Any hints on how I could find out what's causing this? I can also post
the code next week (it needs commenting before I'd like to post it...)
if someone has time to take a look at it.
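A possible way to narrow this down: besides the counters, the
CommDiagnostics module can also report each remote operation as it
happens, typically with the source location that issued it, which
should show whether the extra GETs come from the distribution's
indexing path. A minimal sketch (the commented region is a placeholder
for a small, representative piece of the computation):

    use CommDiagnostics;

    startVerboseComm();   // report each remote get/put/fork as it happens
    // ... one small, representative compute step goes here (placeholder) ...
    stopVerboseComm();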
greg
On 2/19/2015 12:06 PM, John MacFrenz wrote:
Hi,
So... I did fiddle with Chapel settings, rebooted the machines, and
since another node was running under VirtualBox I tried different
settings for it. Somehow the problem got fixed, and I'm unable to
reproduce it any more even though I set the exact same settings I had
before. Well, if I face this issue again I'll try to gather more
information about it... It might also be that rebooting helped, though
I had already tried that once. Now the execution is still about 15-20
times slower, though I've got the impression that this is what could be
expected from Chapel?
The nodes had 4 and 2 cores (8 and 4 with HT) and 6 GB and 4 GB of
memory, respectively. The program's memory usage was very small, 3
megabytes per node. In that setup I tried one locale per node, but I
also tried using a single system with 2 to 4 locales.
I also got comm diagnostics, though I'm afraid I don't have enough
knowledge to interpret the results. Here are the results for two locales:
Init comm diagnostics:
Locale 0: (get = 0, get_nb = 0, put = 0, put_nb = 0, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 12, fork_fast = 0, fork_nb = 13)
Locale 1: (get = 0, get_nb = 175, put = 0, put_nb = 8, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 17, fork_fast = 0, fork_nb = 0)
Run comm diagnostics:
Locale 0: (get = 0, get_nb = 4222, put = 0, put_nb = 0, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 0, fork_fast = 0, fork_nb = 1602)
Locale 1: (get = 0, get_nb = 7218, put = 0, put_nb = 800, test_nb = 0, wait_nb = 0, try_nb = 0, fork = 1604, fork_fast = 0, fork_nb = 0)
During the run (excluding initialization), each locale should get
about 1000 real(64) values from the other locale. On top of that, the
results are printed with writeln (with GASNET_ROUTE_OUTPUT=0), with a
total of about 3200 reals printed per locale, which is included in the
run comm diagnostics. Do those comm statistics seem reasonable for that?
19.02.2015, 18:58, "Michael Ferguson" <[email protected]
<mailto:[email protected]>>:
Hi John -
How many cores does your system have? How much memory? How much
memory is the program using? How many locales are you launching on
the single system? How many threads are you assigning to each locale?
I’d bet that your problem is either from different threads
contending over the same processor resources (which you can limit,
as Greg pointed out, with CHPL_RT_NUM_THREADS_PER_LOCALE) or from using
too much memory since you are running many locales on a single
machine – and Chapel doesn’t currently try to reduce its per-locale
resources when oversubscribed in this manner.
Of course, it could be the communication as well – you can check
that. You can also instrument your program to print out
communication counts (as I described in an earlier email – try
mirroring the use of CommDiagnostics in
chapel-lang-github/test/performance/ferguson/remote-class-read.chpl).
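A generic version of that pattern looks roughly like this (a sketch,
not a copy of that test; the commented region stands in for the code
being measured):

    use CommDiagnostics;

    startCommDiagnostics();
    // ... the code being measured goes here (placeholder) ...
    stopCommDiagnostics();

    const counts = getCommDiagnostics();
    for loc in Locales do
      writeln("locale ", loc.id, ": ", counts[loc.id]);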
You can also try running on a real cluster…
Cheers,
-michael
From: Greg Titus <[email protected]>
Date: Thursday, February 19, 2015 at 11:41 AM
To: John MacFrenz <[email protected]>
Cc: Michael Ferguson <[email protected]>,
"[email protected]" <[email protected]>
Subject: Re: Variable Block Distributions
Folks here pointed out a mistake I made in the discussion below:
the environment variable that sets the number of threads and thus
processor cores to use is CHPL_RT_NUM_THREADS_PER_LOCALE. I left
off the '_RT' part of that variable name below. It won't work if
it's not spelled right! :-)
greg
On 2/19/2015 8:45 AM, Greg Titus wrote:
Hi John --
(You already know the following as general knowledge, but I
thought I'd include it for people newer to multi-locale
programming who may be following this conversation.)
For context, with current Chapel it's normal for programs to
suffer some performance loss when moving from single-locale to
multi-locale execution. Using multiple locales offers more
opportunity for parallelism, but at the cost of reduced intra-task
performance due to network communication required for inter-locale
variable references. The effect varies across programs depending
on how much remote communication they do, but since a remote
reference can easily take 1000 times as long as a local one, it
doesn't take much to have a big effect. For example, with Chapel
1.10 on a Cray XC we don't see any drop-off in performance from 1
locale to 2 on the Stream benchmark, but our Stream doesn't do
many inter-locale memory references. But for the RA benchmark,
which does a lot of inter-locale references (in fact that's what
it's measuring), our multi-locale performance doesn't match that
on a single locale until we get up to 8-32 locales, depending on
circumstances. And the Cray XC has a very high-performance network
compared to UDP or MPI over ethernet.
That said, the >100x slowdown you're seeing seems a little high,
unless your test case is really doing a lot of remote references.
If it isn't, or at least shouldn't be, perhaps you're seeing a lot
of remote communication for internal references to meta-data,
within your distribution code? If this is the case, then turning
on remote caching could well improve matters. In fact that might
be a good test to rule this hypothesis in or out.
A secondary effect with GASNET_SPAWNFN=L could be oversubscription
of the processor cores due to running more than one Chapel locale
per compute node. To reduce the level of oversubscription you
could set CHPL_NUM_THREADS_PER_LOCALE to the number of
compute-node cores divided by the number of locales you're running
on the compute node, but don't set it to less than 2 or you could
deadlock/livelock due to internal starvation. However, if you're
seeing the same slowdown with GASNET_SPAWNFN=S and one Chapel
locale per compute node then I don't think this is something that
is afflicting you right now.
hope this helps,
greg
On 2/19/2015 4:42 AM, John MacFrenz wrote:
Hi,
I'll give --cache-remote a try later. However, for now I'm facing
some problems which definitely should be solved first...
The problem I'm having is that using GASNET_SPAWNFN=L with the
UDP conduit and more than one locale causes the program to run
_very_ slowly. For example, with one locale my test program took
0.20 sec to run. With two locales it took 65 seconds. The same
can be observed when running with GASNET_SPAWNFN=S with the UDP
conduit on two separate machines. Using the MPI conduit didn't make
a difference. Here are the environment variables I'm using:
CHPL_HOME: /home/share/chapel/chapel-git
script location: /home/share/chapel/chapel-git/util
CHPL_HOST_PLATFORM: linux32
CHPL_HOST_COMPILER: gnu
CHPL_TARGET_PLATFORM: linux32
CHPL_TARGET_COMPILER: gnu
CHPL_TARGET_ARCH: unknown
CHPL_LOCALE_MODEL: flat
CHPL_COMM: gasnet
CHPL_COMM_SUBSTRATE: udp
CHPL_GASNET_SEGMENT: everything
CHPL_TASKS: fifo
CHPL_LAUNCHER: amudprun
CHPL_TIMERS: generic
CHPL_MEM: cstdlib
CHPL_MAKE: gmake
CHPL_ATOMICS: intrinsics
CHPL_NETWORK_ATOMICS: none
CHPL_GMP: none
CHPL_HWLOC: none
CHPL_REGEXP: none
CHPL_WIDE_POINTERS: struct
CHPL_LLVM: none
CHPL_AUX_FILESYS: none
Any idea what could be causing this? As I said in a previous
post, my target environment is a heterogeneous (all x86, though)
commodity cluster with Ethernet connections, so either the UDP or the
MPI conduit is the one I'd use.
18.02.2015, 23:41, "Greg Titus" <[email protected]> <mailto:[email protected]>:
Hi John --
A little bit of follow-up to what Michael says here ...
The "nemesis" he refers to is the internal name of the
particular Qthreads scheduler we use when
CHPL_LOCALE_MODEL=flat. Our understanding is that the nemesis
scheduler currently doesn't move qthreads (and by extension,
Chapel tasks) from pthread to pthread, which would break the use
of pthread local storage inside the remote caching
implementation. But there are significant caveats here:
* We use a different Qthreads scheduler when
CHPL_LOCALE_MODEL=numa, and that one definitely does move
qthreads (thus Chapel tasks) from pthread to pthread.
* We can't guarantee that we'll always use "nemesis" with the
flat locale model.
* We can't guarantee that, even if we do keep using it,
"nemesis" will continue to not move qthreads (thus Chapel
tasks) from pthread to pthread.
Taken together, this basically says that although we haven't
observed remote caching failures with qthreads, that shouldn't
be taken as evidence that it definitely does work now or will
work in the future.
greg
On 2/18/2015 2:31 PM, Michael Ferguson wrote:
Hi -
One more thing about the --cache-remote feature, just to be
clear and for future reference:
The remote caching depends on pthread local storage, and Chapel
task movement among worker pthreads in Qthreads-based tasking
could break it. So far we haven't seen this happen, but we
cannot guarantee it won't. Symptoms of a failure could include
silent wrong answers or segfaults, either of which could be
solid or intermittent/sporadic.
I *think* that this problem won't come up with the nemesis
qthreads scheduler, but we need to do some careful analysis
before we can declare the --cache-remote feature safe to use
with qthreads.
Cheers,
-michael