Hi Pieter, I think that's a fair conclusion. Thanks for being patient and for providing the code examples and timings. On my end, I'll work on creating some mini-benchmarks to track this difference more clearly (for both the vectorAdd and 1D-convolution use cases).
As far as understanding the overhead of direct indexing in BlockDist goes, here are some issues that I'm aware of: - Overhead when the compiler is unable to infer that data is local. In such cases we introduce "wide pointers", which the back-end C compiler may not be able to optimize as effectively as a normal pointer. This is what may have been impacted by switching from gcc 4.4 to 6.2. - Overhead to check which locale "owns" an index. - Possible overhead for privatized objects. Though this isn't a satisfying conclusion, hopefully it helps to get some sense of how different patterns perform. -Ben Harshbarger On 11/17/16, 1:28 AM, "Pieter Hijma" <[email protected]> wrote: Hi Ben, Same setup, testing with GCC 6.2.0, single-locale in directory 'datapar', multi-locale in directory 'datapar-dist'. As in the 1D-convolution case, the Makefiles are the same: $ diff datapar/Makefile datapar-dist/Makefile This means that I compile both programs with: $ CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o vectoradd --fast \ vectoradd.chpl The job files are also the same: $ diff datapar/das4.job datapar-dist/das4.job The contents of das4.job (an SGE script) are basically the same as for the 1D-convolution: --- #!/bin/bash #$ -l h_rt=0:15:00 #$ -N VECTORADD #$ -cwd . ~/.bashrc SSH_SERVERS=`uniq $TMPDIR/machines | tr '\n' ' '` export CHPL_COMM=gasnet export CHPL_COMM_SUBSTRATE=ibv export CHPL_LAUNCHER=gasnetrun_ibv export CHPL_RT_NUM_THREADS_PER_LOCALE=16 export GASNET_IBV_SPAWNER=ssh export GASNET_PHYSMEM_MAX=1G export GASNET_SSH_SERVERS="$SSH_SERVERS" APP=./vectoradd ARGS=$* $APP $ARGS --- The two source files differ in the use of domain maps: $ diff datapar/vectoradd.chpl datapar-dist/vectoradd.chpl 2a3 > use BlockDist; 7c8 < const ProblemDomain : domain(1) = {0..#n}; --- > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n}; ERROR: 1 The output of 'datapar', the original single-locale version: addNoDomain n: 536870912 Time: 0.329722s GFLOPS: 1.62825 addZip n: 536870912 Time: 0.328751s GFLOPS: 1.63306 addForall n: 536870912 Time: 0.325768s GFLOPS: 1.64802 addCollective n: 536870912 Time: 0.330918s GFLOPS: 1.62237 The output of 'datapar-dist', the multi-locale version: addNoDomain n: 536870912 Time: 0.373368s GFLOPS: 1.43791 addZip n: 536870912 Time: 0.372561s GFLOPS: 1.44103 addForall n: 536870912 Time: 2.66822s GFLOPS: 0.201209 addCollective n: 536870912 Time: 0.36856s GFLOPS: 1.45667 I guess the conclusion is that in this case too, the use of BlockDist has an effect: minor overall, but major when indexing directly. Kind regards, Pieter Hijma On 16/11/16 20:32, Ben Harshbarger wrote: > Hi Pieter, > > Do you still see a problem with vectorAdd.chpl? I think 1D-convolution has a somewhat separate performance issue due to accessing with arbitrary indices (more overhead because we don't know if the index is local). If vectorAdd isn't performing well, then that could hurt 1D-convolution too. > > -Ben Harshbarger > > On 11/16/16, 3:33 AM, "Pieter Hijma" <[email protected]> wrote: > > Hi Ben, > > Good suggestion, I'm going to test with the 1D-convolution program and I > use the Chapel version compiled with GCC 6.2.0. The single-locale > version is in directory 'datapar' and the multi-locale version is in > directory 'datapar-dist'.
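A minimal sketch (not part of the original exchange) of the two access patterns behind these numbers, and of one possible way around the per-index ownership check described above; it assumes that Block-distributed arrays in this Chapel version provide localSubdomain() and localAccess:

```
use BlockDist;

config const n = 1024**2;
const D = {0..#n} dmapped Block(boundingBox = {0..#n});
var a, b, c : [D] real(32);

// Direct indexing: each c[i]/a[i]/b[i] may go through the domain map's
// "which locale owns index i?" lookup -- the overhead discussed above.
forall i in D do
  c[i] = a[i] + b[i];

// Hypothetical workaround: each locale walks only the indices it owns and
// uses localAccess, which skips the ownership lookup.
coforall loc in Locales do on loc {
  forall i in c.localSubdomain() do
    c.localAccess[i] = a.localAccess[i] + b.localAccess[i];
}
```

Whether the second form actually recovers the single-locale numbers is exactly the kind of thing the mini-benchmarks could check.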
> > The Makefiles are now the same: > > $ diff datapar/Makefile datapar-dist/Makefile > > This means that I compile both programs with: > > $ CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o 1D-convolution \ > --fast 1D-convolution.chpl > > The job files are also the same: > > $ diff datapar/das4.job datapar-dist/das4.job > > The contents of das4.job (an SGE script): > > ----- > #!/bin/bash > #$ -l h_rt=0:15:00 > #$ -N CONVOLUTION_1D > #$ -cwd > > . ~/.bashrc > > SSH_SERVERS=`uniq $TMPDIR/machines | tr '\n' ' '` > > export CHPL_COMM=gasnet > export CHPL_COMM_SUBSTRATE=ibv > export CHPL_LAUNCHER=gasnetrun_ibv > export CHPL_RT_NUM_THREADS_PER_LOCALE=16 > > export GASNET_IBV_SPAWNER=ssh > export GASNET_PHYSMEM_MAX=1G > export GASNET_SSH_SERVERS="$SSH_SERVERS" > > APP=./1D-convolution > ARGS=$* > > $APP $ARGS > ------ > > The difference between the two source files is now basically that the > single-locale version has only the default domain map, whereas the > multi-locale version has the BlockDist domain map: > > $ diff datapar/1D-convolution.chpl datapar-dist/1D-convolution.chpl > 2a3 > > use BlockDist; > 6c7 > < const ProblemDomain : domain(1) = {0..#n}; > --- > > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) > = {0..#n}; > ERROR: 1 > > > The output of 'datapar', the original single-locale version without > BlockDist: > > convolveIndices, n: 536870912 > Time: 0.319077s > GFLOPS: 5.04772 > > convolveZip, n: 536870912 > Time: 0.320788s > GFLOPS: 5.0208 > > > The output of 'datapar-dist', the original multi-locale version with > BlockDist: > > convolveIndices, n: 536870912 > Time: 3.1422s > GFLOPS: 0.512575 > > convolveZip, n: 536870912 > Time: 3.54989s > GFLOPS: 0.453708 > > > I guess we can conclude that the mere addition of the BlockDist domain > map to the ProblemDomain results in a factor of 10 slowdown. > > Kind regards, > > Pieter Hijma > > On 14/11/16 20:21, Ben Harshbarger wrote: > > Hi Pieter, > > > > My next suggestion would be to try compiling and running the "single locale" variation with the same environment variables that you use for multilocale. I'm wondering if the use of IBV is impacting performance in some way. I don't see the performance issue on our internal ibv cluster, but it's worth checking. > > > > -Ben Harshbarger > > > > On 11/8/16, 12:25 PM, "Pieter Hijma" <[email protected]> wrote: > > > > Hi Ben, > > > > Thanks for your help. > > > > On 07/11/16 18:59, Ben Harshbarger wrote: > > > When CHPL_COMM is set to 'none', our compiler can avoid introducing some overhead that is necessary for multi-locale programs. You can force this overhead when CHPL_COMM == none by compiling with the flag "--no-local". If you compile your single-locale program with that flag, does the performance get worse?
> > > > It makes some difference, but not much: > > > > chpl -o vectoradd --fast vectoradd.chpl > > > > addNoDomain n: 1073741824 > > Time: 0.57211s > > GFLOPS: 1.87681 > > > > addZip n: 1073741824 > > Time: 0.571799s > > GFLOPS: 1.87783 > > > > addForall n: 1073741824 > > Time: 0.571623s > > GFLOPS: 1.87841 > > > > addCollective n: 1073741824 > > Time: 0.571395s > > GFLOPS: 1.87916 > > > > > > chpl -o vectoradd --fast --no-local vectoradd.chpl > > > > addNoDomain n: 1073741824 > > Time: 0.62087s > > GFLOPS: 1.72941 > > > > addZip n: 1073741824 > > Time: 0.619997s > > GFLOPS: 1.73185 > > > > addForall n: 1073741824 > > Time: 0.620645s > > GFLOPS: 1.73004 > > > > addCollective n: 1073741824 > > Time: 0.620254s > > GFLOPS: 1.73113 > > > > > > > If that's the case, I'm not entirely sure what the next step would be. Do you have access to a newer version of GCC? The backend C compiler can matter when it comes to optimizing the multi-locale overhead. > > > > It is indeed an old one. We also have GCC 4.9.0, Intel 13.3, and I > > compiled GCC 6.2.0 to check: > > > > * intel/compiler/64/13.3/2013.3.163 > > > > I basically see the same behavior: > > > > single locale: > > > > addNoDomain n: 536870912 > > Time: 0.285186s > > GFLOPS: 1.88253 > > > > addZip n: 536870912 > > Time: 0.284819s > > GFLOPS: 1.88495 > > > > addForall n: 536870912 > > Time: 0.287904s > > GFLOPS: 1.86476 > > > > addCollective n: 536870912 > > Time: 0.284912s > > GFLOPS: 1.88434 > > > > multi-locale, one node: > > > > addNoDomain n: 536870912 > > Time: 3.24471s > > GFLOPS: 0.16546 > > > > addZip n: 536870912 > > Time: 3.01287s > > GFLOPS: 0.178192 > > > > addForall n: 536870912 > > Time: 7.23895s > > GFLOPS: 0.0741642 > > > > addCollective n: 536870912 > > Time: 2.59501s > > GFLOPS: 0.206886 > > > > > > * GCC 4.9.0 > > > > This is encouraging, the performance improves, a factor two of the > > single-locale, except for the explicit indices in the forall: > > > > single locale: > > > > addNoDomain n: 536870912 > > Time: 0.277222s > > GFLOPS: 1.93661 > > > > addZip n: 536870912 > > Time: 0.27566s > > GFLOPS: 1.94758 > > > > addForall n: 536870912 > > Time: 0.27609s > > GFLOPS: 1.94455 > > > > addCollective n: 536870912 > > Time: 0.275303s > > GFLOPS: 1.95011 > > > > multi-locale, single node: > > > > addNoDomain n: 536870912 > > Time: 0.492954s > > GFLOPS: 1.08909 > > > > addZip n: 536870912 > > Time: 0.493039s > > GFLOPS: 1.0889 > > > > addForall n: 536870912 > > Time: 2.85323s > > GFLOPS: 0.188162 > > > > addCollective n: 536870912 > > Time: 0.492135s > > GFLOPS: 1.0909 > > > > > > * GCC 6.2.0 > > > > The performance on multi-locale is now even better. Still very low for > > explicit indices in the forall. 
> > > > single locale: > > > > addNoDomain n: 536870912 > > Time: 0.283272s > > GFLOPS: 1.89525 > > > > addZip n: 536870912 > > Time: 0.281942s > > GFLOPS: 1.90419 > > > > addForall n: 536870912 > > Time: 0.282291s > > GFLOPS: 1.90184 > > > > addCollective n: 536870912 > > Time: 0.281629s > > GFLOPS: 1.90631 > > > > Multi-locale, single node: > > > > addNoDomain n: 536870912 > > Time: 0.358012s > > GFLOPS: 1.49959 > > > > addZip n: 536870912 > > Time: 0.356696s > > GFLOPS: 1.50512 > > > > addForall n: 536870912 > > Time: 2.92173s > > GFLOPS: 0.183751 > > > > addCollective n: 536870912 > > Time: 0.343808s > > GFLOPS: 1.56154 > > > > > > Since this is encouraging, I also verified the performance of the > > 1D-stencils: > > > > * GCC 4.4.7 > > > > For reference, the old compiler that I used initially: > > > > single locale: > > > > convolveIndices, n: 536870912 > > Time: 0.82361s > > GFLOPS: 1.95555 > > > > convolveZip, n: 536870912 > > Time: 0.810028s > > GFLOPS: 1.98834 > > > > multi-locale, one node: > > > > convolveIndices, n: 536870912 > > Time: 4.25951s > > GFLOPS: 0.378122 > > > > convolveZip, n: 536870912 > > Time: 4.88046s > > GFLOPS: 0.330012 > > > > * intel/compiler/64/13.3/2013.3.163 > > > > With this compiler the single-node performance is better than with the previous > > compiler. However, the multi-locale, one-node performance is about a > > factor of 3 slower than with the previous compiler. > > > > single locale: > > > > convolveIndices, n: 536870912 > > Time: 0.554139s > > GFLOPS: 2.90651 > > > > convolveZip, n: 536870912 > > Time: 0.556653s > > GFLOPS: 2.89339 > > > > multi-locale, one node: > > > > convolveIndices, n: 536870912 > > Time: 10.5368s > > GFLOPS: 0.152856 > > > > convolveZip, n: 536870912 > > Time: 12.7625s > > GFLOPS: 0.126198 > > > > * GCC 4.9.0 > > > > The performance of single locale is much better than with GCC 4.4.7; however, it is > > still poor for the multi-locale, one-node configuration, although a bit > > better. > > > > single locale: > > > > convolveIndices, n: 536870912 > > Time: 0.207055s > > GFLOPS: 7.77867 > > > > convolveZip, n: 536870912 > > Time: 0.206783s > > GFLOPS: 7.7889 > > > > multi-locale, one node: > > > > convolveIndices, n: 536870912 > > Time: 3.20851s > > GFLOPS: 0.501981 > > > > convolveZip, n: 536870912 > > Time: 3.652s > > GFLOPS: 0.441023 > > > > * GCC 6.2.0 > > > > Strangely enough, the single-locale performance is a bit lower than with > > the previous compiler, while the multi-locale, one-node performance is about the same. > > > > single-locale: > > > > convolveIndices, n: 536870912 > > Time: 0.263151s > > GFLOPS: 6.12049 > > > > convolveZip, n: 536870912 > > Time: 0.262234s > > GFLOPS: 6.14189 > > > > multi-locale, one node: > > > > convolveIndices, n: 536870912 > > Time: 3.12716s > > GFLOPS: 0.515039 > > > > convolveZip, n: 536870912 > > Time: 3.58663s > > GFLOPS: 0.44906 > > > > The conclusion is that the compiler indeed has a large impact on > > multi-locale performance, but probably only in simple cases such as > > vector addition. With the stencil code, although it is not very > > complicated, the performance falls back into the pattern that I came > > across originally. > > > > However, perhaps this gives you an idea of the optimizations that impact > > the performance? If we can't find a solution, I would at least like to > > understand the lack of performance. > > > > I also checked the performance of the stencils by not using the > > StencilDist but just the BlockDist and it makes no difference.
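For reference, a sketch (reconstructed from the sources quoted later in the thread, not code taken from this message) of the two declarations compared in that last StencilDist-versus-BlockDist check; only the distribution and the Stencil-specific fluff argument differ:

```
use BlockDist, StencilDist;

config const n = 1024**3/2;

// Stencil distribution with one element of ghost padding ("fluff") per side.
const StencilDom = {0..#n} dmapped Stencil(boundingBox = {0..#n}, fluff = (1,));

// Plain Block distribution, no ghost padding.
const BlockDom = {0..#n} dmapped Block(boundingBox = {0..#n});
```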
> > > > > You may also want to consider setting CHPL_TARGET_ARCH to something else if you're compiling on a machine architecture different from the compute nodes. There's more information about CHPL_TARGET_ARCH here: > > > > > > http://chapel.cray.com/docs/latest/usingchapel/chplenv.html#chpl-target-arch > > > > The head-node and compute-nodes are all Intel Xeon Westmere's, so I > > don't think that makes a difference. To be absolutely sure, I also > > compiled Chapel and the applications on a compute node and indeed, the > > performance is comparable to all measurements above. > > > > Kind regards, > > > > Pieter Hijma > > > > > > > On 11/7/16, 2:16 AM, "Pieter Hijma" <[email protected]> wrote: > > > > > > Dear Ben, > > > > > > Sorry for my late reactions. Unfortunately, for some reason, these > > > emails are marked as spam even though I marked the list and your address > > > as safe. I will make sure I check my spam folders meticulously from now on. > > > > > > On 28/10/16 23:34, Ben Harshbarger wrote: > > > > Hi Pieter, > > > > > > > > Sorry that you're still having issues. I think we'll need some more information before going forward: > > > > > > > > 1) Could you send us the output of "$CHPL_HOME/util/printchplenv --anonymize" ? It's a script that displays the various CHPL_ environment variables. "--anonymize" strips the output of information you may prefer to keep private (machine info, paths). > > > > > > This would be the setup if running single-locale programs: > > > > > > $ printchplenv --anonymize > > > CHPL_TARGET_PLATFORM: linux64 > > > CHPL_TARGET_COMPILER: gnu > > > CHPL_TARGET_ARCH: native * > > > CHPL_LOCALE_MODEL: flat > > > CHPL_COMM: none > > > CHPL_TASKS: qthreads > > > CHPL_LAUNCHER: none > > > CHPL_TIMERS: generic > > > CHPL_UNWIND: none > > > CHPL_MEM: jemalloc > > > CHPL_MAKE: gmake > > > CHPL_ATOMICS: intrinsics > > > CHPL_GMP: gmp > > > CHPL_HWLOC: hwloc > > > CHPL_REGEXP: re2 > > > CHPL_WIDE_POINTERS: struct > > > CHPL_AUX_FILESYS: none > > > > > > When I run multi-locale programs, I set the following environment variables: > > > > > > export CHPL_COMM=gasnet > > > export CHPL_COMM_SUBSTRATE=ibv > > > > > > Then the Chapel environment would be: > > > > > > $ printchplenv --anonymize > > > CHPL_TARGET_PLATFORM: linux64 > > > CHPL_TARGET_COMPILER: gnu > > > CHPL_TARGET_ARCH: native * > > > CHPL_LOCALE_MODEL: flat > > > CHPL_COMM: gasnet * > > > CHPL_COMM_SUBSTRATE: ibv * > > > CHPL_GASNET_SEGMENT: large > > > CHPL_TASKS: qthreads > > > CHPL_LAUNCHER: gasnetrun_ibv > > > CHPL_TIMERS: generic > > > CHPL_UNWIND: none > > > CHPL_MEM: jemalloc > > > CHPL_MAKE: gmake > > > CHPL_ATOMICS: intrinsics > > > CHPL_NETWORK_ATOMICS: none > > > CHPL_GMP: gmp > > > CHPL_HWLOC: hwloc > > > CHPL_REGEXP: re2 > > > CHPL_WIDE_POINTERS: struct > > > CHPL_AUX_FILESYS: none > > > > > > > > > > 2) What C compiler are you using? > > > > > > $ gcc --version > > > gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16) > > > Copyright (C) 2010 Free Software Foundation, Inc. > > > This is free software; see the source for copying conditions. There is NO > > > warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. > > > > > > > 3) Are you sure that the programs are being launched correctly? This might seem silly, but it's worth double-checking that the programs are actually running on the same hardware (not necessarily the same node though). 
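On question 3, a small hypothetical check (not something used in the thread) that prints where each locale actually runs can help rule out launch problems:

```
// Report, from every locale, the host it runs on and how many parallel
// tasks it will use; useful for verifying job placement on the cluster.
coforall loc in Locales do on loc do
  writeln("locale ", here.id, " of ", numLocales, " on host ", here.name,
          " with maxTaskPar = ", here.maxTaskPar);
```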
> > > > > > I am completely certain that the single-locale program, the multi-locale > > > program for one node, and the multi-locale for multiple nodes are > > > running on the compute nodes. I'm not completely sure what you mean by > > > "the same hardware". All compute nodes have the same hardware if that > > > is what you mean. > > > > > > > I'd also like to clarify what you mean by "multi-locale compiled". Is the difference between the programs just the use of the Block domain map, or do you compile with different environment variables set? > > > > > > I compile different programs and I use different environment variables: > > > > > > The single-locale version vectoradd is located in the datapar directory, > > > whereas the multi-locale version is located in the datapar-dist > > > directory. What follows is the diff for the .chpl file: > > > > > > $ diff datapar/vectoradd.chpl datapar-dist/vectoradd.chpl > > > 8c8 > > > < const ProblemDomain : domain(1) = {0..#n}; > > > --- > > > > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) > > > = {0..#n}; > > > > > > The diff for the Makefile: > > > > > > $ diff datapar/Makefile datapar-dist/Makefile > > > 2a3 > > > > DIST_FLAGS = CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv > > > 8c9 > > > < $(CHPL) -o $@ $(FLAGS) $< > > > --- > > > > $(DIST_FLAGS) $(CHPL) -o $@ $(FLAGS) $< > > > 11c12 > > > < rm -f $(APP) > > > --- > > > > rm -f $(APP) $(APP)_real > > > > > > Thanks for your help, and again my apologies for the delayed answers. > > > > > > Kind regards, > > > > > > Pieter Hijma > > > > > > > > > > > -Ben Harshbarger > > > > > > > > On 10/27/16, 5:19 AM, "Pieter Hijma" <[email protected]> wrote: > > > > > > > > Hi Ben, > > > > > > > > Thank you for your fast reply and suggestions! I did some more tests > > > > and also included stencil operations. 
> > > > > > > > First, the vector addition: > > > > vectoradd.chpl > > > > -------------- > > > > use Time; > > > > use Random; > > > > use BlockDist; > > > > //use VisualDebug; > > > > > > > > config const n = 1024**3/2; > > > > > > > > // for multi-locale > > > > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) > > > > = {0..#n}; > > > > // for single-locale > > > > // const ProblemDomain : domain(1) = {0..#n}; > > > > > > > > type float = real(32); > > > > > > > > proc addNoDomain(c : [] float, a : [] float, b : [] float) { > > > > forall (ci, ai, bi) in zip(c, a, b) { > > > > ci = ai + bi; > > > > } > > > > } > > > > > > > > proc addZip(c : [ProblemDomain] float, a : [ProblemDomain] float, > > > > b : [ProblemDomain] float) { > > > > forall (ci, ai, bi) in zip(c, a, b) { > > > > ci = ai + bi; > > > > } > > > > } > > > > > > > > proc addForall(c : [ProblemDomain] float, a : [ProblemDomain] float, > > > > b : [ProblemDomain] float) { > > > > //startVdebug("vdata"); > > > > forall i in ProblemDomain { > > > > c[i] = a[i] + b[i]; > > > > } > > > > //stopVdebug(); > > > > } > > > > > > > > proc addCollective(c : [ProblemDomain] float, a : [ProblemDomain] float, > > > > b : [ProblemDomain] float) { > > > > c = a + b; > > > > } > > > > > > > > proc output(t : Timer, n, testName) { > > > > t.stop(); > > > > writeln(testName, " n: ", n); > > > > writeln("Time: ", t.elapsed(), "s"); > > > > writeln("GFLOPS: ", n / t.elapsed() / 1e9, ""); > > > > writeln(); > > > > t.clear(); > > > > } > > > > > > > > proc main() { > > > > var c : [ProblemDomain] float; > > > > var a : [ProblemDomain] float; > > > > var b : [ProblemDomain] float; > > > > var t : Timer; > > > > > > > > fillRandom(a, 0); > > > > fillRandom(b, 42); > > > > > > > > t.start(); > > > > addNoDomain(c, a, b); > > > > output(t, n, "addNoDomain"); > > > > > > > > t.start(); > > > > addZip(c, a, b); > > > > output(t, n, "addZip"); > > > > > > > > t.start(); > > > > addForall(c, a, b); > > > > output(t, n, "addForall"); > > > > > > > > t.start(); > > > > addCollective(c, a, b); > > > > output(t, n, "addCollective"); > > > > } > > > > ----- > > > > > > > > On a single locale I get as output: > > > > > > > > addNoDomain n: 536870912 > > > > Time: 0.27961s > > > > GFLOPS: 1.92007 > > > > > > > > addZip n: 536870912 > > > > Time: 0.278657s > > > > GFLOPS: 1.92664 > > > > > > > > addForall n: 536870912 > > > > Time: 0.278015s > > > > GFLOPS: 1.93109 > > > > > > > > addCollective n: 536870912 > > > > Time: 0.278379s > > > > GFLOPS: 1.92856 > > > > > > > > On multi-locale (-nl 1) I get as output: > > > > > > > > addNoDomain n: 536870912 > > > > Time: 2.16806s > > > > GFLOPS: 0.247627 > > > > > > > > addZip n: 536870912 > > > > Time: 2.17024s > > > > GFLOPS: 0.247378 > > > > > > > > addForall n: 536870912 > > > > Time: 4.78443s > > > > GFLOPS: 0.112212 > > > > > > > > addCollective n: 536870912 > > > > Time: 2.19838s > > > > GFLOPS: 0.244212 > > > > > > > > So, indeed, your suggestion improves it by more than a factor of two, but > > > > it is still close to a factor of 8 slower than single-locale. > > > > > > > > I also used chplvis and verified that there are no gets and puts when > > > > running multi-locale with more than one node.
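A lighter-weight alternative to chplvis for that kind of check (an addition here, not something used in the thread) is Chapel's CommDiagnostics module, assuming it is available in this version; the snippet below would be dropped into main() of the vectoradd.chpl quoted above:

```
use CommDiagnostics;

// Count the gets/puts (and other communication ops) issued while one
// kernel runs; getCommDiagnostics() returns one record per locale.
startCommDiagnostics();
addForall(c, a, b);
stopCommDiagnostics();
writeln(getCommDiagnostics());
```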
The profiling information > > > > is clear, but not very helpful (to me): > > > > > > > > multi-locale (-nl 1): > > > > > > > > | 65.3451 | wrapcoforall_fn_chpl5 | vectoradd.chpl:26 | > > > > | 4.8777 | wrapon_fn_chpl35 | vectoradd.chpl:26 | > > > > > > > > single-locale: > > > > > > > > | 5.0019 | wrapcoforall_fn_chpl5 | vectoradd.chpl:26 | > > > > > > > > > > > > > > > > For stencil operations, I used the following program: > > > > > > > > 1d-convolution.chpl > > > > ------------------- > > > > use Time; > > > > use Random; > > > > use StencilDist; > > > > > > > > config const n = 1024**3/2; > > > > > > > > const ProblemDomain : domain(1) dmapped Stencil(boundingBox = {0..#n}, > > > > fluff = (1,)) > > > > = {0..#n}; > > > > const InnerDomain : subdomain(ProblemDomain) = {1..n-2}; > > > > > > > > proc convolveIndices(output : [ProblemDomain] real(32), > > > > input : [ProblemDomain] real(32)) { > > > > forall i in InnerDomain { > > > > output[i] = ((input[i-1] + input[i] + input[i+1])/3:real(32)); > > > > } > > > > } > > > > > > > > proc convolveZip(output : [ProblemDomain] real(32), > > > > input : [ProblemDomain] real(32)) { > > > > forall (im1, i, ip1) in zip(InnerDomain.translate(-1), > > > > InnerDomain, > > > > InnerDomain.translate(1)) { > > > > output[i] = ((input[im1] + input[i] + input[ip1])/3:real(32)); > > > > } > > > > } > > > > > > > > proc print(t : Timer, n, s) { > > > > t.stop(); > > > > writeln(s, ", n: ", n); > > > > writeln("Time: ", t.elapsed(), "s"); > > > > writeln("GFLOPS: ", 3 * n / 1e9 / t.elapsed()); > > > > writeln(); > > > > t.clear(); > > > > } > > > > > > > > proc main() { > > > > var input : [ProblemDomain] real(32); > > > > var output : [ProblemDomain] real(32); > > > > var t : Timer; > > > > > > > > fillRandom(input, 42); > > > > > > > > t.start(); > > > > convolveIndices(output, input); > > > > print(t, n, "convolveIndices"); > > > > > > > > t.start(); > > > > convolveZip(output, input); > > > > print(t, n, "convolveZip"); > > > > } > > > > ------ > > > > > > > > Interestingly, in contrast to your earlier suggestion, the direct > > > > indexing works a bit better in this program than the zipped version: > > > > > > > > Multi-locale (-nl 1): > > > > > > > > convolveIndices, n: 536870912 > > > > Time: 4.27148s > > > > GFLOPS: 0.377062 > > > > > > > > convolveZip, n: 536870912 > > > > Time: 4.87291s > > > > GFLOPS: 0.330524 > > > > > > > > Single-locale: > > > > > > > > convolveIndices, n: 536870912 > > > > Time: 0.548804s > > > > GFLOPS: 2.93477 > > > > > > > > convolveZip, n: 536870912 > > > > Time: 0.538754s > > > > GFLOPS: 2.98951 > > > > > > > > > > > > Again, the multi-locale is about a factor 8 slower than single-locale. > > > > By the way, the Stencil distribution is a bit faster than the Block > > > > distribution. > > > > > > > > Thanks in advance for your input, > > > > > > > > Pieter > > > > > > > > > > > > > > > > On 24/10/16 19:20, Ben Harshbarger wrote: > > > > > Hi Pieter, > > > > > > > > > > Thanks for providing the example, that's very helpful. > > > > > > > > > > Multi-locale performance in Chapel is not yet where we'd like it to be, but we've done a lot of work over the past few releases to get cases like yours performing well. 
It's surprising that using Block results in that much of a difference, but I think you would see better performance by iterating over the arrays directly: > > > > > > > > > > ``` > > > > > // replace the loop in the 'add' function with this: > > > > > forall (ci, ai, bi) in zip(c, a, b) { > > > > > ci = ai + bi; > > > > > } > > > > > ``` > > > > > > > > > > Block-distributed arrays can leverage the fast-follower optimization to perform better when all arrays being iterated over share the same domain. You can also write that loop in a cleaner way by leveraging array promotion: > > > > > > > > > > ``` > > > > > // This is equivalent to the first loop > > > > > c = a + b; > > > > > ``` > > > > > > > > > > However, when I tried the promoted variation on my machine I observed worse performance than the explicit forall-loop. It seems to be related to the way the arguments of the 'add' function are declared. If you replace "[ProblemDomain] float" with "[] float", performance seems to improve. That surprised a couple of us on the development team, and I'll be looking at that some more today. > > > > > > > > > > If you're still seeing significantly worse performance with Block compared to the default rectangular domain, and the programs are launched in the same way, that would be odd. You could try profiling using chplvis. I agree though that there shouldn't be any communication in this program. You can find more information on chplvis here in the online 1.14 release documentation: > > > > > > > > > > http://chapel.cray.com/docs/latest/tools/chplvis/chplvis.html > > > > > > > > > > I hope that rewriting the loops solves the problem, but let us know if it doesn't and we can continue investigating. > > > > > > > > > > -Ben Harshbarger > > > > > > > > > > On 10/24/16, 6:19 AM, "Pieter Hijma" <[email protected]> wrote: > > > > > > > > > > Dear all, > > > > > > > > > > My apologies if this has already been asked before. I'm new to the list > > > > > and couldn't find it in the archives. > > > > > > > > > > I experience bad performance when running the multi-locale compiled > > > > > version on an InfiniBand-equipped cluster > > > > > (http://cs.vu.nl/das4/clusters.shtml, VU-site), even with only one node.
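As a side note, Ben's promotion suggestion could in principle also be applied to the 1D-convolution by operating on translated slices; this is an untested sketch (not from the thread) rather than a recommendation:

```
use StencilDist;

config const n = 1024**3/2;
const ProblemDomain = {0..#n} dmapped Stencil(boundingBox = {0..#n}, fluff = (1,));
const Inner : subdomain(ProblemDomain) = {1..n-2};

var input, output : [ProblemDomain] real(32);

// Promoted form of the stencil: the three operands are slices of 'input'
// selected via translated domains and combined elementwise.
output[Inner] = (input[Inner.translate(-1)] + input[Inner]
                 + input[Inner.translate(1)]) / 3:real(32);
```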
> > > > > Below you find a minimal example that exhibits the same performance > > > > > problems as all my programs: > > > > > > > > > > I compiled chapel-1.14.0 with the following steps: > > > > > > > > > > export CHPL_TARGET_ARCH=native > > > > > make -j > > > > > export CHPL_COMM=gasnet > > > > > export CHPL_COMM_SUBSTRATE=ibv > > > > > make clean > > > > > make -j > > > > > > > > > > I compile the following Chapel code: > > > > > > > > > > vectoradd.chpl: > > > > > --------------- > > > > > use Time; > > > > > use Random; > > > > > use BlockDist; > > > > > > > > > > config const n = 1024**3; > > > > > > > > > > // for single-locale > > > > > // const ProblemDomain : domain(1) = {0..#n}; > > > > > // for multi-locale > > > > > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = > > > > > {0..#n}; > > > > > > > > > > type float = real(32); > > > > > > > > > > proc add(c : [ProblemDomain] float, a : [ProblemDomain] float, > > > > > b : [ProblemDomain] float) { > > > > > forall i in ProblemDomain { > > > > > c[i] = a[i] + b[i]; > > > > > } > > > > > } > > > > > > > > > > proc main() { > > > > > var c : [ProblemDomain] float; > > > > > var a : [ProblemDomain] float; > > > > > var b : [ProblemDomain] float; > > > > > var t : Timer; > > > > > > > > > > fillRandom(a, 0); > > > > > fillRandom(b, 42); > > > > > > > > > > t.start(); > > > > > add(c, a, b); > > > > > t.stop(); > > > > > > > > > > writeln("n: ", n); > > > > > writeln("Time: ", t.elapsed(), "s"); > > > > > writeln("GFLOPS: ", n / t.elapsed() / 1e9, "s"); > > > > > } > > > > > ---- > > > > > > > > > > I compile this for single-locale (using no domain maps, see the > > > > > comment above in the source) with: > > > > > > > > > > chpl -o vectoradd --fast vectoradd.chpl > > > > > > > > > > I run it with (dual quad core with 2 hardware threads): > > > > > > > > > > export CHPL_RT_NUM_THREADS_PER_LOCALE=16 > > > > > ./vectoradd > > > > > > > > > > And get as output: > > > > > > > > > > n: 1073741824 > > > > > Time: 0.558806s > > > > > GFLOPS: 1.92149s > > > > > > > > > > However, the performance for multi-locale is much worse: > > > > > > > > > > I compile this for multi-locale (with domain maps, see the comment in the > > > > > source): > > > > > > > > > > CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o vectoradd --fast \ > > > > > vectoradd.chpl > > > > > > > > > > I run it on the same type of node with: > > > > > > > > > > SSH_SERVERS=`uniq $TMPDIR/machines | tr '\n' ' '` > > > > > > > > > > export GASNET_PHYSMEM_MAX=1G > > > > > export GASNET_IBV_SPAWNER=ssh > > > > > export GASNET_SSH_SERVERS="$SSH_SERVERS" > > > > > > > > > > export CHPL_RT_NUM_THREADS_PER_LOCALE=16 > > > > > export CHPL_LAUNCHER=gasnetrun_ibv > > > > > export CHPL_COMM=gasnet > > > > > export CHPL_COMM_SUBSTRATE=ibv > > > > > > > > > > ./vectoradd -nl 1 > > > > > > > > > > And get as output: > > > > > > > > > > n: 1073741824 > > > > > Time: 8.65082s > > > > > GFLOPS: 0.12412s > > > > > > > > > > I would understand a performance difference of say 10% because of > > > > > multi-locale execution, but not factors. Is this to be expected from > > > > > the current state of Chapel? This performance difference is representative > > > > > of basically all my programs, which are also more realistic and use > > > > > larger inputs. The performance is strange as there is no communication > > > > > necessary (only one node) and the program is using the same number of > > > > > threads.
> > > > > > > > > > Is there any way for me to investigate this using profiling for example? > > > > > > > > > > By the way, the program does scale well to multiple nodes (which is not > > > > > difficult given the baseline): > > > > > > > > > > 1 | 8.65s > > > > > 2 | 2.67s > > > > > 4 | 1.69s > > > > > 8 | 0.87s > > > > > 16 | 0.41s > > > > > > > > > > Thanks in advance for your input. > > > > > > > > > > Kind regards, > > > > > > > > > > Pieter Hijma
