Hi Ben,

No, thank you for your help and explanations! I would certainly be
interested in learning about your results.
Kind regards,

Pieter

On 17/11/16 18:25, Ben Harshbarger wrote:
> Hi Pieter,
>
> I think that's a fair conclusion. Thanks for being patient and providing
> the code examples and timings. On my end, I'll try and work on creating
> some mini-benchmarks to track this difference more clearly (for both the
> vectorAdd and 1D-convolution use-cases).
>
> As far as understanding overhead for direct-indexing in BlockDist, here
> are some issues that I'm aware of:
> - Overhead when the compiler is unable to infer that data is local. In
>   such cases we introduce "wide pointers", which the back-end C compiler
>   may not be able to optimize as effectively as a normal pointer. This is
>   what may have been impacted by switching from gcc 4.4 to 6.2.
> - Overhead to check which locale "owns" an index
> - Possible overhead for privatized objects
>
> Though this isn't a satisfying conclusion, hopefully it helps to get some
> sense of how different patterns perform.
>
> -Ben Harshbarger
>
> On 11/17/16, 1:28 AM, "Pieter Hijma" <[email protected]> wrote:
>
> Hi Ben,
>
> Same setup, testing with GCC 6.2.0, single-locale in directory 'datapar',
> multi-locale in directory 'datapar-dist'.
>
> Similarly to the 1D-convolution case, the Makefiles are the same:
>
> $ diff datapar/Makefile datapar-dist/Makefile
>
> This means that I compile both programs with:
>
> $ CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o vectoradd --fast \
>     vectoradd.chpl
>
> The job files are also the same:
>
> $ diff datapar/das4.job datapar-dist/das4.job
>
> The contents of das4.job (an SGE script) are basically the same as for
> the 1D-convolution:
>
> ---
> #!/bin/bash
> #$ -l h_rt=0:15:00
> #$ -N VECTORADD
> #$ -cwd
>
> . ~/.bashrc
>
> SSH_SERVERS=`uniq $TMPDIR/machines | tr '\n' ' '`
>
> export CHPL_COMM=gasnet
> export CHPL_COMM_SUBSTRATE=ibv
> export CHPL_LAUNCHER=gasnetrun_ibv
> export CHPL_RT_NUM_THREADS_PER_LOCALE=16
>
> export GASNET_IBV_SPAWNER=ssh
> export GASNET_PHYSMEM_MAX=1G
> export GASNET_SSH_SERVERS="$SSH_SERVERS"
>
> APP=./vectoradd
> ARGS=$*
>
> $APP $ARGS
> ---
>
> The two source files differ in the use of domain-maps:
>
> $ diff datapar/vectoradd.chpl datapar-dist/vectoradd.chpl
> 2a3
> > use BlockDist;
> 7c8
> < const ProblemDomain : domain(1) = {0..#n};
> ---
> > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
> ERROR: 1
>
> The output of 'datapar', originally the single-locale version:
>
> addNoDomain n: 536870912
> Time: 0.329722s
> GFLOPS: 1.62825
>
> addZip n: 536870912
> Time: 0.328751s
> GFLOPS: 1.63306
>
> addForall n: 536870912
> Time: 0.325768s
> GFLOPS: 1.64802
>
> addCollective n: 536870912
> Time: 0.330918s
> GFLOPS: 1.62237
>
> The output of 'datapar-dist', the multi-locale version:
>
> addNoDomain n: 536870912
> Time: 0.373368s
> GFLOPS: 1.43791
>
> addZip n: 536870912
> Time: 0.372561s
> GFLOPS: 1.44103
>
> addForall n: 536870912
> Time: 2.66822s
> GFLOPS: 0.201209
>
> addCollective n: 536870912
> Time: 0.36856s
> GFLOPS: 1.45667
>
> I guess the conclusion is that in this case too, the use of BlockDist has
> an effect: minor overall, but major when indexing directly.
>
> Kind regards,
>
> Pieter Hijma
>
> On 16/11/16 20:32, Ben Harshbarger wrote:
> > Hi Pieter,
> >
> > Do you still see a problem with vectorAdd.chpl? I think 1D-convolution
> > has a somewhat separate performance issue due to accessing with
> > arbitrary indices (more overhead because we don't know if the index is
> > local). If vectorAdd isn't performing well, then that could hurt
> > 1D-convolution too.
> >
> > -Ben Harshbarger
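On Ben's point just above about arbitrary indices (the runtime not knowing whether an index is local), one way to take both the ownership check and the wide-pointer overhead out of the picture is to iterate each locale's local block explicitly. A minimal sketch, not from the original thread, assuming 1.14-era `localSubdomain()` queries and `local` blocks behave as documented:

```
use BlockDist;

config const n = 1024**2;
const D : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
var a, b, c : [D] real(32);

// One task group per locale; each locale touches only indices it owns,
// so no per-index "which locale owns this?" lookup should be needed.
coforall loc in Locales do on loc {
  const myIndices = c.localSubdomain();
  local {
    // Inside 'local' the compiler may drop wide-pointer overhead;
    // an accidental remote access becomes a runtime error instead.
    forall i in myIndices do
      c[i] = a[i] + b[i];
  }
}
```

If this variant matched the no-dmap timings, that would suggest the gap is locality-checking overhead rather than anything in the memory system.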
> > On 11/16/16, 3:33 AM, "Pieter Hijma" <[email protected]> wrote:
> >
> > Hi Ben,
> >
> > Good suggestion: I'm going to test with the 1D-convolution program, and
> > I use the Chapel version compiled with GCC 6.2.0. The single-locale
> > version is in directory 'datapar' and the multi-locale version is in
> > directory 'datapar-dist'.
> >
> > The Makefiles are now the same:
> >
> > $ diff datapar/Makefile datapar-dist/Makefile
> >
> > This means that I compile both programs with:
> >
> > $ CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o 1D-convolution \
> >     --fast 1D-convolution.chpl
> >
> > The job files are also the same:
> >
> > $ diff datapar/das4.job datapar-dist/das4.job
> >
> > The contents of das4.job (an SGE script):
> >
> > -----
> > #!/bin/bash
> > #$ -l h_rt=0:15:00
> > #$ -N CONVOLUTION_1D
> > #$ -cwd
> >
> > . ~/.bashrc
> >
> > SSH_SERVERS=`uniq $TMPDIR/machines | tr '\n' ' '`
> >
> > export CHPL_COMM=gasnet
> > export CHPL_COMM_SUBSTRATE=ibv
> > export CHPL_LAUNCHER=gasnetrun_ibv
> > export CHPL_RT_NUM_THREADS_PER_LOCALE=16
> >
> > export GASNET_IBV_SPAWNER=ssh
> > export GASNET_PHYSMEM_MAX=1G
> > export GASNET_SSH_SERVERS="$SSH_SERVERS"
> >
> > APP=./1D-convolution
> > ARGS=$*
> >
> > $APP $ARGS
> > -----
> >
> > The difference between the two source files is now basically that the
> > single-locale version has only the default domain map, whereas the
> > multi-locale version has the BlockDist domain map:
> >
> > $ diff datapar/1D-convolution.chpl datapar-dist/1D-convolution.chpl
> > 2a3
> > > use BlockDist;
> > 6c7
> > < const ProblemDomain : domain(1) = {0..#n};
> > ---
> > > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
> > ERROR: 1
> >
> > The output of 'datapar', the original single-locale version without
> > BlockDist:
> >
> > convolveIndices, n: 536870912
> > Time: 0.319077s
> > GFLOPS: 5.04772
> >
> > convolveZip, n: 536870912
> > Time: 0.320788s
> > GFLOPS: 5.0208
> >
> > The output of 'datapar-dist', the original multi-locale version with
> > BlockDist:
> >
> > convolveIndices, n: 536870912
> > Time: 3.1422s
> > GFLOPS: 0.512575
> >
> > convolveZip, n: 536870912
> > Time: 3.54989s
> > GFLOPS: 0.453708
> >
> > I guess we can conclude that merely adding the BlockDist domain map to
> > the ProblemDomain results in a factor-of-10 slowdown.
> >
> > Kind regards,
> >
> > Pieter Hijma
> >
> > On 14/11/16 20:21, Ben Harshbarger wrote:
> > > Hi Pieter,
> > >
> > > My next suggestion would be to try compiling and running the "single
> > > locale" variation with the same environment variables that you use
> > > for multi-locale. I'm wondering if the use of IBV is impacting
> > > performance in some way. I don't see the performance issue on our
> > > internal ibv cluster, but it's worth checking.
> > >
> > > -Ben Harshbarger
> > >
> > > On 11/8/16, 12:25 PM, "Pieter Hijma" <[email protected]> wrote:
> > >
> > > Hi Ben,
> > >
> > > Thanks for your help.
> > >
> > > On 07/11/16 18:59, Ben Harshbarger wrote:
> > > > When CHPL_COMM is set to 'none', our compiler can avoid introducing
> > > > some overhead that is necessary for multi-locale programs. You can
> > > > force this overhead when CHPL_COMM == none by compiling with the
> > > > flag "--no-local". If you compile your single-locale program with
> > > > that flag, does the performance get worse?
> > > It makes some difference, but not much:
> > >
> > > chpl -o vectoradd --fast vectoradd.chpl
> > >
> > > addNoDomain n: 1073741824
> > > Time: 0.57211s
> > > GFLOPS: 1.87681
> > >
> > > addZip n: 1073741824
> > > Time: 0.571799s
> > > GFLOPS: 1.87783
> > >
> > > addForall n: 1073741824
> > > Time: 0.571623s
> > > GFLOPS: 1.87841
> > >
> > > addCollective n: 1073741824
> > > Time: 0.571395s
> > > GFLOPS: 1.87916
> > >
> > > chpl -o vectoradd --fast --no-local vectoradd.chpl
> > >
> > > addNoDomain n: 1073741824
> > > Time: 0.62087s
> > > GFLOPS: 1.72941
> > >
> > > addZip n: 1073741824
> > > Time: 0.619997s
> > > GFLOPS: 1.73185
> > >
> > > addForall n: 1073741824
> > > Time: 0.620645s
> > > GFLOPS: 1.73004
> > >
> > > addCollective n: 1073741824
> > > Time: 0.620254s
> > > GFLOPS: 1.73113
> > >
> > > > If that's the case, I'm not entirely sure what the next step would
> > > > be. Do you have access to a newer version of GCC? The backend C
> > > > compiler can matter when it comes to optimizing the multi-locale
> > > > overhead.
> > >
> > > It is indeed an old one. We also have GCC 4.9.0, Intel 13.3, and I
> > > compiled GCC 6.2.0 to check:
> > >
> > > * intel/compiler/64/13.3/2013.3.163
> > >
> > > I basically see the same behavior:
> > >
> > > single locale:
> > >
> > > addNoDomain n: 536870912
> > > Time: 0.285186s
> > > GFLOPS: 1.88253
> > >
> > > addZip n: 536870912
> > > Time: 0.284819s
> > > GFLOPS: 1.88495
> > >
> > > addForall n: 536870912
> > > Time: 0.287904s
> > > GFLOPS: 1.86476
> > >
> > > addCollective n: 536870912
> > > Time: 0.284912s
> > > GFLOPS: 1.88434
> > >
> > > multi-locale, one node:
> > >
> > > addNoDomain n: 536870912
> > > Time: 3.24471s
> > > GFLOPS: 0.16546
> > >
> > > addZip n: 536870912
> > > Time: 3.01287s
> > > GFLOPS: 0.178192
> > >
> > > addForall n: 536870912
> > > Time: 7.23895s
> > > GFLOPS: 0.0741642
> > >
> > > addCollective n: 536870912
> > > Time: 2.59501s
> > > GFLOPS: 0.206886
> > >
> > > * GCC 4.9.0
> > >
> > > This is encouraging: the performance improves to within a factor of
> > > two of single-locale, except for the explicit indices in the forall:
> > >
> > > single locale:
> > >
> > > addNoDomain n: 536870912
> > > Time: 0.277222s
> > > GFLOPS: 1.93661
> > >
> > > addZip n: 536870912
> > > Time: 0.27566s
> > > GFLOPS: 1.94758
> > >
> > > addForall n: 536870912
> > > Time: 0.27609s
> > > GFLOPS: 1.94455
> > >
> > > addCollective n: 536870912
> > > Time: 0.275303s
> > > GFLOPS: 1.95011
> > >
> > > multi-locale, single node:
> > >
> > > addNoDomain n: 536870912
> > > Time: 0.492954s
> > > GFLOPS: 1.08909
> > >
> > > addZip n: 536870912
> > > Time: 0.493039s
> > > GFLOPS: 1.0889
> > >
> > > addForall n: 536870912
> > > Time: 2.85323s
> > > GFLOPS: 0.188162
> > >
> > > addCollective n: 536870912
> > > Time: 0.492135s
> > > GFLOPS: 1.0909
> > >
> > > * GCC 6.2.0
> > >
> > > The performance on multi-locale is now even better, but still very
> > > low for explicit indices in the forall.
> > > single locale:
> > >
> > > addNoDomain n: 536870912
> > > Time: 0.283272s
> > > GFLOPS: 1.89525
> > >
> > > addZip n: 536870912
> > > Time: 0.281942s
> > > GFLOPS: 1.90419
> > >
> > > addForall n: 536870912
> > > Time: 0.282291s
> > > GFLOPS: 1.90184
> > >
> > > addCollective n: 536870912
> > > Time: 0.281629s
> > > GFLOPS: 1.90631
> > >
> > > multi-locale, single node:
> > >
> > > addNoDomain n: 536870912
> > > Time: 0.358012s
> > > GFLOPS: 1.49959
> > >
> > > addZip n: 536870912
> > > Time: 0.356696s
> > > GFLOPS: 1.50512
> > >
> > > addForall n: 536870912
> > > Time: 2.92173s
> > > GFLOPS: 0.183751
> > >
> > > addCollective n: 536870912
> > > Time: 0.343808s
> > > GFLOPS: 1.56154
> > >
> > > Since this is encouraging, I also verified the performance of the
> > > 1D-stencils:
> > >
> > > * GCC 4.4.7
> > >
> > > For reference, the old compiler that I used initially:
> > >
> > > single locale:
> > >
> > > convolveIndices, n: 536870912
> > > Time: 0.82361s
> > > GFLOPS: 1.95555
> > >
> > > convolveZip, n: 536870912
> > > Time: 0.810028s
> > > GFLOPS: 1.98834
> > >
> > > multi-locale, one node:
> > >
> > > convolveIndices, n: 536870912
> > > Time: 4.25951s
> > > GFLOPS: 0.378122
> > >
> > > convolveZip, n: 536870912
> > > Time: 4.88046s
> > > GFLOPS: 0.330012
> > >
> > > * intel/compiler/64/13.3/2013.3.163
> > >
> > > On this compiler the single-locale performance is better than with
> > > the previous compiler. However, the multi-locale, one-node
> > > performance is about a factor of 3 slower than with the previous
> > > compiler.
> > >
> > > single locale:
> > >
> > > convolveIndices, n: 536870912
> > > Time: 0.554139s
> > > GFLOPS: 2.90651
> > >
> > > convolveZip, n: 536870912
> > > Time: 0.556653s
> > > GFLOPS: 2.89339
> > >
> > > multi-locale, one node:
> > >
> > > convolveIndices, n: 536870912
> > > Time: 10.5368s
> > > GFLOPS: 0.152856
> > >
> > > convolveZip, n: 536870912
> > > Time: 12.7625s
> > > GFLOPS: 0.126198
> > >
> > > * GCC 4.9.0
> > >
> > > The single-locale performance is much better than with GCC 4.4.7; the
> > > multi-locale, one-node configuration is still poor, although a bit
> > > better.
> > >
> > > single locale:
> > >
> > > convolveIndices, n: 536870912
> > > Time: 0.207055s
> > > GFLOPS: 7.77867
> > >
> > > convolveZip, n: 536870912
> > > Time: 0.206783s
> > > GFLOPS: 7.7889
> > >
> > > multi-locale, one node:
> > >
> > > convolveIndices, n: 536870912
> > > Time: 3.20851s
> > > GFLOPS: 0.501981
> > >
> > > convolveZip, n: 536870912
> > > Time: 3.652s
> > > GFLOPS: 0.441023
> > >
> > > * GCC 6.2.0
> > >
> > > Strangely enough, the single-locale performance is a bit lower than
> > > with the previous compiler, while the multi-locale, one-node
> > > performance is about the same.
> > >
> > > single-locale:
> > >
> > > convolveIndices, n: 536870912
> > > Time: 0.263151s
> > > GFLOPS: 6.12049
> > >
> > > convolveZip, n: 536870912
> > > Time: 0.262234s
> > > GFLOPS: 6.14189
> > >
> > > multi-locale, one node:
> > >
> > > convolveIndices, n: 536870912
> > > Time: 3.12716s
> > > GFLOPS: 0.515039
> > >
> > > convolveZip, n: 536870912
> > > Time: 3.58663s
> > > GFLOPS: 0.44906
> > >
> > > The conclusion is that the compiler indeed has a large impact on
> > > multi-locale performance, but probably only in simple cases such as
> > > vector addition. With the stencil code, although it is not very
> > > complicated, the performance falls back into the pattern that I came
> > > across originally.
> > > However, perhaps this gives you an idea of the optimizations that
> > > impact the performance? If we can't find a solution, I would at least
> > > like to understand the lack of performance.
> > >
> > > I also checked the performance of the stencils by not using the
> > > StencilDist but just the BlockDist and it makes no difference.
> > >
> > > > You may also want to consider setting CHPL_TARGET_ARCH to something
> > > > else if you're compiling on a machine architecture different from
> > > > the compute nodes. There's more information about CHPL_TARGET_ARCH
> > > > here:
> > > >
> > > > http://chapel.cray.com/docs/latest/usingchapel/chplenv.html#chpl-target-arch
> > >
> > > The head-node and compute-nodes are all Intel Xeon Westmeres, so I
> > > don't think that makes a difference. To be absolutely sure, I also
> > > compiled Chapel and the applications on a compute node and indeed,
> > > the performance is comparable to all measurements above.
> > >
> > > Kind regards,
> > >
> > > Pieter Hijma
> > >
> > > > On 11/7/16, 2:16 AM, "Pieter Hijma" <[email protected]> wrote:
> > > >
> > > > Dear Ben,
> > > >
> > > > Sorry for my late replies. Unfortunately, for some reason, these
> > > > emails are marked as spam even though I marked the list and your
> > > > address as safe. I will make sure I check my spam folders
> > > > meticulously from now on.
> > > >
> > > > On 28/10/16 23:34, Ben Harshbarger wrote:
> > > > > Hi Pieter,
> > > > >
> > > > > Sorry that you're still having issues. I think we'll need some
> > > > > more information before going forward:
> > > > >
> > > > > 1) Could you send us the output of
> > > > > "$CHPL_HOME/util/printchplenv --anonymize" ? It's a script that
> > > > > displays the various CHPL_ environment variables. "--anonymize"
> > > > > strips the output of information you may prefer to keep private
> > > > > (machine info, paths).
> > > >
> > > > This would be the setup if running single-locale programs:
> > > >
> > > > $ printchplenv --anonymize
> > > > CHPL_TARGET_PLATFORM: linux64
> > > > CHPL_TARGET_COMPILER: gnu
> > > > CHPL_TARGET_ARCH: native *
> > > > CHPL_LOCALE_MODEL: flat
> > > > CHPL_COMM: none
> > > > CHPL_TASKS: qthreads
> > > > CHPL_LAUNCHER: none
> > > > CHPL_TIMERS: generic
> > > > CHPL_UNWIND: none
> > > > CHPL_MEM: jemalloc
> > > > CHPL_MAKE: gmake
> > > > CHPL_ATOMICS: intrinsics
> > > > CHPL_GMP: gmp
> > > > CHPL_HWLOC: hwloc
> > > > CHPL_REGEXP: re2
> > > > CHPL_WIDE_POINTERS: struct
> > > > CHPL_AUX_FILESYS: none
> > > >
> > > > When I run multi-locale programs, I set the following environment
> > > > variables:
> > > >
> > > > export CHPL_COMM=gasnet
> > > > export CHPL_COMM_SUBSTRATE=ibv
> > > >
> > > > Then the Chapel environment would be:
> > > >
> > > > $ printchplenv --anonymize
> > > > CHPL_TARGET_PLATFORM: linux64
> > > > CHPL_TARGET_COMPILER: gnu
> > > > CHPL_TARGET_ARCH: native *
> > > > CHPL_LOCALE_MODEL: flat
> > > > CHPL_COMM: gasnet *
> > > > CHPL_COMM_SUBSTRATE: ibv *
> > > > CHPL_GASNET_SEGMENT: large
> > > > CHPL_TASKS: qthreads
> > > > CHPL_LAUNCHER: gasnetrun_ibv
> > > > CHPL_TIMERS: generic
> > > > CHPL_UNWIND: none
> > > > CHPL_MEM: jemalloc
> > > > CHPL_MAKE: gmake
> > > > CHPL_ATOMICS: intrinsics
> > > > CHPL_NETWORK_ATOMICS: none
> > > > CHPL_GMP: gmp
> > > > CHPL_HWLOC: hwloc
> > > > CHPL_REGEXP: re2
> > > > CHPL_WIDE_POINTERS: struct
> > > > CHPL_AUX_FILESYS: none
> > > >
> > > > > 2) What C compiler are you using?
> > > > $ gcc --version
> > > > gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16)
> > > > Copyright (C) 2010 Free Software Foundation, Inc.
> > > > This is free software; see the source for copying conditions. There
> > > > is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
> > > > PARTICULAR PURPOSE.
> > > >
> > > > > 3) Are you sure that the programs are being launched correctly?
> > > > > This might seem silly, but it's worth double-checking that the
> > > > > programs are actually running on the same hardware (not
> > > > > necessarily the same node though).
> > > >
> > > > I am completely certain that the single-locale program, the
> > > > multi-locale program for one node, and the multi-locale for
> > > > multiple nodes are running on the compute nodes. I'm not completely
> > > > sure what you mean by "the same hardware". All compute nodes have
> > > > the same hardware if that is what you mean.
> > > >
> > > > > I'd also like to clarify what you mean by "multi-locale
> > > > > compiled". Is the difference between the programs just the use of
> > > > > the Block domain map, or do you compile with different
> > > > > environment variables set?
> > > >
> > > > I compile different programs and I use different environment
> > > > variables:
> > > >
> > > > The single-locale version vectoradd is located in the datapar
> > > > directory, whereas the multi-locale version is located in the
> > > > datapar-dist directory. What follows is the diff for the .chpl
> > > > file:
> > > >
> > > > $ diff datapar/vectoradd.chpl datapar-dist/vectoradd.chpl
> > > > 8c8
> > > > < const ProblemDomain : domain(1) = {0..#n};
> > > > ---
> > > > > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
> > > >
> > > > The diff for the Makefile:
> > > >
> > > > $ diff datapar/Makefile datapar-dist/Makefile
> > > > 2a3
> > > > > DIST_FLAGS = CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv
> > > > 8c9
> > > > < $(CHPL) -o $@ $(FLAGS) $<
> > > > ---
> > > > > $(DIST_FLAGS) $(CHPL) -o $@ $(FLAGS) $<
> > > > 11c12
> > > > < rm -f $(APP)
> > > > ---
> > > > > rm -f $(APP) $(APP)_real
> > > >
> > > > Thanks for your help, and again my apologies for the delayed
> > > > answers.
> > > >
> > > > Kind regards,
> > > >
> > > > Pieter Hijma
> > > >
> > > > > -Ben Harshbarger
> > > > >
> > > > > On 10/27/16, 5:19 AM, "Pieter Hijma" <[email protected]> wrote:
> > > > >
> > > > > Hi Ben,
> > > > >
> > > > > Thank you for your fast reply and suggestions! I did some more
> > > > > tests and also included stencil operations.
> > > > > First, the vector addition:
> > > > >
> > > > > vectoradd.chpl
> > > > > --------------
> > > > > use Time;
> > > > > use Random;
> > > > > use BlockDist;
> > > > > //use VisualDebug;
> > > > >
> > > > > config const n = 1024**3/2;
> > > > >
> > > > > // for multi-locale
> > > > > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
> > > > > // for single-locale
> > > > > // const ProblemDomain : domain(1) = {0..#n};
> > > > >
> > > > > type float = real(32);
> > > > >
> > > > > proc addNoDomain(c : [] float, a : [] float, b : [] float) {
> > > > >   forall (ci, ai, bi) in zip(c, a, b) {
> > > > >     ci = ai + bi;
> > > > >   }
> > > > > }
> > > > >
> > > > > proc addZip(c : [ProblemDomain] float, a : [ProblemDomain] float,
> > > > >             b : [ProblemDomain] float) {
> > > > >   forall (ci, ai, bi) in zip(c, a, b) {
> > > > >     ci = ai + bi;
> > > > >   }
> > > > > }
> > > > >
> > > > > proc addForall(c : [ProblemDomain] float, a : [ProblemDomain] float,
> > > > >                b : [ProblemDomain] float) {
> > > > >   //startVdebug("vdata");
> > > > >   forall i in ProblemDomain {
> > > > >     c[i] = a[i] + b[i];
> > > > >   }
> > > > >   //stopVdebug();
> > > > > }
> > > > >
> > > > > proc addCollective(c : [ProblemDomain] float, a : [ProblemDomain] float,
> > > > >                    b : [ProblemDomain] float) {
> > > > >   c = a + b;
> > > > > }
> > > > >
> > > > > proc output(t : Timer, n, testName) {
> > > > >   t.stop();
> > > > >   writeln(testName, " n: ", n);
> > > > >   writeln("Time: ", t.elapsed(), "s");
> > > > >   writeln("GFLOPS: ", n / t.elapsed() / 1e9);
> > > > >   writeln();
> > > > >   t.clear();
> > > > > }
> > > > >
> > > > > proc main() {
> > > > >   var c : [ProblemDomain] float;
> > > > >   var a : [ProblemDomain] float;
> > > > >   var b : [ProblemDomain] float;
> > > > >   var t : Timer;
> > > > >
> > > > >   fillRandom(a, 0);
> > > > >   fillRandom(b, 42);
> > > > >
> > > > >   t.start();
> > > > >   addNoDomain(c, a, b);
> > > > >   output(t, n, "addNoDomain");
> > > > >
> > > > >   t.start();
> > > > >   addZip(c, a, b);
> > > > >   output(t, n, "addZip");
> > > > >
> > > > >   t.start();
> > > > >   addForall(c, a, b);
> > > > >   output(t, n, "addForall");
> > > > >
> > > > >   t.start();
> > > > >   addCollective(c, a, b);
> > > > >   output(t, n, "addCollective");
> > > > > }
> > > > > -----
> > > > >
> > > > > On a single locale I get as output:
> > > > >
> > > > > addNoDomain n: 536870912
> > > > > Time: 0.27961s
> > > > > GFLOPS: 1.92007
> > > > >
> > > > > addZip n: 536870912
> > > > > Time: 0.278657s
> > > > > GFLOPS: 1.92664
> > > > >
> > > > > addForall n: 536870912
> > > > > Time: 0.278015s
> > > > > GFLOPS: 1.93109
> > > > >
> > > > > addCollective n: 536870912
> > > > > Time: 0.278379s
> > > > > GFLOPS: 1.92856
> > > > >
> > > > > On multi-locale (-nl 1) I get as output:
> > > > >
> > > > > addNoDomain n: 536870912
> > > > > Time: 2.16806s
> > > > > GFLOPS: 0.247627
> > > > >
> > > > > addZip n: 536870912
> > > > > Time: 2.17024s
> > > > > GFLOPS: 0.247378
> > > > >
> > > > > addForall n: 536870912
> > > > > Time: 4.78443s
> > > > > GFLOPS: 0.112212
> > > > >
> > > > > addCollective n: 536870912
> > > > > Time: 2.19838s
> > > > > GFLOPS: 0.244212
> > > > >
> > > > > So, indeed, your suggestion improves it by more than a factor of
> > > > > two, but it is still close to a factor of 8 slower than
> > > > > single-locale.
> > > > >
> > > > > I also used chplvis and verified that there are no gets and puts
> > > > > when running multi-locale with more than one node.
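As a cross-check on the chplvis observation above, the same thing can be counted in-program. A minimal sketch, not from the original thread, assuming the standard CommDiagnostics module's start/stop/get interface:

```
use CommDiagnostics;

// Wrap the kernel of interest; on -nl 1 the get/put counts should come
// back (near-)zero if the slowdown is pure local overhead rather than
// actual communication.
startCommDiagnostics();
addForall(c, a, b);
stopCommDiagnostics();
writeln(getCommDiagnostics());  // one entry per locale
```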
> > > > > The profiling information is clear, but not very helpful (to me):
> > > > >
> > > > > multi-locale (-nl 1):
> > > > >
> > > > > | 65.3451 | wrapcoforall_fn_chpl5 | vectoradd.chpl:26 |
> > > > > | 4.8777  | wrapon_fn_chpl35      | vectoradd.chpl:26 |
> > > > >
> > > > > single-locale:
> > > > >
> > > > > | 5.0019 | wrapcoforall_fn_chpl5 | vectoradd.chpl:26 |
> > > > >
> > > > > For stencil operations, I used the following program:
> > > > >
> > > > > 1d-convolution.chpl
> > > > > -------------------
> > > > > use Time;
> > > > > use Random;
> > > > > use StencilDist;
> > > > >
> > > > > config const n = 1024**3/2;
> > > > >
> > > > > const ProblemDomain : domain(1) dmapped Stencil(boundingBox = {0..#n},
> > > > >                                                 fluff = (1,))
> > > > >   = {0..#n};
> > > > > const InnerDomain : subdomain(ProblemDomain) = {1..n-2};
> > > > >
> > > > > proc convolveIndices(output : [ProblemDomain] real(32),
> > > > >                      input : [ProblemDomain] real(32)) {
> > > > >   forall i in InnerDomain {
> > > > >     output[i] = ((input[i-1] + input[i] + input[i+1])/3:real(32));
> > > > >   }
> > > > > }
> > > > >
> > > > > proc convolveZip(output : [ProblemDomain] real(32),
> > > > >                  input : [ProblemDomain] real(32)) {
> > > > >   forall (im1, i, ip1) in zip(InnerDomain.translate(-1),
> > > > >                               InnerDomain,
> > > > >                               InnerDomain.translate(1)) {
> > > > >     output[i] = ((input[im1] + input[i] + input[ip1])/3:real(32));
> > > > >   }
> > > > > }
> > > > >
> > > > > proc print(t : Timer, n, s) {
> > > > >   t.stop();
> > > > >   writeln(s, ", n: ", n);
> > > > >   writeln("Time: ", t.elapsed(), "s");
> > > > >   writeln("GFLOPS: ", 3 * n / 1e9 / t.elapsed());
> > > > >   writeln();
> > > > >   t.clear();
> > > > > }
> > > > >
> > > > > proc main() {
> > > > >   var input : [ProblemDomain] real(32);
> > > > >   var output : [ProblemDomain] real(32);
> > > > >   var t : Timer;
> > > > >
> > > > >   fillRandom(input, 42);
> > > > >
> > > > >   t.start();
> > > > >   convolveIndices(output, input);
> > > > >   print(t, n, "convolveIndices");
> > > > >
> > > > >   t.start();
> > > > >   convolveZip(output, input);
> > > > >   print(t, n, "convolveZip");
> > > > > }
> > > > > ------
> > > > >
> > > > > Interestingly, in contrast to your earlier suggestion, the direct
> > > > > indexing works a bit better in this program than the zipped
> > > > > version:
> > > > >
> > > > > Multi-locale (-nl 1):
> > > > >
> > > > > convolveIndices, n: 536870912
> > > > > Time: 4.27148s
> > > > > GFLOPS: 0.377062
> > > > >
> > > > > convolveZip, n: 536870912
> > > > > Time: 4.87291s
> > > > > GFLOPS: 0.330524
> > > > >
> > > > > Single-locale:
> > > > >
> > > > > convolveIndices, n: 536870912
> > > > > Time: 0.548804s
> > > > > GFLOPS: 2.93477
> > > > >
> > > > > convolveZip, n: 536870912
> > > > > Time: 0.538754s
> > > > > GFLOPS: 2.98951
> > > > >
> > > > > Again, the multi-locale is about a factor of 8 slower than
> > > > > single-locale. By the way, the Stencil distribution is a bit
> > > > > faster than the Block distribution.
> > > > >
> > > > > Thanks in advance for your input,
> > > > >
> > > > > Pieter
> > > > >
> > > > > On 24/10/16 19:20, Ben Harshbarger wrote:
> > > > > > Hi Pieter,
> > > > > >
> > > > > > Thanks for providing the example, that's very helpful.
> > > > > >
> > > > > > Multi-locale performance in Chapel is not yet where we'd like
> > > > > > it to be, but we've done a lot of work over the past few
> > > > > > releases to get cases like yours performing well.
> > > > > > It's surprising that using Block results in that much of a
> > > > > > difference, but I think you would see better performance by
> > > > > > iterating over the arrays directly:
> > > > > >
> > > > > > ```
> > > > > > // replace the loop in the 'add' function with this:
> > > > > > forall (ci, ai, bi) in zip(c, a, b) {
> > > > > >   ci = ai + bi;
> > > > > > }
> > > > > > ```
> > > > > >
> > > > > > Block-distributed arrays can leverage the fast-follower
> > > > > > optimization to perform better when all arrays being iterated
> > > > > > over share the same domain. You can also write that loop in a
> > > > > > cleaner way by leveraging array promotion:
> > > > > >
> > > > > > ```
> > > > > > // This is equivalent to the first loop
> > > > > > c = a + b;
> > > > > > ```
> > > > > >
> > > > > > However, when I tried the promoted variation on my machine I
> > > > > > observed worse performance than the explicit forall-loop. It
> > > > > > seems to be related to the way the arguments of the 'add'
> > > > > > function are declared. If you replaced "[ProblemDomain] float"
> > > > > > with "[] float", performance seems to improve. That surprised a
> > > > > > couple of us on the development team, and I'll be looking at
> > > > > > that some more today.
> > > > > >
> > > > > > If you're still seeing significantly worse performance with
> > > > > > Block compared to the default rectangular domain, and the
> > > > > > programs are launched in the same way, that would be odd. You
> > > > > > could try profiling using chplvis. I agree though that there
> > > > > > shouldn't be any communication in this program. You can find
> > > > > > more information on chplvis here in the online 1.14 release
> > > > > > documentation:
> > > > > >
> > > > > > http://chapel.cray.com/docs/latest/tools/chplvis/chplvis.html
> > > > > >
> > > > > > I hope that rewriting the loops solves the problem, but let us
> > > > > > know if it doesn't and we can continue investigating.
> > > > > >
> > > > > > -Ben Harshbarger
> > > > > >
> > > > > > On 10/24/16, 6:19 AM, "Pieter Hijma" <[email protected]> wrote:
> > > > > >
> > > > > > Dear all,
> > > > > >
> > > > > > My apologies if this has already been asked before. I'm new to
> > > > > > the list and couldn't find it in the archives.
> > > > > >
> > > > > > I experience bad performance when running the multi-locale
> > > > > > compiled version on an InfiniBand-equipped cluster
> > > > > > (http://cs.vu.nl/das4/clusters.shtml, VU site), even with only
> > > > > > one node.
> > > > > > Below you find a minimal example that exhibits the same
> > > > > > performance problems as all my programs.
> > > > > >
> > > > > > I compiled chapel-1.14.0 with the following steps:
> > > > > >
> > > > > > export CHPL_TARGET_ARCH=native
> > > > > > make -j
> > > > > > export CHPL_COMM=gasnet
> > > > > > export CHPL_COMM_SUBSTRATE=ibv
> > > > > > make clean
> > > > > > make -j
> > > > > >
> > > > > > I compile the following Chapel code:
> > > > > >
> > > > > > vectoradd.chpl:
> > > > > > ---------------
> > > > > > use Time;
> > > > > > use Random;
> > > > > > use BlockDist;
> > > > > >
> > > > > > config const n = 1024**3;
> > > > > >
> > > > > > // for single-locale
> > > > > > // const ProblemDomain : domain(1) = {0..#n};
> > > > > > // for multi-locale
> > > > > > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) =
> > > > > >   {0..#n};
> > > > > >
> > > > > > type float = real(32);
> > > > > >
> > > > > > proc add(c : [ProblemDomain] float, a : [ProblemDomain] float,
> > > > > >          b : [ProblemDomain] float) {
> > > > > >   forall i in ProblemDomain {
> > > > > >     c[i] = a[i] + b[i];
> > > > > >   }
> > > > > > }
> > > > > >
> > > > > > proc main() {
> > > > > >   var c : [ProblemDomain] float;
> > > > > >   var a : [ProblemDomain] float;
> > > > > >   var b : [ProblemDomain] float;
> > > > > >   var t : Timer;
> > > > > >
> > > > > >   fillRandom(a, 0);
> > > > > >   fillRandom(b, 42);
> > > > > >
> > > > > >   t.start();
> > > > > >   add(c, a, b);
> > > > > >   t.stop();
> > > > > >
> > > > > >   writeln("n: ", n);
> > > > > >   writeln("Time: ", t.elapsed(), "s");
> > > > > >   writeln("GFLOPS: ", n / t.elapsed() / 1e9);
> > > > > > }
> > > > > > ----
> > > > > >
> > > > > > I compile this for single-locale (using no domain maps, see the
> > > > > > comment above in the source) with:
> > > > > >
> > > > > > chpl -o vectoradd --fast vectoradd.chpl
> > > > > >
> > > > > > I run it (on a dual quad-core node with 2 hardware threads per
> > > > > > core) with:
> > > > > >
> > > > > > export CHPL_RT_NUM_THREADS_PER_LOCALE=16
> > > > > > ./vectoradd
> > > > > >
> > > > > > And get as output:
> > > > > >
> > > > > > n: 1073741824
> > > > > > Time: 0.558806s
> > > > > > GFLOPS: 1.92149
> > > > > >
> > > > > > However, the performance for multi-locale is much worse.
> > > > > >
> > > > > > I compile this for multi-locale (using domain maps, see the
> > > > > > comment in the source) with:
> > > > > >
> > > > > > CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o vectoradd --fast \
> > > > > >     vectoradd.chpl
> > > > > >
> > > > > > I run it on the same type of node with:
> > > > > >
> > > > > > SSH_SERVERS=`uniq $TMPDIR/machines | tr '\n' ' '`
> > > > > >
> > > > > > export GASNET_PHYSMEM_MAX=1G
> > > > > > export GASNET_IBV_SPAWNER=ssh
> > > > > > export GASNET_SSH_SERVERS="$SSH_SERVERS"
> > > > > >
> > > > > > export CHPL_RT_NUM_THREADS_PER_LOCALE=16
> > > > > > export CHPL_LAUNCHER=gasnetrun_ibv
> > > > > > export CHPL_COMM=gasnet
> > > > > > export CHPL_COMM_SUBSTRATE=ibv
> > > > > >
> > > > > > ./vectoradd -nl 1
> > > > > >
> > > > > > And get as output:
> > > > > >
> > > > > > n: 1073741824
> > > > > > Time: 8.65082s
> > > > > > GFLOPS: 0.12412
> > > > > >
> > > > > > I would understand a performance difference of say 10% because
> > > > > > of multi-locale execution, but not factors. Is this to be
> > > > > > expected from the current state of Chapel?
> > > > > > This performance difference is representative of basically all
> > > > > > my programs, which are also more realistic and use larger
> > > > > > inputs. The performance is strange as there is no communication
> > > > > > necessary (only one node) and the program is using the same
> > > > > > number of threads.
> > > > > >
> > > > > > Is there any way for me to investigate this, using profiling
> > > > > > for example?
> > > > > >
> > > > > > By the way, the program does scale well to multiple nodes
> > > > > > (which is not difficult given the baseline):
> > > > > >
> > > > > > 1  | 8.65s
> > > > > > 2  | 2.67s
> > > > > > 4  | 1.69s
> > > > > > 8  | 0.87s
> > > > > > 16 | 0.41s
> > > > > >
> > > > > > Thanks in advance for your input.
> > > > > >
> > > > > > Kind regards,
> > > > > >
> > > > > > Pieter Hijma
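A closing note on the comment-toggling between the single-locale and multi-locale ProblemDomain declarations in vectoradd.chpl above: a param-folded conditional can select the domain map from a single source file. A minimal sketch, not code from this thread, assuming a hypothetical `useBlock` switch set at compile time with `chpl -suseBlock=true`:

```
use BlockDist;

config const n = 1024**3;

// Hypothetical compile-time switch; only the chosen branch should be
// resolved, so the single-locale build keeps the default domain.
config param useBlock = false;

const ProblemDomain = if useBlock
  then {0..#n} dmapped Block(boundingBox = {0..#n})
  else {0..#n};
```

This would avoid editing the source between the single-locale and multi-locale experiments, and rule out accidental mismatches between the two directories.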
