Hi Ben,

No, thank you for your help and explanations! I would certainly be
interested in learning about your results.
Kind regards,

Pieter

On 17/11/16 18:25, Ben Harshbarger wrote:
> Hi Pieter,
>
> I think that's a fair conclusion. Thanks for being patient and providing
> the code examples and timings. On my end, I'll try and work on creating
> some mini-benchmarks to track this difference more clearly (for both the
> vectorAdd and 1D-convolution use-cases).
>
> As far as understanding overhead for direct-indexing in BlockDist, here
> are some issues that I'm aware of:
> - Overhead when the compiler is unable to infer that data is local. In
>   such cases we introduce "wide pointers", which the back-end C compiler
>   may not be able to optimize as effectively as a normal pointer. This is
>   what may have been impacted by switching from gcc 4.4 to 6.2.
> - Overhead to check which locale "owns" an index
> - Possible overhead for privatized objects
>
> Though this isn't a satisfying conclusion, hopefully it helps to get some
> sense of how different patterns perform.
>
> -Ben Harshbarger
>
> On 11/17/16, 1:28 AM, "Pieter Hijma" <[email protected]> wrote:
>
> Hi Ben,
>
> Same setup, testing with GCC 6.2.0, single-locale in directory 'datapar',
> multi-locale in directory 'datapar-dist'.
>
> Similarly to the 1D-convolution case, the Makefiles are the same:
>
> $ diff datapar/Makefile datapar-dist/Makefile
>
> This means that I compile both programs with:
>
> $ CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o vectoradd --fast \
>     vectoradd.chpl
>
> The job files are also the same:
>
> $ diff datapar/das4.job datapar-dist/das4.job
>
> The contents of das4.job (an SGE script) are basically the same as for
> the 1D-convolution:
>
> ---
> #!/bin/bash
> #$ -l h_rt=0:15:00
> #$ -N VECTORADD
> #$ -cwd
>
> . ~/.bashrc
>
> SSH_SERVERS=`uniq $TMPDIR/machines | tr '\n' ' '`
>
> export CHPL_COMM=gasnet
> export CHPL_COMM_SUBSTRATE=ibv
> export CHPL_LAUNCHER=gasnetrun_ibv
> export CHPL_RT_NUM_THREADS_PER_LOCALE=16
>
> export GASNET_IBV_SPAWNER=ssh
> export GASNET_PHYSMEM_MAX=1G
> export GASNET_SSH_SERVERS="$SSH_SERVERS"
>
> APP=./vectoradd
> ARGS=$*
>
> $APP $ARGS
> ---
>
> The two source files differ in the use of domain-maps:
>
> $ diff datapar/vectoradd.chpl datapar-dist/vectoradd.chpl
> 2a3
> > use BlockDist;
> 7c8
> < const ProblemDomain : domain(1) = {0..#n};
> ---
> > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
> ERROR: 1
>
> The output of 'datapar', originally the single-locale version:
>
> addNoDomain n: 536870912
> Time: 0.329722s
> GFLOPS: 1.62825
>
> addZip n: 536870912
> Time: 0.328751s
> GFLOPS: 1.63306
>
> addForall n: 536870912
> Time: 0.325768s
> GFLOPS: 1.64802
>
> addCollective n: 536870912
> Time: 0.330918s
> GFLOPS: 1.62237
>
> The output of 'datapar-dist', the multi-locale version:
>
> addNoDomain n: 536870912
> Time: 0.373368s
> GFLOPS: 1.43791
>
> addZip n: 536870912
> Time: 0.372561s
> GFLOPS: 1.44103
>
> addForall n: 536870912
> Time: 2.66822s
> GFLOPS: 0.201209
>
> addCollective n: 536870912
> Time: 0.36856s
> GFLOPS: 1.45667
>
> I guess the conclusion is that in this case too, the use of BlockDist has
> an effect: minor overall, but major when indexing directly.
>
> Kind regards,
>
> Pieter Hijma
>
> On 16/11/16 20:32, Ben Harshbarger wrote:
> > Hi Pieter,
> >
> > Do you still see a problem with vectorAdd.chpl? I think 1D-convolution
> > has a somewhat separate performance issue due to accessing with
> > arbitrary indices (more overhead because we don't know if the index is
> > local). If vectorAdd isn't performing well, then that could hurt
> > 1D-convolution too.
> >
> > -Ben Harshbarger
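On Ben's point just above about arbitrary indices (the runtime not knowing whether an index is local), one way to take both the ownership check and the wide-pointer overhead out of the picture is to iterate each locale's local block explicitly. A minimal sketch, not from the original thread, assuming 1.14-era `localSubdomain()` queries and `local` blocks behave as documented:

```
use BlockDist;

config const n = 1024**2;
const D : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
var a, b, c : [D] real(32);

// One task group per locale; each locale touches only indices it owns,
// so no per-index "which locale owns this?" lookup should be needed.
coforall loc in Locales do on loc {
  const myIndices = c.localSubdomain();
  local {
    // Inside 'local' the compiler may drop wide-pointer overhead;
    // an accidental remote access becomes a runtime error instead.
    forall i in myIndices do
      c[i] = a[i] + b[i];
  }
}
```

If this variant matched the no-dmap timings, that would suggest the gap is locality-checking overhead rather than anything in the memory system.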
> > On 11/16/16, 3:33 AM, "Pieter Hijma" <[email protected]> wrote:
> >
> > Hi Ben,
> >
> > Good suggestion: I'm going to test with the 1D-convolution program, and
> > I use the Chapel version compiled with GCC 6.2.0. The single-locale
> > version is in directory 'datapar' and the multi-locale version is in
> > directory 'datapar-dist'.
> >
> > The Makefiles are now the same:
> >
> > $ diff datapar/Makefile datapar-dist/Makefile
> >
> > This means that I compile both programs with:
> >
> > $ CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o 1D-convolution \
> >     --fast 1D-convolution.chpl
> >
> > The job files are also the same:
> >
> > $ diff datapar/das4.job datapar-dist/das4.job
> >
> > The contents of das4.job (an SGE script):
> >
> > -----
> > #!/bin/bash
> > #$ -l h_rt=0:15:00
> > #$ -N CONVOLUTION_1D
> > #$ -cwd
> >
> > . ~/.bashrc
> >
> > SSH_SERVERS=`uniq $TMPDIR/machines | tr '\n' ' '`
> >
> > export CHPL_COMM=gasnet
> > export CHPL_COMM_SUBSTRATE=ibv
> > export CHPL_LAUNCHER=gasnetrun_ibv
> > export CHPL_RT_NUM_THREADS_PER_LOCALE=16
> >
> > export GASNET_IBV_SPAWNER=ssh
> > export GASNET_PHYSMEM_MAX=1G
> > export GASNET_SSH_SERVERS="$SSH_SERVERS"
> >
> > APP=./1D-convolution
> > ARGS=$*
> >
> > $APP $ARGS
> > -----
> >
> > The difference between the two source files is now basically that the
> > single-locale version has only the default domain map, whereas the
> > multi-locale version has the BlockDist domain map:
> >
> > $ diff datapar/1D-convolution.chpl datapar-dist/1D-convolution.chpl
> > 2a3
> > > use BlockDist;
> > 6c7
> > < const ProblemDomain : domain(1) = {0..#n};
> > ---
> > > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
> > ERROR: 1
> >
> > The output of 'datapar', the original single-locale version without
> > BlockDist:
> >
> > convolveIndices, n: 536870912
> > Time: 0.319077s
> > GFLOPS: 5.04772
> >
> > convolveZip, n: 536870912
> > Time: 0.320788s
> > GFLOPS: 5.0208
> >
> > The output of 'datapar-dist', the original multi-locale version with
> > BlockDist:
> >
> > convolveIndices, n: 536870912
> > Time: 3.1422s
> > GFLOPS: 0.512575
> >
> > convolveZip, n: 536870912
> > Time: 3.54989s
> > GFLOPS: 0.453708
> >
> > I guess we can conclude that merely adding the BlockDist domain map to
> > the ProblemDomain results in a factor-of-10 slowdown.
> >
> > Kind regards,
> >
> > Pieter Hijma
> >
> > On 14/11/16 20:21, Ben Harshbarger wrote:
> > > Hi Pieter,
> > >
> > > My next suggestion would be to try compiling and running the "single
> > > locale" variation with the same environment variables that you use
> > > for multi-locale. I'm wondering if the use of IBV is impacting
> > > performance in some way. I don't see the performance issue on our
> > > internal ibv cluster, but it's worth checking.
> > >
> > > -Ben Harshbarger
> > >
> > > On 11/8/16, 12:25 PM, "Pieter Hijma" <[email protected]> wrote:
> > >
> > > Hi Ben,
> > >
> > > Thanks for your help.
> > >
> > > On 07/11/16 18:59, Ben Harshbarger wrote:
> > > > When CHPL_COMM is set to 'none', our compiler can avoid introducing
> > > > some overhead that is necessary for multi-locale programs. You can
> > > > force this overhead when CHPL_COMM == none by compiling with the
> > > > flag "--no-local". If you compile your single-locale program with
> > > > that flag, does the performance get worse?
> > > It makes some difference, but not much:
> > >
> > > chpl -o vectoradd --fast vectoradd.chpl
> > >
> > > addNoDomain n: 1073741824
> > > Time: 0.57211s
> > > GFLOPS: 1.87681
> > >
> > > addZip n: 1073741824
> > > Time: 0.571799s
> > > GFLOPS: 1.87783
> > >
> > > addForall n: 1073741824
> > > Time: 0.571623s
> > > GFLOPS: 1.87841
> > >
> > > addCollective n: 1073741824
> > > Time: 0.571395s
> > > GFLOPS: 1.87916
> > >
> > > chpl -o vectoradd --fast --no-local vectoradd.chpl
> > >
> > > addNoDomain n: 1073741824
> > > Time: 0.62087s
> > > GFLOPS: 1.72941
> > >
> > > addZip n: 1073741824
> > > Time: 0.619997s
> > > GFLOPS: 1.73185
> > >
> > > addForall n: 1073741824
> > > Time: 0.620645s
> > > GFLOPS: 1.73004
> > >
> > > addCollective n: 1073741824
> > > Time: 0.620254s
> > > GFLOPS: 1.73113
> > >
> > > > If that's the case, I'm not entirely sure what the next step would
> > > > be. Do you have access to a newer version of GCC? The backend C
> > > > compiler can matter when it comes to optimizing the multi-locale
> > > > overhead.
> > >
> > > It is indeed an old one. We also have GCC 4.9.0, Intel 13.3, and I
> > > compiled GCC 6.2.0 to check:
> > >
> > > * intel/compiler/64/13.3/2013.3.163
> > >
> > > I basically see the same behavior:
> > >
> > > single locale:
> > >
> > > addNoDomain n: 536870912
> > > Time: 0.285186s
> > > GFLOPS: 1.88253
> > >
> > > addZip n: 536870912
> > > Time: 0.284819s
> > > GFLOPS: 1.88495
> > >
> > > addForall n: 536870912
> > > Time: 0.287904s
> > > GFLOPS: 1.86476
> > >
> > > addCollective n: 536870912
> > > Time: 0.284912s
> > > GFLOPS: 1.88434
> > >
> > > multi-locale, one node:
> > >
> > > addNoDomain n: 536870912
> > > Time: 3.24471s
> > > GFLOPS: 0.16546
> > >
> > > addZip n: 536870912
> > > Time: 3.01287s
> > > GFLOPS: 0.178192
> > >
> > > addForall n: 536870912
> > > Time: 7.23895s
> > > GFLOPS: 0.0741642
> > >
> > > addCollective n: 536870912
> > > Time: 2.59501s
> > > GFLOPS: 0.206886
> > >
> > > * GCC 4.9.0
> > >
> > > This is encouraging: the performance improves to within a factor of
> > > two of single-locale, except for the explicit indices in the forall:
> > >
> > > single locale:
> > >
> > > addNoDomain n: 536870912
> > > Time: 0.277222s
> > > GFLOPS: 1.93661
> > >
> > > addZip n: 536870912
> > > Time: 0.27566s
> > > GFLOPS: 1.94758
> > >
> > > addForall n: 536870912
> > > Time: 0.27609s
> > > GFLOPS: 1.94455
> > >
> > > addCollective n: 536870912
> > > Time: 0.275303s
> > > GFLOPS: 1.95011
> > >
> > > multi-locale, single node:
> > >
> > > addNoDomain n: 536870912
> > > Time: 0.492954s
> > > GFLOPS: 1.08909
> > >
> > > addZip n: 536870912
> > > Time: 0.493039s
> > > GFLOPS: 1.0889
> > >
> > > addForall n: 536870912
> > > Time: 2.85323s
> > > GFLOPS: 0.188162
> > >
> > > addCollective n: 536870912
> > > Time: 0.492135s
> > > GFLOPS: 1.0909
> > >
> > > * GCC 6.2.0
> > >
> > > The performance on multi-locale is now even better, but still very
> > > low for explicit indices in the forall.
> > > single locale:
> > >
> > > addNoDomain n: 536870912
> > > Time: 0.283272s
> > > GFLOPS: 1.89525
> > >
> > > addZip n: 536870912
> > > Time: 0.281942s
> > > GFLOPS: 1.90419
> > >
> > > addForall n: 536870912
> > > Time: 0.282291s
> > > GFLOPS: 1.90184
> > >
> > > addCollective n: 536870912
> > > Time: 0.281629s
> > > GFLOPS: 1.90631
> > >
> > > multi-locale, single node:
> > >
> > > addNoDomain n: 536870912
> > > Time: 0.358012s
> > > GFLOPS: 1.49959
> > >
> > > addZip n: 536870912
> > > Time: 0.356696s
> > > GFLOPS: 1.50512
> > >
> > > addForall n: 536870912
> > > Time: 2.92173s
> > > GFLOPS: 0.183751
> > >
> > > addCollective n: 536870912
> > > Time: 0.343808s
> > > GFLOPS: 1.56154
> > >
> > > Since this is encouraging, I also verified the performance of the
> > > 1D-stencils:
> > >
> > > * GCC 4.4.7
> > >
> > > For reference, the old compiler that I used initially:
> > >
> > > single locale:
> > >
> > > convolveIndices, n: 536870912
> > > Time: 0.82361s
> > > GFLOPS: 1.95555
> > >
> > > convolveZip, n: 536870912
> > > Time: 0.810028s
> > > GFLOPS: 1.98834
> > >
> > > multi-locale, one node:
> > >
> > > convolveIndices, n: 536870912
> > > Time: 4.25951s
> > > GFLOPS: 0.378122
> > >
> > > convolveZip, n: 536870912
> > > Time: 4.88046s
> > > GFLOPS: 0.330012
> > >
> > > * intel/compiler/64/13.3/2013.3.163
> > >
> > > On this compiler the single-locale performance is better than with
> > > the previous compiler. However, the multi-locale, one-node
> > > performance is about a factor of 3 slower than with the previous
> > > compiler.
> > >
> > > single locale:
> > >
> > > convolveIndices, n: 536870912
> > > Time: 0.554139s
> > > GFLOPS: 2.90651
> > >
> > > convolveZip, n: 536870912
> > > Time: 0.556653s
> > > GFLOPS: 2.89339
> > >
> > > multi-locale, one node:
> > >
> > > convolveIndices, n: 536870912
> > > Time: 10.5368s
> > > GFLOPS: 0.152856
> > >
> > > convolveZip, n: 536870912
> > > Time: 12.7625s
> > > GFLOPS: 0.126198
> > >
> > > * GCC 4.9.0
> > >
> > > The single-locale performance is much better than with GCC 4.4.7; the
> > > multi-locale, one-node configuration is still poor, although a bit
> > > better.
> > >
> > > single locale:
> > >
> > > convolveIndices, n: 536870912
> > > Time: 0.207055s
> > > GFLOPS: 7.77867
> > >
> > > convolveZip, n: 536870912
> > > Time: 0.206783s
> > > GFLOPS: 7.7889
> > >
> > > multi-locale, one node:
> > >
> > > convolveIndices, n: 536870912
> > > Time: 3.20851s
> > > GFLOPS: 0.501981
> > >
> > > convolveZip, n: 536870912
> > > Time: 3.652s
> > > GFLOPS: 0.441023
> > >
> > > * GCC 6.2.0
> > >
> > > Strangely enough, the single-locale performance is a bit lower than
> > > with the previous compiler, while the multi-locale, one-node
> > > performance is about the same.
> > >
> > > single-locale:
> > >
> > > convolveIndices, n: 536870912
> > > Time: 0.263151s
> > > GFLOPS: 6.12049
> > >
> > > convolveZip, n: 536870912
> > > Time: 0.262234s
> > > GFLOPS: 6.14189
> > >
> > > multi-locale, one node:
> > >
> > > convolveIndices, n: 536870912
> > > Time: 3.12716s
> > > GFLOPS: 0.515039
> > >
> > > convolveZip, n: 536870912
> > > Time: 3.58663s
> > > GFLOPS: 0.44906
> > >
> > > The conclusion is that the compiler indeed has a large impact on
> > > multi-locale performance, but probably only in simple cases such as
> > > vector addition. With the stencil code, although it is not very
> > > complicated, the performance falls back into the pattern that I came
> > > across originally.
> > > However, perhaps this gives you an idea of the optimizations that
> > > impact the performance? If we can't find a solution, I would at least
> > > like to understand the lack of performance.
> > >
> > > I also checked the performance of the stencils by not using the
> > > StencilDist but just the BlockDist and it makes no difference.
> > >
> > > > You may also want to consider setting CHPL_TARGET_ARCH to something
> > > > else if you're compiling on a machine architecture different from
> > > > the compute nodes. There's more information about CHPL_TARGET_ARCH
> > > > here:
> > > >
> > > > http://chapel.cray.com/docs/latest/usingchapel/chplenv.html#chpl-target-arch
> > >
> > > The head-node and compute-nodes are all Intel Xeon Westmeres, so I
> > > don't think that makes a difference. To be absolutely sure, I also
> > > compiled Chapel and the applications on a compute node and indeed,
> > > the performance is comparable to all measurements above.
> > >
> > > Kind regards,
> > >
> > > Pieter Hijma
> > >
> > > > On 11/7/16, 2:16 AM, "Pieter Hijma" <[email protected]> wrote:
> > > >
> > > > Dear Ben,
> > > >
> > > > Sorry for my late replies. Unfortunately, for some reason, these
> > > > emails are marked as spam even though I marked the list and your
> > > > address as safe. I will make sure I check my spam folders
> > > > meticulously from now on.
> > > >
> > > > On 28/10/16 23:34, Ben Harshbarger wrote:
> > > > > Hi Pieter,
> > > > >
> > > > > Sorry that you're still having issues. I think we'll need some
> > > > > more information before going forward:
> > > > >
> > > > > 1) Could you send us the output of
> > > > > "$CHPL_HOME/util/printchplenv --anonymize" ? It's a script that
> > > > > displays the various CHPL_ environment variables. "--anonymize"
> > > > > strips the output of information you may prefer to keep private
> > > > > (machine info, paths).
> > > >
> > > > This would be the setup if running single-locale programs:
> > > >
> > > > $ printchplenv --anonymize
> > > > CHPL_TARGET_PLATFORM: linux64
> > > > CHPL_TARGET_COMPILER: gnu
> > > > CHPL_TARGET_ARCH: native *
> > > > CHPL_LOCALE_MODEL: flat
> > > > CHPL_COMM: none
> > > > CHPL_TASKS: qthreads
> > > > CHPL_LAUNCHER: none
> > > > CHPL_TIMERS: generic
> > > > CHPL_UNWIND: none
> > > > CHPL_MEM: jemalloc
> > > > CHPL_MAKE: gmake
> > > > CHPL_ATOMICS: intrinsics
> > > > CHPL_GMP: gmp
> > > > CHPL_HWLOC: hwloc
> > > > CHPL_REGEXP: re2
> > > > CHPL_WIDE_POINTERS: struct
> > > > CHPL_AUX_FILESYS: none
> > > >
> > > > When I run multi-locale programs, I set the following environment
> > > > variables:
> > > >
> > > > export CHPL_COMM=gasnet
> > > > export CHPL_COMM_SUBSTRATE=ibv
> > > >
> > > > Then the Chapel environment would be:
> > > >
> > > > $ printchplenv --anonymize
> > > > CHPL_TARGET_PLATFORM: linux64
> > > > CHPL_TARGET_COMPILER: gnu
> > > > CHPL_TARGET_ARCH: native *
> > > > CHPL_LOCALE_MODEL: flat
> > > > CHPL_COMM: gasnet *
> > > > CHPL_COMM_SUBSTRATE: ibv *
> > > > CHPL_GASNET_SEGMENT: large
> > > > CHPL_TASKS: qthreads
> > > > CHPL_LAUNCHER: gasnetrun_ibv
> > > > CHPL_TIMERS: generic
> > > > CHPL_UNWIND: none
> > > > CHPL_MEM: jemalloc
> > > > CHPL_MAKE: gmake
> > > > CHPL_ATOMICS: intrinsics
> > > > CHPL_NETWORK_ATOMICS: none
> > > > CHPL_GMP: gmp
> > > > CHPL_HWLOC: hwloc
> > > > CHPL_REGEXP: re2
> > > > CHPL_WIDE_POINTERS: struct
> > > > CHPL_AUX_FILESYS: none
> > > >
> > > > > 2) What C compiler are you using?
> > > > $ gcc --version
> > > > gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16)
> > > > Copyright (C) 2010 Free Software Foundation, Inc.
> > > > This is free software; see the source for copying conditions. There
> > > > is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
> > > > PARTICULAR PURPOSE.
> > > >
> > > > > 3) Are you sure that the programs are being launched correctly?
> > > > > This might seem silly, but it's worth double-checking that the
> > > > > programs are actually running on the same hardware (not
> > > > > necessarily the same node though).
> > > >
> > > > I am completely certain that the single-locale program, the
> > > > multi-locale program for one node, and the multi-locale for
> > > > multiple nodes are running on the compute nodes. I'm not completely
> > > > sure what you mean by "the same hardware". All compute nodes have
> > > > the same hardware if that is what you mean.
> > > >
> > > > > I'd also like to clarify what you mean by "multi-locale
> > > > > compiled". Is the difference between the programs just the use of
> > > > > the Block domain map, or do you compile with different
> > > > > environment variables set?
> > > >
> > > > I compile different programs and I use different environment
> > > > variables:
> > > >
> > > > The single-locale version vectoradd is located in the datapar
> > > > directory, whereas the multi-locale version is located in the
> > > > datapar-dist directory. What follows is the diff for the .chpl
> > > > file:
> > > >
> > > > $ diff datapar/vectoradd.chpl datapar-dist/vectoradd.chpl
> > > > 8c8
> > > > < const ProblemDomain : domain(1) = {0..#n};
> > > > ---
> > > > > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
> > > >
> > > > The diff for the Makefile:
> > > >
> > > > $ diff datapar/Makefile datapar-dist/Makefile
> > > > 2a3
> > > > > DIST_FLAGS = CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv
> > > > 8c9
> > > > < $(CHPL) -o $@ $(FLAGS) $<
> > > > ---
> > > > > $(DIST_FLAGS) $(CHPL) -o $@ $(FLAGS) $<
> > > > 11c12
> > > > < rm -f $(APP)
> > > > ---
> > > > > rm -f $(APP) $(APP)_real
> > > >
> > > > Thanks for your help, and again my apologies for the delayed
> > > > answers.
> > > >
> > > > Kind regards,
> > > >
> > > > Pieter Hijma
> > > >
> > > > > -Ben Harshbarger
> > > > >
> > > > > On 10/27/16, 5:19 AM, "Pieter Hijma" <[email protected]> wrote:
> > > > >
> > > > > Hi Ben,
> > > > >
> > > > > Thank you for your fast reply and suggestions! I did some more
> > > > > tests and also included stencil operations.
> > > > > First, the vector addition:
> > > > >
> > > > > vectoradd.chpl
> > > > > --------------
> > > > > use Time;
> > > > > use Random;
> > > > > use BlockDist;
> > > > > //use VisualDebug;
> > > > >
> > > > > config const n = 1024**3/2;
> > > > >
> > > > > // for multi-locale
> > > > > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) = {0..#n};
> > > > > // for single-locale
> > > > > // const ProblemDomain : domain(1) = {0..#n};
> > > > >
> > > > > type float = real(32);
> > > > >
> > > > > proc addNoDomain(c : [] float, a : [] float, b : [] float) {
> > > > >   forall (ci, ai, bi) in zip(c, a, b) {
> > > > >     ci = ai + bi;
> > > > >   }
> > > > > }
> > > > >
> > > > > proc addZip(c : [ProblemDomain] float, a : [ProblemDomain] float,
> > > > >             b : [ProblemDomain] float) {
> > > > >   forall (ci, ai, bi) in zip(c, a, b) {
> > > > >     ci = ai + bi;
> > > > >   }
> > > > > }
> > > > >
> > > > > proc addForall(c : [ProblemDomain] float, a : [ProblemDomain] float,
> > > > >                b : [ProblemDomain] float) {
> > > > >   //startVdebug("vdata");
> > > > >   forall i in ProblemDomain {
> > > > >     c[i] = a[i] + b[i];
> > > > >   }
> > > > >   //stopVdebug();
> > > > > }
> > > > >
> > > > > proc addCollective(c : [ProblemDomain] float, a : [ProblemDomain] float,
> > > > >                    b : [ProblemDomain] float) {
> > > > >   c = a + b;
> > > > > }
> > > > >
> > > > > proc output(t : Timer, n, testName) {
> > > > >   t.stop();
> > > > >   writeln(testName, " n: ", n);
> > > > >   writeln("Time: ", t.elapsed(), "s");
> > > > >   writeln("GFLOPS: ", n / t.elapsed() / 1e9);
> > > > >   writeln();
> > > > >   t.clear();
> > > > > }
> > > > >
> > > > > proc main() {
> > > > >   var c : [ProblemDomain] float;
> > > > >   var a : [ProblemDomain] float;
> > > > >   var b : [ProblemDomain] float;
> > > > >   var t : Timer;
> > > > >
> > > > >   fillRandom(a, 0);
> > > > >   fillRandom(b, 42);
> > > > >
> > > > >   t.start();
> > > > >   addNoDomain(c, a, b);
> > > > >   output(t, n, "addNoDomain");
> > > > >
> > > > >   t.start();
> > > > >   addZip(c, a, b);
> > > > >   output(t, n, "addZip");
> > > > >
> > > > >   t.start();
> > > > >   addForall(c, a, b);
> > > > >   output(t, n, "addForall");
> > > > >
> > > > >   t.start();
> > > > >   addCollective(c, a, b);
> > > > >   output(t, n, "addCollective");
> > > > > }
> > > > > -----
> > > > >
> > > > > On a single locale I get as output:
> > > > >
> > > > > addNoDomain n: 536870912
> > > > > Time: 0.27961s
> > > > > GFLOPS: 1.92007
> > > > >
> > > > > addZip n: 536870912
> > > > > Time: 0.278657s
> > > > > GFLOPS: 1.92664
> > > > >
> > > > > addForall n: 536870912
> > > > > Time: 0.278015s
> > > > > GFLOPS: 1.93109
> > > > >
> > > > > addCollective n: 536870912
> > > > > Time: 0.278379s
> > > > > GFLOPS: 1.92856
> > > > >
> > > > > On multi-locale (-nl 1) I get as output:
> > > > >
> > > > > addNoDomain n: 536870912
> > > > > Time: 2.16806s
> > > > > GFLOPS: 0.247627
> > > > >
> > > > > addZip n: 536870912
> > > > > Time: 2.17024s
> > > > > GFLOPS: 0.247378
> > > > >
> > > > > addForall n: 536870912
> > > > > Time: 4.78443s
> > > > > GFLOPS: 0.112212
> > > > >
> > > > > addCollective n: 536870912
> > > > > Time: 2.19838s
> > > > > GFLOPS: 0.244212
> > > > >
> > > > > So, indeed, your suggestion improves it by more than a factor of
> > > > > two, but it is still close to a factor of 8 slower than
> > > > > single-locale.
> > > > >
> > > > > I also used chplvis and verified that there are no gets and puts
> > > > > when running multi-locale with more than one node.
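As a cross-check on the chplvis observation above, the same thing can be counted in-program. A minimal sketch, not from the original thread, assuming the standard CommDiagnostics module's start/stop/get interface:

```
use CommDiagnostics;

// Wrap the kernel of interest; on -nl 1 the get/put counts should come
// back (near-)zero if the slowdown is pure local overhead rather than
// actual communication.
startCommDiagnostics();
addForall(c, a, b);
stopCommDiagnostics();
writeln(getCommDiagnostics());  // one entry per locale
```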
> > > > > The profiling information is clear, but not very helpful (to me):
> > > > >
> > > > > multi-locale (-nl 1):
> > > > >
> > > > > | 65.3451 | wrapcoforall_fn_chpl5 | vectoradd.chpl:26 |
> > > > > | 4.8777  | wrapon_fn_chpl35      | vectoradd.chpl:26 |
> > > > >
> > > > > single-locale:
> > > > >
> > > > > | 5.0019 | wrapcoforall_fn_chpl5 | vectoradd.chpl:26 |
> > > > >
> > > > > For stencil operations, I used the following program:
> > > > >
> > > > > 1d-convolution.chpl
> > > > > -------------------
> > > > > use Time;
> > > > > use Random;
> > > > > use StencilDist;
> > > > >
> > > > > config const n = 1024**3/2;
> > > > >
> > > > > const ProblemDomain : domain(1) dmapped Stencil(boundingBox = {0..#n},
> > > > >                                                 fluff = (1,))
> > > > >   = {0..#n};
> > > > > const InnerDomain : subdomain(ProblemDomain) = {1..n-2};
> > > > >
> > > > > proc convolveIndices(output : [ProblemDomain] real(32),
> > > > >                      input : [ProblemDomain] real(32)) {
> > > > >   forall i in InnerDomain {
> > > > >     output[i] = ((input[i-1] + input[i] + input[i+1])/3:real(32));
> > > > >   }
> > > > > }
> > > > >
> > > > > proc convolveZip(output : [ProblemDomain] real(32),
> > > > >                  input : [ProblemDomain] real(32)) {
> > > > >   forall (im1, i, ip1) in zip(InnerDomain.translate(-1),
> > > > >                               InnerDomain,
> > > > >                               InnerDomain.translate(1)) {
> > > > >     output[i] = ((input[im1] + input[i] + input[ip1])/3:real(32));
> > > > >   }
> > > > > }
> > > > >
> > > > > proc print(t : Timer, n, s) {
> > > > >   t.stop();
> > > > >   writeln(s, ", n: ", n);
> > > > >   writeln("Time: ", t.elapsed(), "s");
> > > > >   writeln("GFLOPS: ", 3 * n / 1e9 / t.elapsed());
> > > > >   writeln();
> > > > >   t.clear();
> > > > > }
> > > > >
> > > > > proc main() {
> > > > >   var input : [ProblemDomain] real(32);
> > > > >   var output : [ProblemDomain] real(32);
> > > > >   var t : Timer;
> > > > >
> > > > >   fillRandom(input, 42);
> > > > >
> > > > >   t.start();
> > > > >   convolveIndices(output, input);
> > > > >   print(t, n, "convolveIndices");
> > > > >
> > > > >   t.start();
> > > > >   convolveZip(output, input);
> > > > >   print(t, n, "convolveZip");
> > > > > }
> > > > > ------
> > > > >
> > > > > Interestingly, in contrast to your earlier suggestion, the direct
> > > > > indexing works a bit better in this program than the zipped
> > > > > version:
> > > > >
> > > > > Multi-locale (-nl 1):
> > > > >
> > > > > convolveIndices, n: 536870912
> > > > > Time: 4.27148s
> > > > > GFLOPS: 0.377062
> > > > >
> > > > > convolveZip, n: 536870912
> > > > > Time: 4.87291s
> > > > > GFLOPS: 0.330524
> > > > >
> > > > > Single-locale:
> > > > >
> > > > > convolveIndices, n: 536870912
> > > > > Time: 0.548804s
> > > > > GFLOPS: 2.93477
> > > > >
> > > > > convolveZip, n: 536870912
> > > > > Time: 0.538754s
> > > > > GFLOPS: 2.98951
> > > > >
> > > > > Again, the multi-locale is about a factor of 8 slower than
> > > > > single-locale. By the way, the Stencil distribution is a bit
> > > > > faster than the Block distribution.
> > > > >
> > > > > Thanks in advance for your input,
> > > > >
> > > > > Pieter
> > > > >
> > > > > On 24/10/16 19:20, Ben Harshbarger wrote:
> > > > > > Hi Pieter,
> > > > > >
> > > > > > Thanks for providing the example, that's very helpful.
> > > > > >
> > > > > > Multi-locale performance in Chapel is not yet where we'd like
> > > > > > it to be, but we've done a lot of work over the past few
> > > > > > releases to get cases like yours performing well.
> > > > > > It's surprising that using Block results in that much of a
> > > > > > difference, but I think you would see better performance by
> > > > > > iterating over the arrays directly:
> > > > > >
> > > > > > ```
> > > > > > // replace the loop in the 'add' function with this:
> > > > > > forall (ci, ai, bi) in zip(c, a, b) {
> > > > > >   ci = ai + bi;
> > > > > > }
> > > > > > ```
> > > > > >
> > > > > > Block-distributed arrays can leverage the fast-follower
> > > > > > optimization to perform better when all arrays being iterated
> > > > > > over share the same domain. You can also write that loop in a
> > > > > > cleaner way by leveraging array promotion:
> > > > > >
> > > > > > ```
> > > > > > // This is equivalent to the first loop
> > > > > > c = a + b;
> > > > > > ```
> > > > > >
> > > > > > However, when I tried the promoted variation on my machine I
> > > > > > observed worse performance than the explicit forall-loop. It
> > > > > > seems to be related to the way the arguments of the 'add'
> > > > > > function are declared. If you replaced "[ProblemDomain] float"
> > > > > > with "[] float", performance seems to improve. That surprised a
> > > > > > couple of us on the development team, and I'll be looking at
> > > > > > that some more today.
> > > > > >
> > > > > > If you're still seeing significantly worse performance with
> > > > > > Block compared to the default rectangular domain, and the
> > > > > > programs are launched in the same way, that would be odd. You
> > > > > > could try profiling using chplvis. I agree though that there
> > > > > > shouldn't be any communication in this program. You can find
> > > > > > more information on chplvis here in the online 1.14 release
> > > > > > documentation:
> > > > > >
> > > > > > http://chapel.cray.com/docs/latest/tools/chplvis/chplvis.html
> > > > > >
> > > > > > I hope that rewriting the loops solves the problem, but let us
> > > > > > know if it doesn't and we can continue investigating.
> > > > > >
> > > > > > -Ben Harshbarger
> > > > > >
> > > > > > On 10/24/16, 6:19 AM, "Pieter Hijma" <[email protected]> wrote:
> > > > > >
> > > > > > Dear all,
> > > > > >
> > > > > > My apologies if this has already been asked before. I'm new to
> > > > > > the list and couldn't find it in the archives.
> > > > > >
> > > > > > I experience bad performance when running the multi-locale
> > > > > > compiled version on an InfiniBand-equipped cluster
> > > > > > (http://cs.vu.nl/das4/clusters.shtml, VU site), even with only
> > > > > > one node.
> > > > > > Below you find a minimal example that exhibits the same
> > > > > > performance problems as all my programs.
> > > > > >
> > > > > > I compiled chapel-1.14.0 with the following steps:
> > > > > >
> > > > > > export CHPL_TARGET_ARCH=native
> > > > > > make -j
> > > > > > export CHPL_COMM=gasnet
> > > > > > export CHPL_COMM_SUBSTRATE=ibv
> > > > > > make clean
> > > > > > make -j
> > > > > >
> > > > > > I compile the following Chapel code:
> > > > > >
> > > > > > vectoradd.chpl:
> > > > > > ---------------
> > > > > > use Time;
> > > > > > use Random;
> > > > > > use BlockDist;
> > > > > >
> > > > > > config const n = 1024**3;
> > > > > >
> > > > > > // for single-locale
> > > > > > // const ProblemDomain : domain(1) = {0..#n};
> > > > > > // for multi-locale
> > > > > > const ProblemDomain : domain(1) dmapped Block(boundingBox = {0..#n}) =
> > > > > >   {0..#n};
> > > > > >
> > > > > > type float = real(32);
> > > > > >
> > > > > > proc add(c : [ProblemDomain] float, a : [ProblemDomain] float,
> > > > > >          b : [ProblemDomain] float) {
> > > > > >   forall i in ProblemDomain {
> > > > > >     c[i] = a[i] + b[i];
> > > > > >   }
> > > > > > }
> > > > > >
> > > > > > proc main() {
> > > > > >   var c : [ProblemDomain] float;
> > > > > >   var a : [ProblemDomain] float;
> > > > > >   var b : [ProblemDomain] float;
> > > > > >   var t : Timer;
> > > > > >
> > > > > >   fillRandom(a, 0);
> > > > > >   fillRandom(b, 42);
> > > > > >
> > > > > >   t.start();
> > > > > >   add(c, a, b);
> > > > > >   t.stop();
> > > > > >
> > > > > >   writeln("n: ", n);
> > > > > >   writeln("Time: ", t.elapsed(), "s");
> > > > > >   writeln("GFLOPS: ", n / t.elapsed() / 1e9);
> > > > > > }
> > > > > > ----
> > > > > >
> > > > > > I compile this for single-locale (using no domain maps, see the
> > > > > > comment above in the source) with:
> > > > > >
> > > > > > chpl -o vectoradd --fast vectoradd.chpl
> > > > > >
> > > > > > I run it (on a dual quad-core node with 2 hardware threads per
> > > > > > core) with:
> > > > > >
> > > > > > export CHPL_RT_NUM_THREADS_PER_LOCALE=16
> > > > > > ./vectoradd
> > > > > >
> > > > > > And get as output:
> > > > > >
> > > > > > n: 1073741824
> > > > > > Time: 0.558806s
> > > > > > GFLOPS: 1.92149
> > > > > >
> > > > > > However, the performance for multi-locale is much worse.
> > > > > >
> > > > > > I compile this for multi-locale (using domain maps, see the
> > > > > > comment in the source) with:
> > > > > >
> > > > > > CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv chpl -o vectoradd --fast \
> > > > > >     vectoradd.chpl
> > > > > >
> > > > > > I run it on the same type of node with:
> > > > > >
> > > > > > SSH_SERVERS=`uniq $TMPDIR/machines | tr '\n' ' '`
> > > > > >
> > > > > > export GASNET_PHYSMEM_MAX=1G
> > > > > > export GASNET_IBV_SPAWNER=ssh
> > > > > > export GASNET_SSH_SERVERS="$SSH_SERVERS"
> > > > > >
> > > > > > export CHPL_RT_NUM_THREADS_PER_LOCALE=16
> > > > > > export CHPL_LAUNCHER=gasnetrun_ibv
> > > > > > export CHPL_COMM=gasnet
> > > > > > export CHPL_COMM_SUBSTRATE=ibv
> > > > > >
> > > > > > ./vectoradd -nl 1
> > > > > >
> > > > > > And get as output:
> > > > > >
> > > > > > n: 1073741824
> > > > > > Time: 8.65082s
> > > > > > GFLOPS: 0.12412
> > > > > >
> > > > > > I would understand a performance difference of say 10% because
> > > > > > of multi-locale execution, but not factors. Is this to be
> > > > > > expected from the current state of Chapel?
> > > > > > This performance difference is representative of basically all
> > > > > > my programs, which are also more realistic and use larger
> > > > > > inputs. The performance is strange as there is no communication
> > > > > > necessary (only one node) and the program is using the same
> > > > > > number of threads.
> > > > > >
> > > > > > Is there any way for me to investigate this, using profiling
> > > > > > for example?
> > > > > >
> > > > > > By the way, the program does scale well to multiple nodes
> > > > > > (which is not difficult given the baseline):
> > > > > >
> > > > > > 1  | 8.65s
> > > > > > 2  | 2.67s
> > > > > > 4  | 1.69s
> > > > > > 8  | 0.87s
> > > > > > 16 | 0.41s
> > > > > >
> > > > > > Thanks in advance for your input.
> > > > > >
> > > > > > Kind regards,
> > > > > >
> > > > > > Pieter Hijma
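A closing note on the comment-toggling between the single-locale and multi-locale ProblemDomain declarations in vectoradd.chpl above: a param-folded conditional can select the domain map from a single source file. A minimal sketch, not code from this thread, assuming a hypothetical `useBlock` switch set at compile time with `chpl -suseBlock=true`:

```
use BlockDist;

config const n = 1024**3;

// Hypothetical compile-time switch; only the chosen branch should be
// resolved, so the single-locale build keeps the default domain.
config param useBlock = false;

const ProblemDomain = if useBlock
  then {0..#n} dmapped Block(boundingBox = {0..#n})
  else {0..#n};
```

This would avoid editing the source between the single-locale and multi-locale experiments, and rule out accidental mismatches between the two directories.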
