A problem with make clean and different configurations isn't too
surprising.
By the way, you can always 'make clobber' to do even more cleaning.
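
For example (assuming CHPL_HOME points at the top of the source tree):

  cd $CHPL_HOME
  make clobber   # more thorough than 'make clean'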

-michael

On 5/14/15, 12:03 PM, "Brian Guarraci" <[email protected]> wrote:

>OK, so I started from a clean distribution, and CHPL_TARGET_ARCH=native
>is now running w/o the CHPL_MEM error.  I think there must be a bug
>somewhere in the make clean (if I were using a git repo, I would have
>used git clean -fdx and probably avoided this problem).  I'm glad it's a
>build issue.
>
>Now looking into perf issues w/ binary compiled with --fast.
>
>
>On Tue, May 12, 2015 at 8:12 PM, Brian Guarraci
><[email protected]> wrote:
>
>Yesterday, I did multiple full clean builds (for various combos),
>suspecting what you suggest.  I think there are some bugs, and I need to
>dig in to provide better clues for this list.  I'm using the official
>1.11.0 build.
>
>One strange symptom was that running with -nl 1 showed this error, but
>-nl 16 just hung indefinitely, with no CPU or network activity.  I tried
>this with no optimizations as well, with the same result.  It could also
>be partly related to the weird gasnet issue I mentioned.
>
>> On May 12, 2015, at 7:40 PM, Michael Ferguson <[email protected]>
>>wrote:
>>
>> Hi Brian -
>>
>> I've seen errors like that if I don't run 'make' again in the
>> compiler directory after setting the CHPL_TARGET_ARCH environment
>> variable. You need to run 'make' again since the runtime builds
>> with the target architecture setting.
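>>
>> For example (a sketch; assumes CHPL_HOME points at the source tree):
>>
>>   export CHPL_TARGET_ARCH=native
>>   cd $CHPL_HOME && make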
>>
>> Of course this and the CHPL_MEM errors could be bugs...
>>
>> Hope that helps,
>>
>> -michael
>>
>>> On 5/12/15, 10:03 AM, "Brian Guarraci" <[email protected]> wrote:
>>>
>>> Quick, partial follow-up:
>>>
>>> I recompiled Chapel with CHPL_TARGET_ARCH=native and hit the following
>>> error:
>>>
>>> /home/ubuntu/src/chapel-1.11.0/runtime/etc/Makefile.comm-gasnet:19:
>>> /home/ubuntu/src/chapel-1.11.0/third-party/gasnet/install/linux32-gnu-native/seg-everything/nodbg/include/udp-conduit/udp-par.mak: No such file or directory
>>> make: *** No rule to make target
>>> `/home/ubuntu/src/chapel-1.11.0/third-party/gasnet/install/linux32-gnu-native/seg-everything/nodbg/include/udp-conduit/udp-par.mak'. Stop.
>>> error: compiling generated source
>>>
>>> I tweaked the gasnet folder names, but haven't gotten it to work yet.
>>> Subsequent attempts to run the compiled program resulted in "set
>>> CHPL_MEM to a more appropriate mem type".  I tried setting the mem type
>>> to a few different choices, but it didn't help.  Needs more
>>> investigation.
>>>
>>>
>>> I did manage to run some local --fast tests, though, and the --local
>>> version ran in 5s while the --no-local version ran in about 12s.
>>>
>>> Additionally, I played around with the local keyword and that also
>>>seems
>>> to make some difference, at least locally.  I need to try this on the
>>> distributed version when I stand it back up.
>>>
>>>> On Mon, May 11, 2015 at 1:05 PM, Brad Chamberlain <[email protected]>
>>>>wrote:
>>>>
>>>>
>>>> For well-written programs, most of the --local vs. --no-local
>>>> differences show up as CPU overhead rather than network overhead.
>>>> I.e., we tend not to do unnecessary communication; we simply execute
>>>> extra scalar code to determine that communication is unnecessary, and
>>>> the presence of this code hinders the back-end compiler's ability to
>>>> optimize the per-node computation.
>>>>
>>>> Here's an example:  A given array access like A[i] may not know
>>>> whether the access is local or remote, so will introduce
>>>> communication-related code to disambiguate.  Even if that code doesn't
>>>> generate communication, it can be ugly enough to throw the back-end
>>>> C compiler off.
>>>>
>>>> Some workarounds to deal with this:
>>>>
>>>> * the 'local' block (documented in doc/release/technotes/README.local)
>>>>  -- this is a big hammer and likely to be replaced with more data-
>>>>  centric capabilities going forward, but can be helpful in the
>>>>  meantime if you can get it working in a chunk of code.
>>>>
>>>> * I don't know how fully-fleshed out these features are, but there
>>>>  are at least draft capabilities for .localAccess() and .localSlice()
>>>>  methods on some array types to reduce overheads like the ones in
>>>>  my simple example above.  I.e., if I know that A[i] is local for a
>>>>  given distributed array, A.localAccess(i) is likely to give better
>>>>  performance.
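>>>>
>>>> As a minimal sketch of both (Block distribution and the names here
>>>> are placeholders, and localAccess is the draft feature mentioned
>>>> above, so its exact form may differ):
>>>>
>>>>   use BlockDist;
>>>>   config const n = 1000000;
>>>>   const D = {1..n} dmapped Block(boundingBox={1..n});
>>>>   var A: [D] real;
>>>>
>>>>   // a forall over a Block-distributed domain runs each iteration on
>>>>   // the locale that owns i, so A[i] is local; the 'local' block
>>>>   // asserts this, letting the compiler drop the locality checks:
>>>>   forall i in D {
>>>>     local {
>>>>       A[i] = i;
>>>>     }
>>>>   }
>>>>
>>>>   // the draft localAccess method makes the same assertion for a
>>>>   // single access rather than a whole block of code:
>>>>   forall i in D do
>>>>     A.localAccess(i) = i;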
>>>>
>>>>
>>>> But maybe I should start with a higher-level question:  What kinds of
>>>> data structures does your code use, and what types of idioms do you
>>>> use to get multi-locale executions going?  (e.g., distributed arrays +
>>>> foralls?  Or more manually-distributed data structures + task
>>>> parallelism + on-clauses?)
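>>>>
>>>> For concreteness, a sketch of the two idioms (names and sizes are
>>>> placeholders):
>>>>
>>>>   use BlockDist;
>>>>   config const n = 1000;
>>>>
>>>>   // distributed array + forall:
>>>>   var A: [{1..n} dmapped Block(boundingBox={1..n})] real;
>>>>   forall a in A do a += 1;
>>>>
>>>>   // manually-distributed data + task parallelism + on-clauses:
>>>>   coforall loc in Locales do on loc {
>>>>     var myChunk: [1..n/numLocales] real;
>>>>     // ... compute on myChunk, communicating explicitly ...
>>>>   }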
>>>>
>>>> Thanks,
>>>>
>>>> -Brad
>>>>
>>>>
>>>>> On Mon, 11 May 2015, Brian Guarraci wrote:
>>>>>
>>>>> I was aware of the ongoing progress in optimizing the comm, but I'll
>>>>> take a look at your docs.  I'll also give the --local vs. --no-local
>>>>> experiment a try.
>>>>>
>>>>> I tested the network layer and saw my nodes were operating near peak
>>>>> network capacity, so it wasn't a transport issue.  Yes, the CPUs (of
>>>>> which there are 4 per node) were nearly fully pegged.  Considering the
>>>>> level of complexity of the code, I suspect it was mostly overhead.  I
>>>>> was even looking for a way to pin the execution to a single proc, as I
>>>>> wondered whether there was some kind of thrashing going on between
>>>>> procs.  The funny thing was that the more I tried to optimize the
>>>>> program to do less network traffic, the slower it got.
>>>>>
>>>>> On Mon, May 11, 2015 at 10:21 AM, Brad Chamberlain <[email protected]>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Hi Brian --
>>>>>>
>>>>>> I believe that setting CHPL_TARGET_ARCH to 'native' should get you
>>>>>> better results as long as you're not cross-compiling.  Alternatively,
>>>>>> you can set it to 'none', which will squash the warning you're
>>>>>> getting.  In any case, I wouldn't expect the lack of --specialize
>>>>>> optimizations to be the problem here (but if you're passing
>>>>>> components of --fast manually, you'd want to be sure to add -O in
>>>>>> addition to --no-checks).
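>>>>>>
>>>>>> For example (a sketch; 'myprog.chpl' is a placeholder, and the exact
>>>>>> set of flags that --fast bundles may vary by release):
>>>>>>
>>>>>>   chpl --fast myprog.chpl
>>>>>>   # roughly equivalent to:
>>>>>>   chpl -O --no-checks --specialize myprog.chpl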
>>>>>>
>>>>>> Generally speaking, Chapel programs compiled for --no-local
>>>>>> (multi-locale execution) tend to generate much worse per-node code
>>>>>> than those compiled for --local (single-locale execution), and this
>>>>>> is an area of active optimization effort.  See the "Performance
>>>>>> Optimizations and Generated Code Improvements" release note slides
>>>>>> at:
>>>>>>
>>>>>>   http://chapel.cray.com/download.html#releaseNotes
>>>>>>   http://chapel.cray.com/releaseNotes/1.11/06-PerfGenCode.pdf
>>>>>>
>>>>>> and particularly, the section entitled "the 'local field' pragma"
>>>>>> for more details on this effort (starts at slide 34).
>>>>>>
>>>>>> In a nutshell, the Chapel compiler conservatively assumes that
>>>>>> things are remote rather than local when in doubt (to emphasize
>>>>>> correctness over fast-but-incorrect programs), and then gets into
>>>>>> doubt far more often than it should.  We're currently working on
>>>>>> tightening up this gap.
>>>>>>
>>>>>> This could explain the full difference in performance that you're
>>>>>> seeing, or something else may be happening.  One way to check into
>>>>>> this might be to run a --local vs. --no-local execution with
>>>>>> CHPL_COMM=none to see how much overhead is added.  The fact that all
>>>>>> CPUs are pegged is a good indication that you don't have a problem
>>>>>> with load balance or distributing data/computation across nodes, I'd
>>>>>> guess?
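>>>>>>
>>>>>> For example (a sketch; 'myprog.chpl' is a placeholder, and changing
>>>>>> CHPL_COMM requires rebuilding the runtime first):
>>>>>>
>>>>>>   export CHPL_COMM=none
>>>>>>   (cd $CHPL_HOME && make)
>>>>>>   chpl --local myprog.chpl -o myprog-local && ./myprog-local
>>>>>>   chpl --no-local myprog.chpl -o myprog-nolocal && ./myprog-nolocal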
>>>>>>
>>>>>> -Brad
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, 11 May 2015, Brian Guarraci wrote:
>>>>>>
>>>>>>> I should add that I did supply --no-checks and that helped about
>>>>>>> 10%.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, May 11, 2015 at 10:04 AM, Brian Guarraci <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> It says:
>>>>>>>>
>>>>>>>>
>>>>>>>> warning: --specialize was set, but CHPL_TARGET_ARCH is 'unknown'.
>>>>>>>> If you want any specialization to occur please set CHPL_TARGET_ARCH
>>>>>>>> to a proper value.
>>>>>>>>
>>>>>>>> It's unclear which target arch is appropriate.
>>>>>>>>
>>>>>>>> On Mon, May 11, 2015 at 9:55 AM, Brad Chamberlain <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hi Brian --
>>>>>>>>>
>>>>>>>>> Getting --fast working should definitely be the first priority.
>>>>>>>>> What about it fails to work?
>>>>>>>>>
>>>>>>>>> -Brad
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, 10 May 2015, Brian Guarraci wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I've been testing my search index on my 16-node ARM system and
>>>>>>>>>> have been running into some strange behavior.  The cool part is
>>>>>>>>>> that the locale partitioning concept seems to work well; the
>>>>>>>>>> downside is that the system is very slow.  I've rewritten the
>>>>>>>>>> approach a few different ways and haven't made a dent, so I
>>>>>>>>>> wanted to ask a few questions.
>>>>>>>>>>
>>>>>>>>>> On the ARM processors, I can only use FIFO and can't optimize
>>>>>>>>>> (--fast doesn't work).  Is this going to significantly affect
>>>>>>>>>> cross-locale performance?
>>>>>>>>>>
>>>>>>>>>> I've looked at the generated C code and tried to minimize the
>>>>>>>>>> _comm_ operations in core methods, but it doesn't seem to help.
>>>>>>>>>> Network usage is still quite low (100K/s) while CPUs are pegged.
>>>>>>>>>> Are there any profiling tools I can use to understand what might
>>>>>>>>>> be going on here?
>>>>>>>>>>
>>>>>>>>>> Generally, on my laptop or single node, I can index about 1.1MM
>>>>>>>>>> records in under 10s.  With 16 nodes, it takes 10min to do 100k
>>>>>>>>>> records.
>>>>>>>>>>
>>>>>>>>>> I'm wondering if there's some systemic issue at play here and
>>>>>>>>>> how I can investigate further.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> Brian

