Hi Brian --

Note that there's a 'make clobber' in addition to 'make clean'.  In 
principle, 'make clean' is meant to remove unnecessary intermediate files 
but leave things in a working condition, while 'make clobber' is meant to 
return things close(r) to what's in the repo.  I've been stuck in meetings 
the past few days, so I haven't even had a chance to read this thread yet, 
though it does sound, on the surface, like there may be a bug -- maybe in 
'make clean', but perhaps more likely in how dependencies are calculated.
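
As a toy illustration of that convention (a hypothetical Makefile, not 
Chapel's actual build rules): 'clean' deletes intermediates but keeps the 
built result usable, while 'clobber' also deletes the result:

```shell
# Hypothetical Makefile demonstrating the 'clean' vs. 'clobber' convention.
dir=$(mktemp -d) && cd "$dir"
printf 'app: main.o\n\tcat main.o > app\nmain.o: main.src\n\tcat main.src > main.o\nclean:\n\trm -f main.o\nclobber: clean\n\trm -f app\n' > Makefile
echo hello > main.src
make -s app       # builds the intermediate (main.o) and the result (app)
make -s clean     # removes the intermediate; 'app' is still usable
test -f app && echo "after clean: app present"
make -s clobber   # returns the directory close(r) to its original state
test -f app || echo "after clobber: app removed"
```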

-Brad


On Thu, 14 May 2015, Brian Guarraci wrote:

> OK, so I started from a clean distribution, and CHPL_TARGET_ARCH=native is now
> running without the CHPL_MEM error.  I think there must be a bug somewhere in
> 'make clean' (if I were using a git repo, I would have used 'git clean
> -fdx' and probably avoided this problem).  I'm glad it's a build issue.
>
> Now I'm looking into performance issues with the binary compiled with --fast.
>
> On Tue, May 12, 2015 at 8:12 PM, Brian Guarraci <[email protected]> wrote:
>
>> Yesterday, I did multiple full clean builds (for various combos),
>> suspecting what you suggest.  I think there are some bugs, and I need to dig
>> in to provide better clues for this list.  I'm using the official 1.11.0
>> build.
>>
>> One strange symptom was that using -nl 1 showed this error, but -nl 16 just
>> hung indefinitely, with no CPU or network activity.  I tried this with no
>> optimizations as well, with the same result.  It could also be partly
>> related to the weird gasnet issue I mentioned.
>>
>>> On May 12, 2015, at 7:40 PM, Michael Ferguson <[email protected]>
>> wrote:
>>>
>>> Hi Brian -
>>>
>>> I've seen errors like that if I don't run 'make' again in the
>>> compiler directory after setting the CHPL_TARGET_ARCH environment
>>> variable. You need to run 'make' again since the runtime builds
>>> with the target architecture setting.
>>>
>>> Of course this and the CHPL_MEM errors could be bugs...
>>>
>>> Hope that helps,
>>>
>>> -michael
>>>
>>>> On 5/12/15, 10:03 AM, "Brian Guarraci" <[email protected]> wrote:
>>>>
>>>> Quick, partial follow-up:
>>>>
>>>> I recompiled Chapel with CHPL_TARGET_ARCH=native and hit the following
>>>> error:
>>>>
>>>> /home/ubuntu/src/chapel-1.11.0/runtime/etc/Makefile.comm-gasnet:19: /home/ubuntu/src/chapel-1.11.0/third-party/gasnet/install/linux32-gnu-native/seg-everything/nodbg/include/udp-conduit/udp-par.mak: No such file or directory
>>>> make: *** No rule to make target `/home/ubuntu/src/chapel-1.11.0/third-party/gasnet/install/linux32-gnu-native/seg-everything/nodbg/include/udp-conduit/udp-par.mak'. Stop.
>>>> error: compiling generated source
>>>>
>>>> I tweaked the gasnet folder names, but haven't gotten it to work yet.
>>>> Subsequent attempts to run the compiled program resulted in "set CHPL_MEM
>>>> to a more appropriate mem type".  I tried setting the mem type to a few
>>>> different choices, but that didn't help.  This needs more investigation.
>>>>
>>>>
>>>> I did manage to run some local --fast tests, though, and the --local
>>>> version ran in 5s while the --no-local version ran in about 12s.
>>>>
>>>> Additionally, I played around with the 'local' keyword, and that also
>>>> seems to make some difference, at least locally.  I need to try this on
>>>> the distributed version when I stand it back up.
>>>>
>>>>> On Mon, May 11, 2015 at 1:05 PM, Brad Chamberlain <[email protected]>
>> wrote:
>>>>>
>>>>>
>>>>> For well-written programs, most of the --local vs. --no-local
>>>>> differences show up as CPU overhead rather than network overhead.  I.e.,
>>>>> we tend not to perform unnecessary communication; we simply execute extra
>>>>> scalar code to determine that communication is unnecessary, and the
>>>>> presence of this code hinders the back-end compiler's ability to
>>>>> optimize the per-node computation.
>>>>>
>>>>> Here's an example:  A given array access like A[i] may not be known to
>>>>> be local or remote at compile time, so the compiler will introduce
>>>>> communication-related code to disambiguate.  Even if that code doesn't
>>>>> generate communication, it can be ugly enough to throw the back-end
>>>>> C compiler off.
>>>>>
>>>>> Some workarounds to deal with this:
>>>>>
>>>>> * the 'local' block (documented in doc/release/technotes/README.local)
>>>>>  -- this is a big hammer and likely to be replaced with more data-
>>>>>  centric capabilities going forward, but it can be helpful in the
>>>>>  meantime if you can get it working in a chunk of code.
>>>>>
>>>>> * I don't know how fully-fleshed out these features are, but there
>>>>>  are at least draft capabilities for .localAccess() and .localSlice()
>>>>>  methods on some array types to reduce overheads like the ones in
>>>>>  my simple example above.  I.e., if I know that A[i] is local for a
>>>>>  given distributed array, A.localAccess(i) is likely to give better
>>>>>  performance.
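>>>>>
>>>>> As a hedged sketch of both workarounds (assuming a Block-distributed
>>>>> array via the BlockDist module; 'localSubdomain' and 'localAccess'
>>>>> availability may vary by version and array type, per the caveat above):
>>>>>
>>>>> ```chapel
>>>>> use BlockDist;
>>>>>
>>>>> config const n = 1000;
>>>>> const D = {1..n} dmapped Block(boundingBox={1..n});
>>>>> var A: [D] real;
>>>>>
>>>>> // Finer-grained: on each locale, touch only the portion it owns and
>>>>> // assert locality per access, skipping the is-it-remote? checks that
>>>>> // a plain A[i] would carry.
>>>>> coforall loc in Locales do on loc {
>>>>>   forall i in D.localSubdomain() do
>>>>>     A.localAccess(i) = i;
>>>>> }
>>>>>
>>>>> // Big hammer: assert that no communication occurs anywhere in the
>>>>> // block; the program halts at execution time if that's violated.
>>>>> on Locales[0] do local {
>>>>>   var sum = 0.0;
>>>>>   for i in D.localSubdomain() do
>>>>>     sum += A[i];
>>>>> }
>>>>> ```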
>>>>>
>>>>>
>>>>> But maybe I should start with a higher-level question:  What kinds of
>>>>> data structures does your code use, and what types of idioms do you use
>>>>> to get multi-locale executions going?  (e.g., distributed arrays +
>>>>> foralls?  Or more manually-distributed data structures + task parallelism
>>>>> + on-clauses?)
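>>>>>
>>>>> (For concreteness, the first idiom might look like this minimal,
>>>>> hypothetical sketch -- a Block-distributed array updated by a forall,
>>>>> where each iteration runs on the locale that owns its element:)
>>>>>
>>>>> ```chapel
>>>>> use BlockDist;
>>>>>
>>>>> config const n = 1000000;
>>>>> const D = {1..n} dmapped Block(boundingBox={1..n});
>>>>> var counts: [D] int;
>>>>>
>>>>> // Iterations are distributed across locales along with the data.
>>>>> forall i in D do
>>>>>   counts[i] = i % 64;
>>>>> ```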
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -Brad
>>>>>
>>>>>
>>>>>> On Mon, 11 May 2015, Brian Guarraci wrote:
>>>>>>
>>>>>> I was aware of the ongoing progress in optimizing the comm layer, but
>>>>>> I'll take a look at your docs.  I'll also give the --local vs. --no-local
>>>>>> experiment a try.
>>>>>>
>>>>>> I tested the network layer and saw my nodes were operating near peak
>>>>>> network capacity, so it wasn't a transport issue.  Yes, the CPUs (of
>>>>>> which there are 4 per node) were nearly fully pegged.  Considering the
>>>>>> level of complexity of the code, I suspect it was mostly overhead.  I
>>>>>> was even looking for a way to pin the execution to a single proc, as I
>>>>>> wondered whether there was some kind of thrashing going on between
>>>>>> procs.  The funny thing was that the more I tried to optimize the
>>>>>> program to do less network traffic, the slower it got.
>>>>>>
>>>>>> On Mon, May 11, 2015 at 10:21 AM, Brad Chamberlain <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi Brian --
>>>>>>>
>>>>>>> I believe that setting CHPL_TARGET_ARCH to 'native' should get you
>>>>>>> better results as long as you're not cross-compiling.  Alternatively,
>>>>>>> you can set it to 'none', which will squash the warning you're getting.
>>>>>>> In any case, I wouldn't expect the lack of --specialize optimizations
>>>>>>> to be the problem here (but if you're passing components of --fast
>>>>>>> manually, you'd want to be sure to add -O in addition to --no-checks).
>>>>>>>
>>>>>>> Generally speaking, Chapel programs compiled for --no-local
>>>>>>> (multi-locale execution) tend to generate much worse per-node code
>>>>>>> than those compiled for --local (single-locale execution), and this is
>>>>>>> an area of active optimization effort.  See the "Performance
>>>>>>> Optimizations and Generated Code Improvements" release note slides at:
>>>>>>>
>>>>>>>        http://chapel.cray.com/download.html#releaseNotes
>>>>>>>        http://chapel.cray.com/releaseNotes/1.11/06-PerfGenCode.pdf
>>>>>>>
>>>>>>> and particularly, the section entitled "the 'local field' pragma" for
>>>>>>> more
>>>>>>> details on this effort (starts at slide 34).
>>>>>>>
>>>>>>> In a nutshell, the Chapel compiler conservatively assumes that things
>>>>>>> are remote rather than local when in doubt (to emphasize correctness
>>>>>>> over fast-but-incorrect programs), and then it gets into doubt far
>>>>>>> more often than it should.  We're currently working on tightening up
>>>>>>> this gap.
>>>>>>>
>>>>>>> This could explain the full difference in performance that you're
>>>>>>> seeing, or something else may be happening.  One way to check into
>>>>>>> this might be to run a --local vs. --no-local execution with
>>>>>>> CHPL_COMM=none to see how much overhead is added.  The fact that all
>>>>>>> CPUs are pegged is a good indication that you don't have a problem
>>>>>>> with load balance or distributing data/computation across nodes, I'd
>>>>>>> guess?
>>>>>>>
>>>>>>> -Brad
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, 11 May 2015, Brian Guarraci wrote:
>>>>>>>
>>>>>>>> I should add that I did supply --no-checks, and that helped about 10%.
>>>>>>>>
>>>>>>>> On Mon, May 11, 2015 at 10:04 AM, Brian Guarraci <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> It says:
>>>>>>>>>
>>>>>>>>> warning: --specialize was set, but CHPL_TARGET_ARCH is 'unknown'.
>>>>>>>>> If you want any specialization to occur please set CHPL_TARGET_ARCH
>>>>>>>>> to a proper value.
>>>>>>>>>
>>>>>>>>> It's unclear which target arch is appropriate.
>>>>>>>>>
>>>>>>>>> On Mon, May 11, 2015 at 9:55 AM, Brad Chamberlain <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Hi Brian --
>>>>>>>>>>
>>>>>>>>>> Getting --fast working should definitely be the first priority.
>>>>>>>>>> What
>>>>>>>>>> about it fails to work?
>>>>>>>>>>
>>>>>>>>>> -Brad
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sun, 10 May 2015, Brian Guarraci wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I've been testing my search index on my 16-node ARM system and
>>>>>>>>>>> have been running into some strange behavior.  The cool part is
>>>>>>>>>>> that the locale partitioning concept seems to work well; the
>>>>>>>>>>> downside is that the system is very slow.  I've rewritten the
>>>>>>>>>>> approach a few different ways and haven't made a dent, so I wanted
>>>>>>>>>>> to ask a few questions.
>>>>>>>>>>>
>>>>>>>>>>> On the ARM processors, I can only use FIFO and can't optimize
>>>>>>>>>>> (--fast doesn't work).  Is this going to significantly affect
>>>>>>>>>>> cross-locale performance?
>>>>>>>>>>>
>>>>>>>>>>> I've looked at the generated C code and tried to minimize the
>>>>>>>>>>> _comm_ operations in core methods, but that doesn't seem to help.
>>>>>>>>>>> Network usage is still quite low (100K/s) while the CPUs are
>>>>>>>>>>> pegged.  Are there any profiling tools I can use to understand
>>>>>>>>>>> what might be going on here?
>>>>>>>>>>>
>>>>>>>>>>> Generally, on my laptop or a single node, I can index about 1.1MM
>>>>>>>>>>> records in under 10s.  With 16 nodes, it takes 10min to do 100k
>>>>>>>>>>> records.
>>>>>>>>>>>
>>>>>>>>>>> I'm wondering if there's some systemic issue at play here and how
>>>>>>>>>>> I can investigate further.
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>> Brian
>>>
>>
>

_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers
