Following up on this a bit:
I've redesigned the program to load data on each locale directly instead
of using network traffic to populate the world. It now loads the entire
index across all partitions in about 3s. The app that fans the data out
across partitions is still super slow, though, so I guess the network
traffic side is going to take some time to figure out. In the end I used
rsync to move the data around and then booted the app, and that worked well.
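
For anyone skimming, the heart of the per-locale load is a pattern like
the following (a simplified sketch; the path scheme and the loadPartition
helper are stand-ins for what the real code does):

    use IO;

    // placeholder helper: the real code parses records and
    // populates the local index
    proc loadPartition(path: string) {
      var f = open(path, iomode.r);
      var r = f.reader();
      // ... read and index records here ...
      r.close();
      f.close();
    }

    // each locale reads its own partition from local disk,
    // so the load itself generates no cross-locale traffic
    coforall loc in Locales do on loc do
      loadPartition("/data/partition-" + here.id:string);
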
I'm happy to help diagnose whatever I can on the network traffic side. If
anyone is curious, you can see my code here:
fanout:
https://github.com/briangu/libev-examples/blob/master/chapel/fanout.chpl
search:
https://github.com/briangu/libev-examples/blob/master/chapel/search.chpl
Now that I have a usable data-load scenario, I'll probably move on to the
parts of my problem that have been blocked...
Thanks!!
On Thu, May 14, 2015 at 11:58 AM, Brad Chamberlain <[email protected]> wrote:
>
> Hi Brian --
>
> Note that there's a 'make clobber' in addition to 'make clean'. In
> principle, 'make clean' is meant to remove unnecessary intermediate files
> but leave things in a working condition while 'make clobber' is meant to
> return things close(r) to what's in the repo. I've been stuck in meetings
> the past few days so haven't even had a chance to read this thread yet,
> though it does sound, on the surface, like there may be a bug. Maybe in
> 'make clean' but perhaps more likely in calculating dependences.
>
> -Brad
>
>
>
> On Thu, 14 May 2015, Brian Guarraci wrote:
>
>> OK, so I started from a clean distribution and CHPL_TARGET_ARCH=native is
>> now running w/o the CHPL_MEM error. I think there must be a bug somewhere
>> in the make clean (if I were using a git repo, I would have used git clean
>> -fdx and probably avoided this problem). I'm glad it's a build issue.
>>
>> Now looking into perf issues w/ the binary compiled with --fast.
>>
>> On Tue, May 12, 2015 at 8:12 PM, Brian Guarraci <[email protected]> wrote:
>>
>>> Yesterday, I did multiple full clean builds (for various combos),
>>> suspecting what you suggest. I think there are some bugs, and I need to
>>> dig in to provide better clues for this list. I'm using the official
>>> 1.11.0 build.
>>>
>>> One strange symptom was that using -nl 1 showed this error, but -nl 16
>>> just hung indefinitely, with no CPU or network activity. I tried this
>>> with no optimizations as well, same result. It could also be partly
>>> related to the weird GASNet issue I mentioned.
>>>
>>> On May 12, 2015, at 7:40 PM, Michael Ferguson <[email protected]> wrote:
>>>
>>>>
>>>> Hi Brian -
>>>>
>>>> I've seen errors like that if I don't run 'make' again in the
>>>> compiler directory after setting the CHPL_TARGET_ARCH environment
>>>> variable. You need to run 'make' again since the runtime builds
>>>> with the target architecture setting.
>>>>
>>>> Of course this and the CHPL_MEM errors could be bugs...
>>>>
>>>> Hope that helps,
>>>>
>>>> -michael
>>>>
>>>> On 5/12/15, 10:03 AM, "Brian Guarraci" <[email protected]> wrote:
>>>>>
>>>>> Quick, partial follow-up:
>>>>>
>>>>> I recompiled Chapel with CHPL_TARGET_ARCH=native and hit the following
>>>>> error:
>>>>>
>>>>> /home/ubuntu/src/chapel-1.11.0/runtime/etc/Makefile.comm-gasnet:19:
>>>>> /home/ubuntu/src/chapel-1.11.0/third-party/gasnet/install/linux32-gnu-native/seg-everything/nodbg/include/udp-conduit/udp-par.mak:
>>>>> No such file or directory
>>>>> make: *** No rule to make target
>>>>> `/home/ubuntu/src/chapel-1.11.0/third-party/gasnet/install/linux32-gnu-native/seg-everything/nodbg/include/udp-conduit/udp-par.mak'.
>>>>> Stop.
>>>>> error: compiling generated source
>>>>>
>>>>> I tweaked the gasnet folder names, but haven't gotten it to work yet.
>>>>> Subsequent attempts to run the compiled program resulted in "set
>>>>> CHPL_MEM to a more appropriate mem type". I tried setting the mem type
>>>>> to a few different choices, but that didn't help. Needs more
>>>>> investigation.
>>>>>
>>>>>
>>>>> I did manage to run some local --fast tests, though: the --local
>>>>> version ran in 5s while the --no-local version ran in about 12s.
>>>>>
>>>>> Additionally, I played around with the 'local' keyword, and that also
>>>>> seems to make some difference, at least locally. I need to try this on
>>>>> the distributed version when I stand it back up.
>>>>>
>>>>> On Mon, May 11, 2015 at 1:05 PM, Brad Chamberlain <[email protected]> wrote:
>>>>>>
>>>>>> For well-written programs, most of the --local vs. --no-local
>>>>>> differences show up as CPU overhead rather than network overhead.
>>>>>> I.e., we tend not to do unnecessary communications; we simply execute
>>>>>> extra scalar code to determine that communication is unnecessary, and
>>>>>> the presence of this code hinders the back-end compiler's ability to
>>>>>> optimize the per-node computation.
>>>>>
>>>>>>
>>>>>> Here's an example: A given array access like A[i] may not know whether
>>>>>> the access is local or remote, so will introduce communication-related
>>>>>> code to disambiguate. Even if that code doesn't generate communication,
>>>>>> it can be ugly enough to throw the back-end C compiler off.
>>>>>>
>>>>>> Some workarounds to deal with this (see the sketch after these bullets):
>>>>>>
>>>>>> * the 'local' block (documented in doc/release/technotes/README.local)
>>>>>> -- this is a big hammer and likely to be replaced with more data-
>>>>>> centric capabilities going forward, but can be helpful in the
>>>>>> meantime if you can get it working in a chunk of code.
>>>>>>
>>>>>> * I don't know how fully-fleshed out these features are, but there
>>>>>> are at least draft capabilities for .localAccess() and .localSlice()
>>>>>> methods on some array types to reduce overheads like the ones in
>>>>>> my simple example above. I.e., if I know that A[i] is local for a
>>>>>> given distributed array, A.localAccess(i) is likely to give better
>>>>>> performance.
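>>>>>>
>>>>>> To make that concrete, here's a rough, untested sketch of both ideas
>>>>>> (using the draft .localAccess() spelling I mentioned; the details may
>>>>>> shift as these features firm up):
>>>>>>
>>>>>>   use BlockDist;
>>>>>>
>>>>>>   const D = {1..1000} dmapped Block(boundingBox={1..1000});
>>>>>>   var A: [D] real;
>>>>>>
>>>>>>   forall i in D {
>>>>>>     A[i] = 1.0;              // may compile into local-vs-remote checks
>>>>>>     A.localAccess(i) = 2.0;  // asserts i is local; skips those checks
>>>>>>   }
>>>>>>
>>>>>>   var buf: [1..100] real;    // non-distributed, so purely local
>>>>>>   local {
>>>>>>     // within a 'local' block, the compiler may assume that
>>>>>>     // nothing inside communicates
>>>>>>     for x in buf do x += 1;
>>>>>>   }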
>>>>>>
>>>>>>
>>>>>> But maybe I should start with a higher-level question: What kinds of
>>>>>> data structures does your code use, and what types of idioms do you use
>>>>>> to get multi-locale executions going? (e.g., distributed arrays +
>>>>>> foralls? Or more manually-distributed data structures + task
>>>>>> parallelism + on-clauses?)
>>>>>>
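>>>>>> For concreteness, here's what the two styles look like in miniature
>>>>>> (rough and untested; the loop bodies are just stand-ins for real work):
>>>>>>
>>>>>>   use BlockDist;
>>>>>>
>>>>>>   const D = {1..100} dmapped Block(boundingBox={1..100});
>>>>>>   var A: [D] int;
>>>>>>
>>>>>>   // style 1: distributed array + forall
>>>>>>   forall (a, i) in zip(A, D) do a = 2 * i;
>>>>>>
>>>>>>   // style 2: manually-distributed work + task parallelism + on-clauses
>>>>>>   coforall loc in Locales do on loc {
>>>>>>     writeln("locale ", here.id, " would process its own chunk here");
>>>>>>   }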
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> -Brad
>>>>>>
>>>>>>
>>>>>> On Mon, 11 May 2015, Brian Guarraci wrote:
>>>>>>>
>>>>>>> I was aware of the ongoing progress in optimizing the comm layer, but
>>>>>>> I'll take a look at your docs. I'll also give the --local vs.
>>>>>>> --no-local experiment a try.
>>>>>>>
>>>>>>> I tested the network layer and saw my nodes were operating near peak
>>>>>>> network capacity, so it wasn't a transport issue. Yes, the CPUs (of
>>>>>>> which there are 4 per node) were nearly fully pegged. Considering the
>>>>>>> level of complexity of the code, I suspect it was mostly overhead. I
>>>>>>> was even looking for a way to pin the execution to a single proc, as
>>>>>>> I wondered if there was some kind of thrashing going on between procs.
>>>>>>> The funny thing was the more I tried to optimize the program to do
>>>>>>> less network traffic, the slower it got.
>>>>>>>
>>>>>>> On Mon, May 11, 2015 at 10:21 AM, Brad Chamberlain <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Hi Brian --
>>>>>>>>
>>>>>>>> I believe that setting CHPL_TARGET_ARCH to 'native' should get you
>>>>>>>> better results as long as you're not cross-compiling. Alternatively,
>>>>>>>> you can set it to 'none', which will squash the warning you're
>>>>>>>> getting. In any case, I wouldn't expect the lack of --specialize
>>>>>>>> optimizations to be the problem here (but if you're passing
>>>>>>>> components of --fast manually, you'd want to be sure to add -O in
>>>>>>>> addition to --no-checks).
>>>>>>>>
>>>>>>>> Generally speaking, Chapel programs compiled for --no-local
>>>>>>>> (multi-locale execution) tend to generate much worse per-node code
>>>>>>>> than those compiled for --local (single-locale execution), and this
>>>>>>>> is an area of active optimization effort. See the "Performance
>>>>>>>> Optimizations and Generated Code Improvements" release note slides
>>>>>>>> at:
>>>>>>>>
>>>>>>>> http://chapel.cray.com/download.html#releaseNotes
>>>>>>>>
>>>>>>>> http://chapel.cray.com/releaseNotes/1.11/06-PerfGenCode.pdf
>>>>>>>>
>>>>>>>> and particularly, the section entitled "the 'local field' pragma"
>>>>>>>> for more details on this effort (starts at slide 34).
>>>>>>>>
>>>>>>>> In a nutshell, the Chapel compiler conservatively assumes that
>>>>>>>> things are remote rather than local when in doubt (to emphasize
>>>>>>>> correctness over fast but incorrect programs), and then gets into
>>>>>>>> doubt far more often than it should. We're currently working on
>>>>>>>> tightening up this gap.
>>>>>>>>
>>>>>>>> This could explain the full difference in performance that you're
>>>>>>>> seeing, or something else may be happening. One way to check into
>>>>>>>> this might be to run a --local vs. --no-local execution with
>>>>>>>> CHPL_COMM=none to see how much overhead is added. The fact that all
>>>>>>>> CPUs are pegged is a good indication that you don't have a problem
>>>>>>>> with load balance or distributing data/computation across nodes, I'd
>>>>>>>> guess?
>>>>>>>>
>>>>>>>> -Brad
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, 11 May 2015, Brian Guarraci wrote:
>>>>>>>>
>>>>>>>>> I should add that I did supply --no-checks and that helped about
>>>>>>>>> 10%.
>>>>>>>>>
>>>>>>>>> On Mon, May 11, 2015 at 10:04 AM, Brian Guarraci <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> It says:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> warning: --specialize was set, but CHPL_TARGET_ARCH is 'unknown'.
>>>>>>>>>> If you want any specialization to occur please set
>>>>>>>>>> CHPL_TARGET_ARCH to a proper value.
>>>>>>>>>>
>>>>>>>>>> It's unclear which target arch is appropriate.
>>>>>>>>>>
>>>>>>>>>> On Mon, May 11, 2015 at 9:55 AM, Brad Chamberlain <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Hi Brian --
>>>>>>>>>>>
>>>>>>>>>>> Getting --fast working should definitely be the first priority.
>>>>>>>>>>> What about it fails to work?
>>>>>>>>>>>
>>>>>>>>>>> -Brad
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sun, 10 May 2015, Brian Guarraci wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I've been testing my search index on my 16-node ARM system and
>>>>>>>>>>>> have been running into some strange behavior. The cool part is
>>>>>>>>>>>> that the locale partitioning concept seems to work well; the
>>>>>>>>>>>> downside is that the system is very slow. I've rewritten the
>>>>>>>>>>>> approach a few different ways and haven't made a dent, so I
>>>>>>>>>>>> wanted to ask a few questions.
>>>>>>>>>>>>
>>>>>>>>>>>> On the ARM processors, I can only use FIFO tasking and can't
>>>>>>>>>>>> optimize (--fast doesn't work). Is this going to significantly
>>>>>>>>>>>> affect cross-locale performance?
>>>>>>>>>>>>
>>>>>>>>>>>> I've looked at the generated C code and tried to minimize the
>>>>>>>>>>>> _comm_ operations in core methods, but it doesn't seem to help.
>>>>>>>>>>>> Network usage is still quite low (100K/s) while the CPUs are
>>>>>>>>>>>> pegged. Are there any profiling tools I can use to understand
>>>>>>>>>>>> what might be going on here?
>>>>>>>>>>>>
>>>>>>>>>>>> Generally, on my laptop or a single node, I can index about
>>>>>>>>>>>> 1.1MM records in under 10s. With 16 nodes, it takes 10min to do
>>>>>>>>>>>> 100k records.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm wondering if there's some systemic issue at play here and
>>>>>>>>>>>> how I can further investigate.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>> Brian
>>>>>>>>>>>>
>>>>>>>>>>>
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers