I've managed to create a simple example that seems to capture the
general, but ultra-simplified, access pattern that my search index is
doing. The example runs about as slowly, and it's basically not doing much
other than allocating nodes on a linked list and incrementing an atomic.
I'd really love to know what naive thing I'm doing that turns out to be
really slow, and what I can do about it.
https://github.com/briangu/libev-examples/blob/_bg_p2/chapel/crosstalk.chpl
use Logging, Memory, IO, Partitions, Time;

class Node {
  var word: string;
  var next: Node;
}

class PartitionInfo {
  var head: Node;
  var count: atomic int;
}

class WordIndex {
  var wordIndex: [0..Partitions.size-1] PartitionInfo;

  proc WordIndex() {
    // allocate each partition's info on its owning locale
    for i in wordIndex.domain {
      on Partitions[i] {
        writeln("adding ", i);
        wordIndex[i] = new PartitionInfo();
      }
    }
  }

  proc indexWord(word: string) {
    var partition = partitionForWord(word);
    var info = wordIndex[partition];
    // hop to the owning locale, prepend the word, bump the count
    on info {
      info.head = new Node(word, info.head);
      info.count.add(1);
    }
  }
}

proc main() {
  initPartitions();
  var wordIndex = new WordIndex();
  var t: Timer;
  t.start();
  var infile = open("words.txt", iomode.r);
  var reader = infile.reader();
  var word: string;
  while (reader.readln(word)) {
    wordIndex.indexWord(word);
  }
  t.stop();
  timing("indexing complete in ", t.elapsed(TimeUnits.microseconds), " microseconds");
}
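
One variant I want to try next (sketch only, untested; indexBatch is a
hypothetical helper that would live in WordIndex and reuse
partitionForWord and the classes above): buffer words per partition on
the reading locale, then push each batch with a single on-statement, so
the remote list updates are amortized over many words instead of paying
one fine-grained on-statement per word:

  proc indexBatch(partition: int, batch: [] string) {
    var info = wordIndex[partition];
    on info {
      // one remote task per batch instead of one 'on' per word
      for w in batch do
        info.head = new Node(w, info.head);
      // bump the count once for the whole batch
      info.count.add(batch.numElements);
    }
  }

I don't know yet whether the array argument crosses the wire in bulk or
element-by-element here, so this may or may not actually reduce comm.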
On Thu, May 14, 2015 at 10:03 AM, Brian Guarraci <[email protected]> wrote:
> OK, so started from a clean distribution and CHPL_TARGET_ARCH=native is
> now running w/o the CHPL_MEM error. I think there must be a bug somewhere
> in the make clean (if I were using a git repo, I would have used git clean
> -fdx and probably avoided this problem). I'm glad it's a build issue.
>
> Now looking into perf issues w/ binary compiled with --fast.
>
> On Tue, May 12, 2015 at 8:12 PM, Brian Guarraci <[email protected]> wrote:
>
>> Yesterday, I did multiple full clean builds (for various combos)
>> suspecting what you suggest. I think there are some bugs and I need to dig
>> in to provide better clues for this list. I'm using 1.11.0 official build.
>>
>> One strange symptom was that using -nl 1 showed this error but -nl 16
>> just hung indefinitely, with no cpu or network activity. Tried this with
>> no optimizations as well, same result. Could also be partly related to the
>> weird gasnet issue I mentioned.
>>
>> > On May 12, 2015, at 7:40 PM, Michael Ferguson <[email protected]>
>> wrote:
>> >
>> > Hi Brian -
>> >
>> > I've seen errors like that if I don't run 'make' again in the
>> > compiler directory after setting the CHPL_TARGET_ARCH environment
>> > variable. You need to run 'make' again since the runtime builds
>> > with the target architecture setting.
>> >
>> > Of course this and the CHPL_MEM errors could be bugs...
>> >
>> > Hope that helps,
>> >
>> > -michael
>> >
>> >> On 5/12/15, 10:03 AM, "Brian Guarraci" <[email protected]> wrote:
>> >>
>> >> quick, partial, follow up:
>> >>
>> >> recompiled chapel with the CHPL_TARGET_ARCH=native and hit the
>> >> following error:
>> >>
>> >> /home/ubuntu/src/chapel-1.11.0/runtime/etc/Makefile.comm-gasnet:19:
>> >> /home/ubuntu/src/chapel-1.11.0/third-party/gasnet/install/linux32-gnu-native/seg-everything/nodbg/include/udp-conduit/udp-par.mak: No such file or directory
>> >> make: *** No rule to make target
>> >> `/home/ubuntu/src/chapel-1.11.0/third-party/gasnet/install/linux32-gnu-native/seg-everything/nodbg/include/udp-conduit/udp-par.mak'. Stop.
>> >> error: compiling generated source
>> >>
>> >> I tweaked the gasnet folder names, but haven't gotten it to work yet.
>> >> Subsequent attempts to run the compiled program resulted in "set
>> >> CHPL_MEM to a more appropriate mem type". I tried setting the mem type
>> >> to a few different choices and it didn't help. Needs more
>> >> investigation.
>> >>
>> >>
>> >> I did manage to run some local --fast tests, though: the --local
>> >> version ran in 5s while the --no-local version ran in about 12s.
>> >>
>> >> Additionally, I played around with the local keyword and that also
>> seems
>> >> to make some difference, at least locally. I need to try this on the
>> >> distributed version when I stand it back up.
>> >>
>> >>> On Mon, May 11, 2015 at 1:05 PM, Brad Chamberlain <[email protected]>
>> wrote:
>> >>>
>> >>>
>> >>> For well-written programs, most of the --local vs. --no-local
>> >>> differences show up as CPU overhead rather than network overhead.
>> >>> I.e., we tend not to do unnecessary communications; we simply execute
>> >>> extra scalar code to determine that communication is unnecessary, and
>> >>> the presence of this code hinders the back-end compiler's ability to
>> >>> optimize the per-node computation.
>> >>>
>> >>> Here's an example: a given array access like A[i] may not know whether
>> >>> the access is local or remote, so will introduce communication-related
>> >>> code to disambiguate. Even if that code doesn't generate communication,
>> >>> it can be ugly enough to throw the back-end C compiler off.
>> >>>
>> >>> Some workarounds to deal with this:
>> >>>
>> >>> * the 'local' block (documented in doc/release/technotes/README.local)
>> >>> -- this is a big hammer and likely to be replaced with more data-
>> >>> centric capabilities going forward, but it can be helpful in the
>> >>> meantime if you can get it working in a chunk of code.
>> >>>
>> >>> * I don't know how fully-fleshed out these features are, but there
>> >>> are at least draft capabilities for .localAccess() and .localSlice()
>> >>> methods on some array types to reduce overheads like the ones in
>> >>> my simple example above. I.e., if I know that A[i] is local for a
>> >>> given distributed array, A.localAccess(i) is likely to give better
>> >>> performance.
>> >>>
>> >>>
>> >>> But maybe I should start with a higher-level question: What kinds of
>> >>> data structures does your code use, and what types of idioms do you
>> >>> use to get multi-locale executions going? (e.g., distributed arrays +
>> >>> foralls? Or more manually-distributed data structures + task
>> >>> parallelism + on-clauses?)
>> >>>
>> >>> Thanks,
>> >>>
>> >>> -Brad
>> >>>
>> >>>
>> >>>> On Mon, 11 May 2015, Brian Guarraci wrote:
>> >>>>
>> >>>> I was aware of the ongoing progress in optimizing the comm, but I'll
>> >>>> take a look at your docs. I'll also give the --local vs. --no-local
>> >>>> experiment a try.
>> >>>>
>> >>>> I tested the network layer and saw my nodes were operating near peak
>> >>>> network capacity, so it wasn't a transport issue. Yes, the CPUs (of
>> >>>> which there are 4 per node) were nearly fully pegged. Given how
>> >>>> simple the code is, I suspect it was mostly overhead. I was even
>> >>>> looking for a way to pin the execution to a single proc, as I
>> >>>> wondered if there was some kind of thrashing going on between procs.
>> >>>> The funny thing was that the more I tried to optimize the program to
>> >>>> do less network traffic, the slower it got.
>> >>>>
>> >>>> On Mon, May 11, 2015 at 10:21 AM, Brad Chamberlain <[email protected]>
>> >>>> wrote:
>> >>>>
>> >>>>>
>> >>>>> Hi Brian --
>> >>>>>
>> >>>>> I believe that setting CHPL_TARGET_ARCH to 'native' should get you
>> >>>>> better
>> >>>>> results as long as you're not cross-compiling. Alternatively, you
>> >>>>> can set
>> >>>>> it to 'none' which will squash the warning you're getting. In any
>> >>>>> case, I
>> >>>>> wouldn't expect the lack of --specialize optimizations to be the
>> >>>>> problem
>> >>>>> here (but if you're passing the components of --fast manually, you'd
>> >>>>> want to
>> >>>>> be sure to add -O in addition to --no-checks).
>> >>>>>
>> >>>>> Generally speaking, Chapel programs compiled for --no-local
>> >>>>> (multi-locale
>> >>>>> execution) tend to generate much worse per-node code than those
>> >>>>> compiled
>> >>>>> for --local (single-locale execution), and this is an area of active
>> >>>>> optimization effort. See the "Performance Optimizations and
>> >>>>> Generated Code
>> >>>>> Improvements" release note slides at:
>> >>>>>
>> >>>>> http://chapel.cray.com/download.html#releaseNotes
>> >> http://chapel.cray.com/releaseNotes/1.11/06-PerfGenCode.pdf
>> >>>>>
>> >>>>> and particularly, the section entitled "the 'local field' pragma"
>> >>>>> for more details on this effort (starts at slide 34).
>> >>>>>
>> >>>>> In a nutshell, the Chapel compiler conservatively assumes that
>> >>>>> things are remote rather than local when in doubt (to emphasize
>> >>>>> correctness over fast-but-incorrect programs), and then gets into
>> >>>>> doubt far more often than it should. We're currently working on
>> >>>>> tightening up this gap.
>> >>>>>
>> >>>>> This could explain the full difference in performance that you're
>> >>>>> seeing, or something else may be happening. One way to check into
>> >>>>> this might be to run a --local vs. --no-local execution with
>> >>>>> CHPL_COMM=none to see how much overhead is added. The fact that all
>> >>>>> CPUs are pegged is a good indication that you don't have a problem
>> >>>>> with load balance or distributing data/computation across nodes, I'd
>> >>>>> guess?
>> >>>>>
>> >>>>> -Brad
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Mon, 11 May 2015, Brian Guarraci wrote:
>> >>>>>
>> >>>>>> I should add that I did supply --no-checks and that helped about
>> >>>>>> 10%.
>> >>>>>>
>> >>>>>> On Mon, May 11, 2015 at 10:04 AM, Brian Guarraci <[email protected]>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>>> It says:
>> >>>>>>>
>> >>>>>>> warning: --specialize was set, but CHPL_TARGET_ARCH is 'unknown'.
>> >>>>>>> If you
>> >>>>>>> want any specialization to occur please set CHPL_TARGET_ARCH to a
>> >>>>>>> proper
>> >>>>>>> value.
>> >>>>>>> It's unclear which target arch is appropriate.
>> >>>>>>>
>> >>>>>>> On Mon, May 11, 2015 at 9:55 AM, Brad Chamberlain <[email protected]>
>> >>>>>>> wrote:
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>> Hi Brian --
>> >>>>>>>>
>> >>>>>>>> Getting --fast working should definitely be the first priority.
>> >>>>>>>> What
>> >>>>>>>> about it fails to work?
>> >>>>>>>>
>> >>>>>>>> -Brad
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Sun, 10 May 2015, Brian Guarraci wrote:
>> >>>>>>>>
>> >>>>>>>>> Hi,
>> >>>>>>>>>
>> >>>>>>>>> I've been testing my search index on my 16-node ARM system and
>> >>>>>>>>> have been running into some strange behavior. The cool part is
>> >>>>>>>>> that the locale partitioning concept seems to work well; the
>> >>>>>>>>> downside is that the system is very slow. I've rewritten the
>> >>>>>>>>> approach a few different ways and haven't made a dent, so wanted
>> >>>>>>>>> to ask a few questions.
>> >>>>>>>>>
>> >>>>>>>>> On the ARM processors, I can only use FIFO and can't optimize
>> >>>>>>>>> (--fast doesn't work). Is this going to significantly affect
>> >>>>>>>>> cross-locale performance?
>> >>>>>>>>>
>> >>>>>>>>> I've looked at the generated C code and tried to minimize the
>> >>>>>>>>> _comm_ operations in core methods, but it doesn't seem to help.
>> >>>>>>>>> Network usage is still quite low (100K/s) while CPUs are pegged.
>> >>>>>>>>> Are there any profiling tools I can use to understand what might
>> >>>>>>>>> be going on here?
>> >>>>>>>>>
>> >>>>>>>>> Generally, on my laptop or single node, I can index about 1.1MM
>> >>>>>>>>> records
>> >>>>>>>>> in
>> >>>>>>>>> under 10s. With 16 nodes, it takes 10min to do 100k records.
>> >>>>>>>>>
>> >>>>>>>>> Wondering if there's some systemic issue at play here and how I
>> >>>>>>>>> can investigate further.
>> >>>>>>>>>
>> >>>>>>>>> Thanks!
>> >>>>>>>>> Brian
>> >
>>
>
>
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers