Hi Brian -

See below for lots of feedback on the program you posted...

On 5/14/15, 1:19 PM, "Brian Guarraci" <[email protected]> wrote:

>I've managed to create a simple example which seems to encapsulate the
>general, but ultra-simplified, access patterns that my search index is
>doing.  The example runs about as slow and is basically not doing much
>other than allocating nodes on
> a linked list and incrementing an atomic.  I'd really love to know what
>naive thing I'm doing that turns out to be really slow - and what I can
>do about it.
>
>
>https://github.com/briangu/libev-examples/blob/_bg_p2/chapel/crosstalk.chpl


I'd like to try running this on some other systems to understand if they
have a similar performance issue. Are you using the GASNet UDP conduit?
Do you have a runnable version of the simplified program? Also,
if you keep the same skeleton of the program but change the type
of word to int, do you still see performance problems? If not,
the problems probably stem mostly from the string implementation,
which we are actively working on... (You could also try using tuples
and/or records to represent fixed-length strings as a sequence of
bytes/uint(8)s. As it is, I believe this program will leak memory in
strings because of problems that we're working on fixing.)
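
For example, here's a rough, untested sketch of what I mean by a
fixed-length word record (the 16-byte cap and all of the names here are
made up on my part):

  record FixedWord {
    var len: int;
    var data: 16*uint(8);   // 16 bytes is an arbitrary cap, just for illustration
  }

  proc toFixedWord(s: string): FixedWord {
    var w: FixedWord;
    w.len = min(s.length, 16);
    for i in 1..w.len do
      w.data(i) = ascii(s[i]): uint(8);  // copy the bytes into the tuple
    return w;
  }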

Also, I'm a little bit surprised that this program is totally serial.
We have some draft work toward supporting 'forall file.lines()'. Would
that be useful to you? On a related point, why are you using atomic ints
if the program is totally serial? Was the full program working in
parallel? Or were you working up to a parallel version?

In addition, I'm thinking about making the I/O system support:
 * efficiently reading from a remote file
 * creating a channel to queue up work between two or more locales
Would those features help?

I also happen to know a fair amount about information retrieval...
so I think I have an idea of what you're trying to do here. I'd be
happy to discuss other ways it might make sense to structure
a distributed indexer (in terms of overall program architecture).
But, I won't start that discussion unless you opt in, since it's
not what you were asking about...

As a pedantic exercise, I'm going to go through your little program
(with some hints from your other github code) and point out which
communication is occurring where. I'm not saying that you do or
don't already know these details. Likewise, I'm not saying that the
resulting performance is where we would like it to be. Just trying
to help you (and anybody else reading this) understand what's going
on with the current implementation. For that reason, I'll probably
answer questions you didn't mean to ask...

>
>use Logging, Memory, IO, Partitions, Time;
>
>
>class Node {
>  var word: string;
>  var next: Node;
>}
>
>
>class PartitionInfo {
>  var head: Node;
>  var count: atomic int;
>}
>
>
>class WordIndex {
>  var wordIndex: [0..Partitions.size-1] PartitionInfo;
>
>
>  proc WordIndex() {
>    for i in wordIndex.domain {
>      on Partitions[i] {
>        writeln("adding ", i);
>        wordIndex[i] = new PartitionInfo();
>      }      
>    }
>  }
>
>
>  proc indexWord(word: string) {
>    var partition = partitionForWord(word);
>    var info = wordIndex[partition];

These should not cause communication because you're still running
everything on Locale 0 at this point, and the WordIndex
class instance as well as its wordIndex array member are
all also on Locale 0.

Now info is a PartitionInfo class instance.

In WordIndex(), you allocate each PartitionInfo
object on a different locale, so the 'on info' statement below should
actually do something useful. Furthermore, the 'head' and 'count' fields
should be local at that point, but the compiler might still
generate code that could communicate... This is a case where
the 'local field' pragma Brad brought up before might help the
compiler generate more optimal code.

>    on info {

Note that the body of this 'on' statement uses two variables
from the function: info and word.

Remote value forwarding (an optimization within the Chapel compiler)
should pass info (a pointer to a class object) along with the message
starting the body of this 'on' statement. If that optimization is disabled,
the generated code would probably communicate back to Locale 0 to read the
value of the variable 'info' (since info is a variable storing
a pointer to a class instance, and info resides on Locale 0).
The same reasoning applies to 'word', which is a pointer to string data.
My experiments with the current implementation indicate that
the string pointer is included in remote value forwarding but that
the string data will be communicated when you use it.


>      info.head = new Node(word, info.head);

Naturally, this call allocates heap memory, which can be slow...
If slow allocations are the problem, life usually
gets better if you're able to use tcmalloc (e.g.
build with CHPL_MEM=tcmalloc).


The compiler probably assumes that it could be running
on a different locale than where 'info' is stored, so
I wouldn't be surprised if that adds some significant
overhead (that I believe you can reduce with 'local field').

>      info.count.add(1);

Here, the compiler probably doesn't know that this statement
will run on the same locale where 'info.count' is stored.
Of course it does in fact run there, and the 'local field' pragma
would probably help. But, assuming that the compiler doesn't know
that, the call info.count.add(1) will turn into something
like this:
 var to_add = 1;
 on info.count {
   some_atomic_add_intrinsic(local pointer to info.count, to_add);
 }

That means that your line info.count.add(1) will turn into
a function call and will probably pack and unpack arguments for the
on statement. All of this adds overhead when it's not needed.

In the end, this statement will be a local operation, but I don't
understand why it's atomic at all. If you were running
this in parallel, you'd run into problems with race
conditions on info.head; so you'd want to compare-and-swap
there. But right now we can't compare-and-swap class instances
when building for more than one locale. So instead you'd
probably fall back on protecting both of these fields
with a lock (you'd use a sync variable for that in Chapel).
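
To illustrate what I mean by the sync-variable approach, here's a rough,
untested sketch (the field name and the details are mine, not anything
we've settled on):

  class PartitionInfo {
    var lock$: sync bool = true;   // full == unlocked
    var head: Node;
    var count: int;                // no longer needs to be atomic under the lock
  }

  // ...and then inside the 'on info' block:
  const lockVal = info.lock$;      // reading a sync var waits for full and leaves it empty (acquire)
  info.head = new Node(word, info.head);
  info.count += 1;
  info.lock$ = true;               // writing waits for empty and fills it again (release)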

I would probably use an array for this and just expand
the array periodically. I'd probably also try to send batches
of words in the 'on' statements instead of just one, in order
to better hide the latency of doing the 'on' statement.
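
Here's a very rough sketch of the batching idea (the batch size and the
helper names are made up, and I haven't tried compiling this):

  config const batchSize = 64;

  // accumulate words per partition on Locale 0, then ship a whole
  // batch across in a single 'on' statement
  var batches: [0..Partitions.size-1] [1..batchSize] string;
  var batchLen: [0..Partitions.size-1] int;

  proc flushBatch(index: WordIndex, partition: int) {
    const info = index.wordIndex[partition];
    const n = batchLen[partition];
    const myBatch = batches[partition];   // local copy to ship along with the 'on'
    on info {
      for i in 1..n do
        info.head = new Node(myBatch[i], info.head);
      info.count.add(n);
    }
    batchLen[partition] = 0;
  }

  proc indexWordBatched(index: WordIndex, word: string) {
    const partition = partitionForWord(word);
    batchLen[partition] += 1;
    batches[partition][batchLen[partition]] = word;
    if batchLen[partition] == batchSize then
      flushBatch(index, partition);
    // (remember to flush any partial batches once the input is exhausted)
  }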

>    }
>  }
>}
>

>
>proc main() {
>  initPartitions();

For anybody not also looking at the github code,
the Partitions array is declared like this:

  config var partitionDimensions = 16; //Locales.size;
  var Partitions: [0..partitionDimensions-1] locale;

That means that the partitions live in an array stored on Locale 0.
(Also, the code populating this Partitions array is
 strange and probably suboptimal, but that's not what
 you asked about.)
Personally, I would probably not use a Partitions array
at all and instead write a function to go from word
hash -> locale like this (e.g.)

 proc localeForWord(word: string): locale {
   return Locales[ genHashKey32(word) % Locales.size ];
 }

But, you presumably have some reason to want to possibly
have multiple partitions on a particular locale. I would
probably do that with something that can compute the
partition/locale in a consistent manner without using
an array at all. The reason is that accesses to the array cause
communication. An alternative would be to replicate the
Partitions array on all locales once it's set up. One
way to do that would be to use the ReplicatedDist()
distribution, which you can see shown in
 test/{release/}examples/primers/distributions.chpl
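
To make the 'compute it rather than store it' idea concrete, here's
roughly what I have in mind (reusing the genHashKey32 and
partitionDimensions names from your repo; untested):

  // map a word to a partition, and a partition to its owning locale,
  // purely by computation -- no remotely-stored Partitions array involved
  proc partitionForWord(word: string): int {
    return (genHashKey32(word) % partitionDimensions): int;
  }

  proc localeForPartition(partition: int): locale {
    return Locales[partition % Locales.size];
  }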


In any case, in this actual program these effects are
limited to the program initialization...

>
>  var wordIndex = new WordIndex();

This allocates heap memory, of course...

>
>  var t: Timer;
>  t.start();
>
>
>  var infile = open("words.txt", iomode.r);
>  var reader = infile.reader();
>  var word: string;
>  while (reader.readln(word)) {

As I mentioned, we'd like to provide 'forall file.lines()',
which would probably be nice here. Also, the current implementation
has problems with string memory leaks that we're actively working
on: the string that is read, or a copy of it, might be leaked. If you just
want a serial 'for all lines in the file', you can do that with

  for word in infile.lines() {
    wordIndex.indexWord(word);
  }

file.lines() is demonstrated in
  test/{release/}examples/primers/fileIO.chpl


>    wordIndex.indexWord(word);
>  }
>
>
>  t.stop();
>  timing("indexing complete in ",t.elapsed(TimeUnits.microseconds), "
>microseconds");
>}


Thanks for trying all of this out,

-michael

