answers / comments inline

On Fri, May 15, 2015 at 8:54 AM, Michael Ferguson <[email protected]>
wrote:

> Hi Brian -
>
> See below for lots of feedback on the program you posted...
>
> On 5/14/15, 1:19 PM, "Brian Guarraci" <[email protected]> wrote:
>
> >I've managed to create a simple example which seems to encapsulate the
> >general, but ultra-simplified, access patterns that my search index is
> >doing.  The example runs about as slow and is basically not doing much
> >other than allocating nodes on
> > a linked list and incrementing an atomic.  I'd really love to know what
> >naive thing I'm doing that turns out to be really slow - and what I can
> >do about it.
> >
> >
> >
> >https://github.com/briangu/libev-examples/blob/_bg_p2/chapel/crosstalk.chpl
>
>
> I'd like to try running this on some other systems to understand if they
> have a similar performance issue. Are you using the GASNet UDP conduit?
> Do you have a runnable version of the simplified program? Also,
> if you kept the same skeleton of the program but changed the type
> of word to int - do you still see performance problems?


I've tried reverting to native ints and uints in all experiments.  No
difference.  I too was curious whether the ARM processor was having issues
with the width.


> If not,
> the problems might mostly stem from the string implementation which
> we are actively working on... (You could also try using tuples and/or
> records to represent fixed-length strings as a sequence of bytes/uint(8)s.)
>

That's interesting.  I will give it a try.
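To sketch what I'm picturing (untested, and the 16-byte cap is just an
assumption for illustration):

  // a fixed-length "string": a record over a homogeneous byte tuple
  record FixedWord {
    var len: int;            // number of bytes actually used
    var data: 16*uint(8);    // tuple of 16 bytes, no heap allocation
  }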


> As it is, I believe this program will leak memory in strings because
> of problems that we're working on fixing).
>
>
Got it. Thanks!


> Also, I'm a little bit surprised that this program is totally serial.
> We have some draft work to do 'forall file.lines()'. Would that be useful
> to you? On a related point - why are you using atomic ints if the program
> is totally serial? Was the full program working in parallel? Or were
> you working up to a parallel version?
>
>
As I mentioned above, at this point I have reduced everything down to the
simplest form in an effort to isolate the slowness.  The ideal state is that
reads are parallelized, enqueuing of indexing work is multi-threaded, and
writers are single-threaded on a per-partition basis.  The search system
I'm most familiar with uses a single writer and multiple readers and can
still easily index tens of thousands of records per second per partition.


> In addition, I'm thinking about making the I/O system support:
>  * efficiently reading from a remote file
>  * creating a channel to queue up work between two or more locales
> Would those features help?
>

Yes, I think so.

Using HDFS is my current plan, but having the ability to read a locale's
local file from another locale would be awesome.  Personally, I'd use a URI
scheme to address it.  Maybe use locale as the scheme, where
"locale://0/foo.txt" would read foo.txt from the CWD of locale 0.  I
suppose you could also address it in the open() call itself, removing the
need to synthesize strings.

I think a generic pub-sub system for distributing tasks would be very
useful.  It may also be more useful if you allow customized hash
partitioning functions, so that there's basically an implied partitioning
scheme and consumers could subscribe to one or more virtual queues.  This
would also allow for replicas on a hash partition.  More generically, I
think it would be very useful to have a way for the publisher to have a
queue of the responses (possibly the same type of system with multiple
consumers).  I could imagine taking a TCP request, publishing the work
item, getting the result, and then responding to the TCP request with the
response.  I could also imagine more standard use-cases like fanning out
image processing.
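To make the partitioning idea concrete, something as small as this (using
the genHashKey32 from my code; just a sketch) would let a publisher pick a
virtual queue and let replicas subscribe by queue ID:

  // hypothetical: map a work item's key to a virtual queue
  proc queueForKey(key: string, numQueues: int) {
    return genHashKey32(key) % numQueues;
  }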

> I also happen to know a fair amount about information retrieval...
> so I think I have an idea what you're trying to do here. I'd be
> happy to discuss with you other ways it might make sense to
> make a distributed indexer (in terms of overall program architecture).
> But, I won't start that discussion unless you opt in, since it's
> not what you were asking about...
>

Sure!  What I've done so far is pretty basic so I'm curious what you think
would work best in your experience.

> As a pedantic exercise, I'm going to go through your little program
> (with some hints from your other github code) and point out which
> communication is occurring where. I'm not saying that you do or
> don't already know these details. Likewise, I'm not saying that the
> resulting performance is where we would like it to be. Just trying
> to help you (and anybody else reading this) understand what's going
> on with the current implementation. For that reason, I'll probably
> answer questions you didn't mean to ask...
>
> >
> >use Logging, Memory, IO, Partitions, Time;
> >
> >
> >class Node {
> >  var word: string;
> >  var next: Node;
> >}
> >
> >
> >class PartitionInfo {
> >  var head: Node;
> >  var count: atomic int;
> >}
> >
> >
> >class WordIndex {
> >  var wordIndex: [0..Partitions.size-1] PartitionInfo;
> >
> >
> >  proc WordIndex() {
> >    for i in wordIndex.domain {
> >      on Partitions[i] {
> >        writeln("adding ", i);
> >        wordIndex[i] = new PartitionInfo();
> >      }
> >    }
> >  }
> >
> >
> >  proc indexWord(word: string) {
> >    var partition = partitionForWord(word);
> >    var info = wordIndex[partition];
>
> These should not cause communication because you're still running
> everything on Locale 0 at this point, and the WordIndex
> class instance as well as its wordIndex array member are
> all also on Locale 0.
>
> Now info is a PartitionInfo class instance.
>
> In WordIndex(), you allocate each PartitionInfo
> object on different nodes, so this on statement should actually
> do something useful. Furthermore, the 'head' and 'count' fields
> should be local at that point, but the compiler might still
> generate code that could communicate... This is a case where
> the 'local field' pragma Brad brought up before might help
> create more optimal code.
>
> >    on info {
>
> Note that the body of this 'on' statement uses two variables
> from the function: info and word.
>
> Remote value forwarding (an optimization within the Chapel compiler)
> should pass info (a pointer to a class object) along with the message
> starting the body of this 'on' statement. If that optimization is disabled,
> the generated code would probably communicate back to Locale 0 to read the
> value of the variable 'info' (since info is a variable storing
> a pointer to a class instance, and info resides on Locale 0).
> The same reasoning applies to 'word' which is a pointer to string data.
> My experiments with the current implementation indicate that
> the string pointer is included in remote value forwarding but that
> the string data will be communicated when you use it.
>
>
> >      info.head = new Node(word, info.head);
>
> Naturally, this call allocates heap memory, which can be slow...
> If slow allocations are the problem, life usually
> gets better if you're able to use tcmalloc (e.g.
> build with CHPL_MEM=tcmalloc).
>
>
> The compiler probably assumes that it could be running
> on a different locale than where 'info' is stored, so
> I wouldn't be surprised if that adds some significant
> overhead (that I believe you can reduce with 'local field').
>
> >      info.count.add(1);
>
> Here, the compiler probably doesn't know that this statement
> will run on the same locale where 'info.count' is stored.
> Of course it is in fact, and the 'local field' pragma would
> probably help. But, assuming that the compiler doesn't know
> that, the call info.count.add(1) will turn into something
> like this:
>  var to_add = 1;
>  on info.count {
>    some_atomic_add_intrinsic(local pointer to info.count, to_add);
>  }
>
> That means that your line info.count.add(1) will turn into
> a function call and will probably pack and unpack arguments
> for the on statement. All this adds overhead when it's not needed.
>
> In the end, this statement will be a local operation, but I don't
> understand why it's atomic at all. If you were running
> this in parallel, you'd run into problems with race
> conditions on info.head; so you'd want to compare-and-swap
> there. But right now we can't compare-and-swap class instances
> when building for more than one locale. So instead you'd
> probably fall back on protecting both of these fields
> with a lock (you'd use a sync variable for that in Chapel).
>

Yes, the code is basically representative of the type of work I would like
to do, not necessarily correct.  I noticed I couldn't do atomic class
pointers, which resulted in a bunch of ad-hoc code to roughly get the
right behavior.  Since it was all slow, I didn't go for correctness quite
yet.
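For the record, here's roughly the sync-variable version I think you're
describing (an untested sketch; writing a sync var blocks until it's
empty, and reading it empties it again):

  class PartitionInfo {
    var lock$: sync bool;    // empty == unlocked
    var head: Node;
    var count: int;          // plain int; only updated under the lock
  }

  proc PartitionInfo.insertWord(word: string) {
    lock$ = true;            // acquire
    head = new Node(word, head);
    count += 1;
    var unlock = lock$;      // release by emptying the sync var
  }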


>
> I would probably use an array for this and just expand
> the array periodically. I'd probably also try to send batches
> of words in the 'on' statements instead of just one, in order
> to better hide the latency of doing the 'on' statement.
>
>
My first set of major optimization experiments used batching.  It didn't
help, though that could be due to something else.  I generally agree that
batching would be nice here.
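What I tried looked roughly like this (simplified from memory, with
'wordIndex' as in the posted program):

  // ship a whole batch of words per 'on' statement instead of one
  proc indexBatch(partition: int, batch: [] string) {
    var info = wordIndex[partition];
    on info {
      for w in batch do
        info.head = new Node(w, info.head);
      info.count.add(batch.numElements);
    }
  }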


> >    }
> >  }
> >}
> >
>
> >
> >proc main() {
> >  initPartitions();
>
> For anybody not looking also at the github code,
> the partitions are declared like this:
>
>   config var partitionDimensions = 16; //Locales.size;
>   var Partitions: [0..partitionDimensions-1] locale;
>
> That means that they are in an array stored on locale 0.
> (Also, the code populating this Partitions array is
>  strange and probably non-optimal, but that's not what
>  you asked about)
>

The reason it's strange is that I found that the order in which hostnames
were mapped to locales seemed inconsistent.  If you use the partition ID
and hostname as part of a calculation, e.g. for identifying a filename,
then you would get different results over time.  The code there is a quick
hack to order by hostname so that, given a set of hostnames, the
partition => hostname map is consistent.
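Roughly, the intent is (untested):

  // order locales by hostname so a given set of hosts always yields
  // the same partition => hostname map
  var names: [0..numLocales-1] string;
  for loc in Locales do names[loc.id] = loc.name;
  // sort the locale ids lexicographically by names[id], then assign
  // Partitions[i] = Locales[sortedIds[i % numLocales]]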


> Personally, I would probably not use a Partitions array
> at all and instead write a function to go from word
> hash -> locale like this (e.g.)
>
>  proc localeForWord(word: string): locale {
>    return Locales[ genHashKey32(word) % Locales.size ];
>  }
>
> But, you presumably have some reason to want to possibly
> have multiple partitions on a particular locale. I would
> probably do that with something that can compute the
> partition/locale in a consistent manner without using
> an array at all. The reason is that accesses to the
> array cause communication. An alternative would be to replicate the
> Partitions array on all locales once it's set up. One
> way to do that would be to use the ReplicatedDist()
> distribution, which you can see shown in
>  test/{release/}examples/primers/distributions.chpl
>
>
This makes sense in terms of communication.  I think the reason I created
the partitions array was to have a place to hold the PartitionInfo.  It
wasn't immediately obvious when I started how each locale could have its
own PartitionInfo without a top-level reference.  Now that I'm more
familiar with Chapel, I think this should be more straightforward - and
worthwhile, since you say there's a big cost.
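If I understand the ReplicatedDist suggestion, per-locale state would look
something like this (untested; patterned after the distributions primer):

  use ReplicatedDist;

  // one PartitionInfo per locale, reachable without a Locale-0 array
  const RD = {0..0} dmapped ReplicatedDist();
  var localInfo: [RD] PartitionInfo;
  coforall loc in Locales do on loc do
    localInfo[0] = new PartitionInfo();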


> In any case, in this actual program these effects are
> limited to the program initialization...
>
> >
> >  var wordIndex = new WordIndex();
>
> This allocates heap memory of course..
>
> >
> >  var t: Timer;
> >  t.start();
> >
> >
> >  var infile = open("words.txt", iomode.r);
> >  var reader = infile.reader();
> >  var word: string;
> >  while (reader.readln(word)) {
>
> As I mentioned, we'd like to provide 'forall file.lines()'
> which would probably be nice here. Also the current implementation
> has problems with string memory leaks that we're actively working
> on. The read string or a copy of it might be leaked. If you just
> wanted a serial 'for all lines in file', you can do that with
>
>   for word in infile.lines() {
>     wordIndex.indexWord(word);
>   }
>
> file.lines() is demonstrated in
>   test/{release/}examples/primers/fileIO.chpl
>
>
> >    wordIndex.indexWord(word);
> >  }
> >
> >
> >  t.stop();
> >  timing("indexing complete in ",t.elapsed(TimeUnits.microseconds), "
> >microseconds");
> >}
>
>
>
Generally, it seems like I need to go to a more standard approach (for
search, in Lucene at least) of allocating large blocks of memory and
indexing into them.  Don't use strings (for now, until the leak is fixed).
Fix the partitioning so that there's no need for a top-level reference
array - my guess is that this is really bad perf-wise.  And fix the race
conditions around class pointers.

Thanks for the feedback!



> Thanks for trying all of this out,
>
> -michael
>
>
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers
