Hi Brian - Re. HDFS, are you aware that Chapel already has some (experimental) HDFS support? See http://chapel.cray.com/docs/latest/modules/standard/HDFS.html
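For anyone following along, a very rough sketch of what reading a file through that module might look like. The calls here (hdfsChapelConnect, hdfsOpen, hdfsClose, hdfsChapelDisconnect) are my best recollection of the experimental API, so check the linked docs rather than trusting this:

```chapel
// Hedged sketch only: the experimental HDFS API may differ from this.
use HDFS;

var fs = hdfsChapelConnect("default", 0);         // namenode, port (assumed defaults)
var f  = fs.hdfsOpen("/user/brian/words.txt", iomode.r);
var reader = f.reader();
var word: string;
while reader.readln(word) do                      // one line at a time
  writeln(word);
f.hdfsClose();
fs.hdfsChapelDisconnect();
```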
That might make it easier to do the I/O patterns you were hoping for... In
addition, I *think* you might be able to get HDFSiter to read lines (if you
use a regular expression matching a line). We do have (again experimental)
forall support for HDFSiters.

-michael

On 5/19/15, 9:45 AM, "Brian Guarraci" <[email protected]> wrote:

>answers / comments inline
>
>On Fri, May 15, 2015 at 8:54 AM, Michael Ferguson
><[email protected]> wrote:
>
>Hi Brian -
>
>See below for lots of feedback on the program you posted...
>
>On 5/14/15, 1:19 PM, "Brian Guarraci" <[email protected]> wrote:
>
>>I've managed to create a simple example which seems to encapsulate the
>>general, but ultra-simplified, access patterns that my search index is
>>doing. The example runs about as slowly as the full program and is
>>basically not doing much other than allocating nodes on a linked list
>>and incrementing an atomic. I'd really love to know what naive thing
>>I'm doing that turns out to be really slow - and what I can do about it.
>>
>>https://github.com/briangu/libev-examples/blob/_bg_p2/chapel/crosstalk.chpl
>
>I'd like to try running this on some other systems to understand whether
>they have a similar performance issue. Are you using the GASNet UDP
>conduit? Do you have a runnable version of the simplified program? Also,
>if you kept the same skeleton of the program but changed the type of
>word to int - do you still see performance problems?
>
>I've tried reverting to native ints and uints in all experiments. No
>difference. I too was curious whether the ARM processor was having
>issues with the width.
>
>If not, the problems might mostly stem from the string implementation,
>which we are actively working on... (You could also try using tuples
>and/or records to represent fixed-length strings as a sequence of
>bytes/uint(8)s.)
>
>That's interesting. I will give it a try.
>
>As it is, I believe this program will leak memory in strings because of
>problems that we're working on fixing.
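To make the tuples/records suggestion concrete, here is a rough, untested sketch of a fixed-length word stored as a record of bytes; FixedWord and makeFixedWord are invented names for illustration, not an existing API:

```chapel
// Untested sketch of a fixed-length "string" as a record of bytes.
record FixedWord {
  param maxLen = 16;             // fixed capacity; bytes beyond it are dropped
  var len: int;
  var data: maxLen*uint(8);      // homogeneous tuple of bytes
}

proc makeFixedWord(s: string): FixedWord {
  var w: FixedWord;
  w.len = min(s.length, w.maxLen);
  for i in 1..w.len do
    w.data(i) = ascii(s[i]):uint(8);   // byte value of the i-th character
  return w;
}
```

Records like this are value types, so passing one in an 'on' statement should move a fixed-size blob of bytes rather than a pointer to string data.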
>Got it. Thanks!
>
>Also, I'm a little bit surprised that this program is totally serial.
>We have some draft work to do 'forall file.lines()'. Would that be useful
>to you? On a related point - why are you using atomic ints if the program
>is totally serial? Was the full program working in parallel? Or were
>you working up to a parallel version?
>
>As I mentioned above, at this point I have reduced everything down to
>the simplest form in an effort to clarify the slowness. The ideal state
>is that reads are parallelized, enqueuing of indexing is multi-threaded,
>and writers are single-threaded on a per-partition basis. The search
>system I'm most familiar with uses a single writer and multiple readers
>and still can easily index 10s of thousands of records per second per
>partition.
>
>In addition, I'm thinking about making the I/O system support:
> * efficiently reading from a remote file
> * creating a channel to queue up work between two or more locales
>Would those features help?
>
>Yes, I think so.
>
>Using HDFS is my current plan, but having the ability to read a locale's
>local file from another locale would be awesome. Personally, I'd use a
>URI scheme to address it. Maybe use locale as the scheme? Where
>"locale://0/foo.txt" could read foo.txt from the CWD of locale 0. I
>suppose you could also address it in the open file command as well,
>removing the need to synthesize strings.
>
>I think a generic pub-sub system for distributing tasks would be very
>useful. It may also be more useful if you allow customized hash
>partitioning functions, so that there's basically an implied partitioning
>scheme and consumers could subscribe to one or more virtual queues. This
>would also allow for replicas on a hash partition. More generically, I
>think it would be very useful for the publisher to have a queue of the
>responses (possibly the same type of system with multiple consumers).
>I could imagine taking a TCP request, publishing the work item, getting
>the result, and then responding to the TCP request with the response. I
>could also imagine more standard use cases like fanning out image
>processing.
>
>I also happen to know a fair amount about information retrieval...
>so I think I have an idea what you're trying to do here. I'd be
>happy to discuss with you other ways it might make sense to
>make a distributed indexer (in terms of overall program architecture).
>But, I won't start that discussion unless you opt in, since it's
>not what you were asking about...
>
>Sure! What I've done so far is pretty basic so I'm curious what you
>think would work best in your experience.
>
>As a pedantic exercise, I'm going to go through your little program
>(with some hints from your other github code) and point out which
>communication is occurring where. I'm not saying that you do or
>don't already know these details. Likewise, I'm not saying that the
>resulting performance is where we would like it to be. Just trying
>to help you (and anybody else reading this) understand what's going
>on with the current implementation. For that reason, I'll probably
>answer questions you didn't mean to ask...
>> use Logging, Memory, IO, Partitions, Time;
>>
>> class Node {
>>   var word: string;
>>   var next: Node;
>> }
>>
>> class PartitionInfo {
>>   var head: Node;
>>   var count: atomic int;
>> }
>>
>> class WordIndex {
>>   var wordIndex: [0..Partitions.size-1] PartitionInfo;
>>
>>   proc WordIndex() {
>>     for i in wordIndex.domain {
>>       on Partitions[i] {
>>         writeln("adding ", i);
>>         wordIndex[i] = new PartitionInfo();
>>       }
>>     }
>>   }
>>
>>   proc indexWord(word: string) {
>>     var partition = partitionForWord(word);
>>     var info = wordIndex[partition];
>
>These should not cause communication because you're still running
>everything on Locale 0 at this point, and the WordIndex class instance
>as well as its wordIndex array member are all also on Locale 0.
>
>Now info is a PartitionInfo class instance.
>
>In WordIndex(), you allocate each PartitionInfo object on a different
>node, so this on statement should actually do something useful.
>Furthermore, the 'head' and 'count' fields should be local at that
>point, but the compiler might still generate code that could
>communicate... This is a case where the 'local field' pragma Brad
>brought up before might help create more optimal code.
>
>>     on info {
>
>Note that the body of this 'on' statement uses two variables from the
>function: info and word.
>
>Remote value forwarding (an optimization within the Chapel compiler)
>should pass info (a pointer to a class object) along with the message
>starting the body of this 'on' statement. If that optimization is
>disabled, the generated code would probably communicate back to Locale 0
>to read the value of the variable 'info' (since info is a variable
>storing a pointer to a class instance, and info resides on Locale 0).
>The same reasoning applies to 'word', which is a pointer to string data.
>My experiments with the current implementation indicate that the string
>pointer is included in remote value forwarding but that the string data
>will be communicated when you use it.
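As an aside, the 'local field' pragma mentioned above would presumably be spelled something like this on the class declaration. This is only a sketch - pragmas are unstable compiler internals, so the exact spelling and behavior should be double-checked:

```chapel
// Sketch: tell the compiler these fields live on the locale running the code,
// so field accesses need not generate wide-pointer communication checks.
class PartitionInfo {
  pragma "local field"
  var head: Node;
  pragma "local field"
  var count: atomic int;
}
```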
>>       info.head = new Node(word, info.head);
>
>Naturally, this call allocates heap memory, which can be slow... If slow
>allocations are the problem, life usually gets better if you're able to
>use tcmalloc (e.g. build with CHPL_MEM=tcmalloc).
>
>The compiler probably assumes that it could be running on a different
>locale than where 'info' is stored, so I wouldn't be surprised if that
>adds some significant overhead (that I believe you can reduce with
>'local field').
>
>>       info.count.add(1);
>
>Here, the compiler probably doesn't know that this statement will run on
>the same locale where 'info.count' is stored. Of course it does in fact,
>and the 'local field' pragma would probably help. But, assuming that the
>compiler doesn't know that, the call info.count.add(1) will turn into
>something like this:
>
>  var to_add = 1;
>  on info.count {
>    some_atomic_add_intrinsic(local pointer to info.count, to_add);
>  }
>
>That means that your line info.count.add(1) will turn into a function
>call and probably pack and unpack arguments for the on statement. All
>this adds overhead when it's not needed.
>
>In the end, this statement will be a local operation, but I don't
>understand why it's atomic at all. If you were running this in parallel,
>you'd run into problems with race conditions on info.head; so you'd want
>to compare-and-swap there. But right now we can't compare-and-swap class
>instances when building for more than one locale. So instead you'd
>probably fall back on protecting both of these fields with a lock (you'd
>use a sync variable for that in Chapel).
>
>Yes, the code is basically representative of the type of work I would
>like to do, not necessarily correct. I noticed I couldn't do atomic
>class pointers, which resulted in a bunch of ad-hoc code to roughly get
>the right behavior. Since it was all slow, I didn't go for correctness
>quite yet.
>
>I would probably use an array for this and just expand the array
>periodically.
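As a sketch of that growable-array idea (untested, names invented; it assumes that resizing a Chapel array's domain preserves the elements at indices common to the old and new domains):

```chapel
// Untested sketch: replace the linked list with a per-partition array
// that doubles its domain when full.
class PartitionInfo {
  var D = {0..#16};
  var words: [D] string;
  var count: int;

  proc append(word: string) {
    if count >= D.size then
      D = {0..#(2*D.size)};    // grow; existing elements are carried over
    words[count] = word;
    count += 1;
  }
}
```

This trades the per-word Node allocation for an occasional bulk reallocation, which should interact much better with the allocator.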
>I'd probably also try to send batches of words in the 'on' statements
>instead of just one, in order to better hide the latency of doing the
>'on' statement.
>
>My first set of major optimization experiments used batching. It didn't
>help. That could be from something else though. Generally agree batching
>would be nice here.
>
>>     }
>>   }
>> }
>>
>> proc main() {
>>   initPartitions();
>
>For anybody not looking also at the github code, the partitions are
>declared like this:
>
>  config var partitionDimensions = 16; //Locales.size;
>  var Partitions: [0..partitionDimensions-1] locale;
>
>That means that they are in an array stored on Locale 0. (Also, the code
>populating this Partitions array is strange and probably non-optimal,
>but that's not what you asked about.)
>
>The reason it's strange is that I found that the order in which the
>hostnames were mapped to locales seemed inconsistent. If you use the
>partition ID and hostname as part of your calculation, e.g. for
>identifying a filename, then you would get different results over time.
>The code there is a quick hack to order by hostname so that, given a set
>of hostnames, the partitions => hostname map is consistent.
>
>Personally, I would probably not use a Partitions array at all and
>instead write a function to go from word hash -> locale like this (e.g.):
>
>  proc localeForWord(word: string): locale {
>    return Locales[ genHashKey32(word) % Locales.size ];
>  }
>
>But, you presumably have some reason to want to possibly have multiple
>partitions on a particular locale. I would probably do that with
>something that can compute the partition/locale in a consistent manner
>without using an array at all. The reason is that the array accesses
>cause communication. An alternative would be to replicate the Partitions
>array on all locales once it's set up.
>One way to do that would be to use the ReplicatedDist() distribution,
>which you can see shown in
>
>  test/{release/}examples/primers/distributions.chpl
>
>This makes sense in terms of communication. I think the reason I created
>the partitions array was to have a place to hold the PartitionInfo. It
>wasn't immediately obvious when I started this how each locale could
>have its own PartitionInfo without a top-level reference. Now that I'm
>more familiar with Chapel, I think this should be more straightforward -
>and worthwhile, since you say there's a big cost.
>
>In any case, in this actual program these effects are limited to the
>program initialization...
>
>>   var wordIndex = new WordIndex();
>
>This allocates heap memory, of course...
>
>>   var t: Timer;
>>   t.start();
>>
>>   var infile = open("words.txt", iomode.r);
>>   var reader = infile.reader();
>>   var word: string;
>>   while (reader.readln(word)) {
>
>As I mentioned, we'd like to provide 'forall file.lines()', which would
>probably be nice here. Also, the current implementation has problems
>with string memory leaks that we're actively working on. The read string
>or a copy of it might be leaked. If you just wanted a serial 'for all
>lines in file', you can do that with
>
>  for word in infile.lines() {
>    wordIndex.indexWord(word);
>  }
>
>file.lines() is demonstrated in
>
>  test/{release/}examples/primers/fileIO.chpl
>
>>     wordIndex.indexWord(word);
>>   }
>>
>>   t.stop();
>>   timing("indexing complete in ", t.elapsed(TimeUnits.microseconds),
>>          " microseconds");
>> }
>
>Generally, it seems like I need to go to a more standard approach (for
>search, at least on Lucene) of allocating large blocks of memory and
>indexing into them. Don't use strings (right now, until the leak is
>fixed). Fix partitioning so that there's no need for a top-level
>reference array - my guess is that this is really bad perf-wise. And
>fix the race conditions around class pointers.
>
>Thanks for the feedback!
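For concreteness, the sync-variable locking suggested earlier in the thread (as the fallback for not being able to compare-and-swap class instances) might look roughly like this untested sketch; the $ suffix is only a naming convention:

```chapel
// Untested sketch of protecting head/count with a sync variable.
class PartitionInfo {
  var lock$: sync bool;     // empty = unlocked; writing acquires, reading releases
  var head: Node;
  var count: int;           // no longer needs to be atomic under the lock

  proc push(word: string) {
    lock$ = true;           // acquire: blocks until the sync var is empty
    head = new Node(word, head);
    count += 1;
    const unused = lock$;   // release: reading empties the sync var
  }
}
```

Writing to a sync variable blocks while it is full and reading blocks while it is empty, which is what makes this pattern work as a mutex.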
>Thanks for trying all of this out,
>
>-michael

_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers
