Hi Michael -

Quick partial response as I'm heading out the door:
I'm using the GASNet UDP conduit. Everything is serial to simplify things, but yes, the goal is parallel indexing. You should be able to use 'make search' and 'make fanout' to build the apps. The master branch is up to date with everything.

Thanks!

> On May 15, 2015, at 8:54 AM, Michael Ferguson <[email protected]> wrote:
> 
> Hi Brian -
> 
> See below for lots of feedback on the program you posted...
> 
>> On 5/14/15, 1:19 PM, "Brian Guarraci" <[email protected]> wrote:
>> 
>> I've managed to create a simple example which seems to encapsulate the
>> general, but ultra-simplified, access patterns that my search index is
>> doing. The example runs about as slowly and is basically not doing much
>> other than allocating nodes on a linked list and incrementing an atomic.
>> I'd really love to know what naive thing I'm doing that turns out to be
>> really slow - and what I can do about it.
>> 
>> https://github.com/briangu/libev-examples/blob/_bg_p2/chapel/crosstalk.chpl
> 
> I'd like to try running this on some other systems to understand if they
> have a similar performance issue. Are you using the GASNet UDP conduit?
> Do you have a runnable version of the simplified program? Also,
> if you kept the same skeleton of the program but changed the type
> of word to int, do you still see performance problems? If not,
> the problems might mostly stem from the string implementation, which
> we are actively working on... (You could also try using tuples and/or
> records to represent fixed-length strings as a sequence of bytes/uint(8)s.
> As it is, I believe this program will leak memory in strings because
> of problems that we're working on fixing.)
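> 
> Roughly what I have in mind there is something like the following
> (just an untested sketch: the 16-byte size and the names are
> placeholders, and it assumes the current string routines support
> ascii() and 1-based indexing):
> 
>   record FixedWord {
>     var len: uint(8);
>     var data: 16*uint(8);    // homogeneous tuple of bytes
>   }
> 
>   proc toFixedWord(word: string): FixedWord {
>     var fw: FixedWord;
>     const n = min(word.length, 16);
>     fw.len = n: uint(8);
>     for i in 1..n do
>       fw.data[i] = ascii(word[i]): uint(8);
>     return fw;
>   }
> 
>   class Node {
>     var word: FixedWord;     // instead of 'string'
>     var next: Node;
>   }
> 
> Since a FixedWord is a self-contained, fixed-size value, moving one to
> another locale doesn't involve separately allocated string data.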
> 
> Also, I'm a little bit surprised that this program is totally serial.
> We have some draft work to do 'forall file.lines()'. Would that be useful
> to you? On a related point - why are you using atomic ints if the program
> is totally serial? Was the full program working in parallel? Or were
> you working up to a parallel version?
> 
> In addition, I'm thinking about making the I/O system support:
>  * efficiently reading from a remote file
>  * creating a channel to queue up work between two or more locales
> Would those features help?
> 
> I also happen to know a fair amount about information retrieval...
> so I think I have an idea what you're trying to do here. I'd be
> happy to discuss with you other ways it might make sense to
> make a distributed indexer (in terms of overall program architecture).
> But, I won't start that discussion unless you opt in, since it's
> not what you were asking about...
> 
> As a pedantic exercise, I'm going to go through your little program
> (with some hints from your other github code) and point out which
> communication is occurring where. I'm not saying that you do or
> don't already know these details. Likewise, I'm not saying that the
> resulting performance is where we would like it to be. Just trying
> to help you (and anybody else reading this) understand what's going
> on with the current implementation. For that reason, I'll probably
> answer questions you didn't mean to ask...
> 
>> use Logging, Memory, IO, Partitions, Time;
>> 
>> class Node {
>>   var word: string;
>>   var next: Node;
>> }
>> 
>> class PartitionInfo {
>>   var head: Node;
>>   var count: atomic int;
>> }
>> 
>> class WordIndex {
>>   var wordIndex: [0..Partitions.size-1] PartitionInfo;
>> 
>>   proc WordIndex() {
>>     for i in wordIndex.domain {
>>       on Partitions[i] {
>>         writeln("adding ", i);
>>         wordIndex[i] = new PartitionInfo();
>>       }
>>     }
>>   }
>> 
>>   proc indexWord(word: string) {
>>     var partition = partitionForWord(word);
>>     var info = wordIndex[partition];
> 
> These should not cause communication because you're still running
> everything on Locale 0 at this point, and the WordIndex
> class instance as well as its wordIndex array member are
> all also on Locale 0.
> 
> Now info is a PartitionInfo class instance.
> 
> In WordIndex(), you allocate each PartitionInfo
> object on different nodes, so this 'on' statement should actually
> do something useful. Furthermore, the 'head' and 'count' fields
> should be local at that point, but the compiler might still
> generate code that could communicate... This is a case where
> the 'local field' pragma Brad brought up before might help
> create more optimal code.
> 
>>     on info {
> 
> Note that the body of this 'on' statement uses two variables
> from the function: info and word.
> 
> Remote value forwarding (an optimization within the Chapel compiler)
> should pass info (a pointer to a class object) along with the message
> starting the body of this 'on' statement. If that optimization is disabled,
> the generated code would probably communicate back to Locale 0 to read the
> value of the variable 'info' (since info is a variable storing
> a pointer to a class instance, and info resides on Locale 0).
> The same reasoning applies to 'word', which is a pointer to string data.
> My experiments with the current implementation indicate that
> the string pointer is included in remote value forwarding but that
> the string data will be communicated when you use it.
> 
>>       info.head = new Node(word, info.head);
> 
> Naturally, this call allocates heap memory, which can be slow...
> If slow allocations are the problem, life usually
> gets better if you're able to use tcmalloc (e.g.
> build with CHPL_MEM=tcmalloc).
> 
> The compiler probably assumes that it could be running
> on a different locale than where 'info' is stored, so
> I wouldn't be surprised if that adds some significant
> overhead (that I believe you can reduce with 'local field').
> 
>>       info.count.add(1);
> 
> Here, the compiler probably doesn't know that this statement
> will run on the same locale where 'info.count' is stored.
> Of course it is, in fact, and the 'local field' pragma would
> probably help. But, assuming that the compiler doesn't know
> that, the call info.count.add(1) will turn into something
> like this:
> 
>   var to_add = 1;
>   on info.count {
>     some_atomic_add_intrinsic(<local pointer to info.count>, to_add);
>   }
> 
> That means that your line info.count.add(1) will turn into
> a function call and probably pack and unpack arguments for the
> 'on' statement. All this adds overhead when it's not needed.
> 
> In the end, this statement will be a local operation, but I don't
> understand why it's atomic at all. If you were running
> this in parallel, you'd run into problems with race
> conditions on info.head, so you'd want to compare-and-swap
> there. But right now we can't compare-and-swap class instances
> when building for more than one locale. So instead you'd
> probably fall back on protecting both of these fields
> with a lock (you'd use a sync variable for that in Chapel).
> 
> I would probably use an array for this and just expand
> the array periodically. I'd probably also try to send batches
> of words in the 'on' statements instead of just one, in order
> to better hide the latency of doing the 'on' statement.
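> 
> For example, keeping your linked list of strings for the moment, a
> lock-plus-batching version might look roughly like this (an untested
> sketch: indexWords is a name I made up, and it assumes the caller has
> already grouped a batch of words that all map to one partition):
> 
>   class PartitionInfo {
>     var head: Node;
>     var count: int;               // plain int; updates protected by lock$
>     var lock$: sync bool = true;  // full means unlocked
>   }
> 
>   proc WordIndex.indexWords(words: [] string) {
>     // assumes every word in this batch belongs to the same partition
>     var partition = partitionForWord(words[words.domain.low]);
>     var info = wordIndex[partition];
>     on info {
>       info.lock$.readFE();        // acquire: leaves the sync var empty
>       for w in words do           // the string data still gets communicated,
>         info.head = new Node(w, info.head);  // but there's one 'on' per batch
>       info.count += words.size;
>       info.lock$.writeEF(true);   // release: fills it again
>     }
>   }
> 
> Using a growable array per partition instead of the linked list would
> also cut the per-word allocations down to an occasional resize.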
>>     }
>>   }
>> }
> 
>> proc main() {
>>   initPartitions();
> 
> For anybody not looking also at the github code,
> the partitions are declared like this:
> 
>   config var partitionDimensions = 16; //Locales.size;
>   var Partitions: [0..partitionDimensions-1] locale;
> 
> That means that they are in an array stored on Locale 0.
> (Also, the code populating this Partitions array is
> strange and probably non-optimal, but that's not what
> you asked about.)
> Personally, I would probably not use a Partitions array
> at all and instead write a function to go from word
> hash -> locale like this (e.g.):
> 
>   proc localeForWord(word: string): locale {
>     return Locales[genHashKey32(word) % Locales.size];
>   }
> 
> But, you presumably have some reason to want to possibly
> have multiple partitions on a particular locale. I would
> probably do that with something that can compute the
> partition/locale in a consistent manner without using
> an array at all. The reason is that accesses to the array
> cause communication. An alternative would be to replicate the
> Partitions array on all locales once it's set up. One
> way to do that would be to use the ReplicatedDist()
> distribution, which you can see shown in
> test/{release/}examples/primers/distributions.chpl
> 
> In any case, in this actual program these effects are
> limited to the program initialization...
> 
>>   var wordIndex = new WordIndex();
> 
> This allocates heap memory, of course...
> 
>>   var t: Timer;
>>   t.start();
>> 
>>   var infile = open("words.txt", iomode.r);
>>   var reader = infile.reader();
>>   var word: string;
>>   while (reader.readln(word)) {
> 
> As I mentioned, we'd like to provide 'forall file.lines()',
> which would probably be nice here. Also, the current implementation
> has problems with string memory leaks that we're actively working
> on. The read string or a copy of it might be leaked. If you just
> wanted a serial 'for all lines in file', you can do that with
> 
>   for word in infile.lines() {
>     wordIndex.indexWord(word);
>   }
> 
> file.lines() is demonstrated in
> test/{release/}examples/primers/fileIO.chpl
> 
>>     wordIndex.indexWord(word);
>>   }
>> 
>>   t.stop();
>>   timing("indexing complete in ", t.elapsed(TimeUnits.microseconds), " microseconds");
>> }
> 
> Thanks for trying all of this out,
> 
> -michael
