Why doesn't JavaSpaces let you submit requests, have them worked on, and have the results returned, without assigning any particular node any specific responsibility?

Gregg
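A minimal sketch of the take/write master/worker idiom that question alludes to, against net.jini.space.JavaSpace. The TaskEntry, ResultEntry, and Worker names are hypothetical, and discovery of the space proxy is omitted.

import net.jini.core.entry.Entry;
import net.jini.core.lease.Lease;
import net.jini.space.JavaSpace;

// Hypothetical entries for illustration only (each would live in its own source file).
// Entry fields must be public object types, and a public no-arg constructor is required.
class TaskEntry implements Entry {
    public Integer taskId;
    public String payload;
    public TaskEntry() { }
    public TaskEntry(Integer taskId, String payload) { this.taskId = taskId; this.payload = payload; }
}

class ResultEntry implements Entry {
    public Integer taskId;
    public String result;
    public ResultEntry() { }
    public ResultEntry(Integer taskId, String result) { this.taskId = taskId; this.result = result; }
}

class Worker {
    // The JavaSpace proxy would come from lookup/discovery, which is omitted here.
    static void workLoop(JavaSpace space) throws Exception {
        TaskEntry anyTask = new TaskEntry();   // null fields act as wildcards
        while (true) {
            // Whichever worker is free takes the next request; no node owns a fixed role.
            TaskEntry task = (TaskEntry) space.take(anyTask, null, Long.MAX_VALUE);
            String answer = task.payload.toUpperCase();   // stand-in for real work
            space.write(new ResultEntry(task.taskId, answer), null, Lease.FOREVER);
        }
    }
}

The submitting side just writes TaskEntry objects and later takes back matching ResultEntry objects, so any number of interchangeable workers can pull from the same space without any node being assigned a fixed role.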
> On Aug 1, 2015, at 12:06 PM, Bryan Thompson <br...@systap.com> wrote:
>
> First, thanks for the responses and the interest in this topic. I have been traveling for the last few days and have not had a chance to follow up.
>
> - The network would have multiple switches, but not multiple routers. The typical target is an InfiniBand network. Of course, Java lets us bind to InfiniBand now.
>
> - As far as I understand it, MPI relies on each node executing the same logic in a distributed communication pattern, so the concept of a leader election to determine a balanced tree probably does not show up. Instead, the tree is expressed in terms of the MPI rank assigned to each node. I am not suggesting that the same design pattern be used for River.
>
> - We do need a means to define the relationship between a distributed communication pattern and the manner in which data are decomposed onto a cluster. I am not sure that the proposal above gives us this directly, but some extension of it probably would. Let me give an example. In our application, we are distributing the edges of a graph among a 2-D cluster of compute nodes (p x p machines). The distribution is done by assigning the edges to compute nodes based on some function (key range, hash function) of the source and target vertex identifiers. When we want to read all edges in the graph, we need to do an operation that is data parallel across either the rows (in-edges) or the columns (out-edges) of the cluster. See http://mapgraph.io/papers/UUSCI-2014-002.pdf for a TR that describes this communication pattern for a p x p cluster of GPUs. In order to make this work with River, we would somehow have to associate the nodes with their positions in this 2-D topology. For example, we could annotate each node with a "row" and "column" attribute that specifies its location in the compute grid. We could then have a communicator for each row and each column, based on the approach you suggest above.
>
> The advantage of such tree-based communication patterns is quite large. They require log(p) communication operations where you would otherwise do p communication operations. So, for example, only 4 communication operations vs 16 for a 16-node cluster.
>
> Thanks,
> Bryan
>
>
> On Wed, Jul 29, 2015 at 1:17 PM, Greg Trasuk <tras...@stratuscom.com> wrote:
>
>> I've wondered about doing this in the past, but for the workloads I've worked with, I/O time has been relatively low compared to processing time. I'd guess there's some combination of message frequency, cluster size, and message size that makes it compelling.
>>
>> The idea is interesting, though, because it could enable things like distributed JavaSpaces, where we'd be distributing the search queries, etc.
>>
>> I would guess the mechanism would look like:
>>
>> - Member nodes want to form a multicast group.
>> - They elect a leader.
>> - The leader figures out a balanced notification tree and passes it on to each member.
>> - The leader receives a multicast message and starts the message passing into the tree.
>> - Recipients pass the message to local recipients, and also to their designated repeater recipients (how many?).
>> - Somehow we monitor for disappearing members and then recast the leader election if necessary.
>>
>> A Paxos-style protocol would be involved, I'd guess. Does anyone have references to any academic work on presence monitoring and leader election, beyond Lamport's original paper?
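As a concrete picture of the log(p) schedule being discussed, here is a small sketch of a binomial-tree broadcast over ranks 0..p-1 (rank 0 as the root), the shape MPI-style broadcasts typically use. How River nodes would be mapped onto ranks, for example via the "row"/"column" attributes Bryan mentions with one tree per row or column communicator, is assumed rather than shown.

import java.util.ArrayList;
import java.util.List;

// Binomial-tree broadcast schedule over ranks 0..p-1; rank 0 is the root.
class BinomialBroadcast {

    // Ranks that 'rank' forwards the message to, in round order.
    static List<Integer> forwardTargets(int rank, int p) {
        List<Integer> targets = new ArrayList<>();
        // Rank 0 holds the message from the start; rank r first receives it in
        // round floor(log2(r)) and can forward from the following round on.
        int firstSendRound = (rank == 0) ? 0 : (31 - Integer.numberOfLeadingZeros(rank)) + 1;
        for (int k = firstSendRound; (1 << k) < p; k++) {
            int target = rank + (1 << k);
            if (target < p) {
                targets.add(target);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        int p = 16;
        for (int rank = 0; rank < p; rank++) {
            System.out.println("rank " + rank + " forwards to " + forwardTargets(rank, p));
        }
        // All 16 ranks are reached in log2(16) = 4 rounds; no node sends more
        // than 4 point-to-point messages, instead of one node sending 15.
    }
}

For p = 16 the broadcast completes in log2(16) = 4 rounds, which is the 4-vs-16 saving quoted above.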
>> I also wonder: is there a reason not to just use multicast if it's available? (I realize that it isn't always supported - Amazon EC2, for instance.)
>>
>> Interesting question!
>>
>> Cheers,
>>
>> Greg Trasuk
>>
>>> On Jul 29, 2015, at 12:40 PM, Bryan Thompson <br...@systap.com> wrote:
>>>
>>> Hello,
>>>
>>> I am wondering if anyone has looked into creating tree-based algorithms for multicast of RMI messages for River. Assuming a local cluster, such patterns generally have log(p) cost for a cluster with p nodes.
>>>
>>> For the curious, this is how many MPI messages are communicated under the hood.
>>>
>>> Thanks,
>>> Bryan
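To make "tree-based multicast of RMI messages" concrete, one possible shape for the member side is sketched below. The TreeMember interface and its synchronous forwarding are illustrative assumptions, not an existing River API; the leader election, tree construction, and failure handling discussed above are left out.

import java.io.Serializable;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.Collections;
import java.util.List;

// Hypothetical remote interface for one member of a notification tree.
interface TreeMember extends Remote {
    void setChildren(List<TreeMember> children) throws RemoteException;
    void deliver(Serializable message) throws RemoteException;
}

// Sketch of a member: deliver locally, then repeat to the children the leader assigned.
// Exporting the object and registering it with the group are omitted.
class TreeMemberImpl implements TreeMember {
    private volatile List<TreeMember> children = Collections.emptyList();

    public void setChildren(List<TreeMember> children) {
        this.children = children;
    }

    public void deliver(Serializable message) throws RemoteException {
        handleLocally(message);                  // local recipients first
        for (TreeMember child : children) {      // then the designated repeaters
            child.deliver(message);              // in practice this would be asynchronous
        }
    }

    private void handleLocally(Serializable message) {
        System.out.println("received: " + message);
    }
}

With a balanced tree and bounded fan-out, a message reaches every member after O(log p) forwarding hops instead of the sender making p - 1 calls itself.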