Thanks to Thomas and Avery for the response.

> For Giraph you are quite correct, all the stuff is submitted as a MR job.
But a full map stage is not a superstep, the whole computation is a done in
one mapping phase.

So a map task in MR corresponds to a computation phase in a superstep. Once
the computation phase for a superstep is complete, the vertex output is
stored using the defined OutputFormat, the message sent (may be) to another
vertex and the map task is stopped. Once the barrier synchronization phase
is complete, another set of map tasks are invoked for the vertices which
have received a message.

In a regular MR Job (not Giraph) the number of Map tasks equals to the
number of InputSplits. But, in case of Giraph the total number of maps to
be launched is usually more than the number of input vertices.

Please let me know if I am correct.

> Where are the incoming, outgoing messages and state stored
> Memory

What happens if a particular node is lost in case of Hama and Giraph? Are
the messages not persisted somewhere to be fetched later.

> In Giraph, vertices can move around workers between supersteps.  A vertex
will run on the worker that it is assigned to.

Is data locality considered while moving vertices around workers in Giraph?

> As you can see, you could write a MapReduce Engine with BSP on top of
Apache Hama.

It's being the done other way, BSP is implemented in Giraph using Hadoop.

Praveen

On Fri, Dec 9, 2011 at 12:51 PM, Avery Ching <ach...@apache.org> wrote:

>  Hi Praveen,
>
> Answers inline.  Hope that helps!
>
> Avery
>
> On 12/8/11 10:16 PM, Praveen Sripati wrote:
>
> Hi,
>
> I know about MapReduce/Hadoop and trying to get myself around
> BSP/Hama-Giraph by comparing MR and BSP.
>
> - Map Phase in MR is similar to Computation Phase in BSP. BSP allows for
> process to exchange data in the communication phase, but there is no
> communication between the mappers in the Map Phase. Though the data flows
> from Map tasks to Reducer tasks. Please correct me if I am wrong. Any other
> significant differences?
>
> I suppose you can think of it that way.  I like to compare a BSP superstep
> to a MapReduce job since it's computation and communication.
>
> - After going through the documentation for Hama and Giraph, noticed that
> they both use Hadoop as the underlying framework. In both Hama and Giraph
> an MR Job is submitted. Does each superstep in BSP correspond to a Job in
> MR? Where are the incoming, outgoing messages and state stored - HDFS or
> HBase or Local or pluggable?
>
>  My understanding of Hama is that they have their own BSP framework.
> Giraph can be run on a Hadoop installation, it does not have its own
> computational framework.  A Giraph job is submitted to a Hadoop
> installation as a Map-only job.  Hama will have its own BSP lauching
> framework.
>
> In Giraph, the state is stored all in memory.  Graphs are loaded/stored
> through VertexInputFormat/VertexOutputFormat (very similar to Hadoop).  You
> could implement your own VertexInputFormat/VertexOutputFormat to use HDFS,
> HBase, etc. as your graph stable storage.
>
> - If a Vertex is deactivated and again activated after receiving a
> message, does is run on the same node or a different node in the cluster?
>
>  In Giraph, vertices can move around workers between supersteps.  A vertex
> will run on the worker that it is assigned to.
>
> Regards,
> Praveen
>
>
>

Reply via email to