You can certainly implement BSP on top of a MapReduce implementation. But this is going to be very very expensive. Consider that all communication in MapReduce will go through the phase of storing map outputs locally (disk) before being send to the reducer. Also, consider than the entire graph must be loaded and stored during each MapReduce job. In Giraph, the graph is loaded once prior to the first superstep and stored once at the end of the last superstep. All messaging is currently done in memory.

Another way to think about it would be that Giraph breaks down to a MapReduce implementation if you checkpoint every superstep (minus the re-loading of the graph).


Hope that helps,

Avery

On 12/10/11 9:26 AM, Praveen Sripati wrote:
Avery,

> Communication between mappers is not part of the MapReduce computing model. Therefore, it doesn't make sense for them to include it as it would unnecessarily complicate the fault-tolerance recovery.

I agree that it doesn't make sense to complicate things by introducing communication between mappers.

But, my original query was, why can't the RPC communication be avoided with mappers in Giraph similar to MR by not running multiple supersteps in a single map process. If I am not wrong Hadoop supports MMR (multiple Maps) type of jobs and each map can map to the computation phase in a single superstep. Agree, that there is an over head of launching maps again and again, but communication between the mappers can be avoided.

I was trying to figure out the rational behind the approach taken in Giraph.

Regards,
Praveen

On Sat, Dec 10, 2011 at 10:44 PM, Avery Ching <ach...@apache.org <mailto:ach...@apache.org>> wrote:

    On 12/9/11 10:22 PM, Praveen Sripati wrote:
    Jack,

    > Giraph maps do communicate: via RPC.  This is done repeatedly
    in every mapper, during the compute phase.  This is something
    that is not normal to MapReduce, it is special to Giraph.

    There must have been definitely some thought around this. But, we
    can also have a mapper correspond to just the computation phase
    in a superstep and avoid communication between the mappers as in
    MapReduce. Later spawn another set of mappers for the next
    superset. There might be some reason why communication between
    mappers was avoided in MR.

    Communication between mappers is not part of the MapReduce
    computing model.  Therefore, it doesn't make sense for them to
    include it as it would unnecessarily complicate the
    fault-tolerance recovery.


    Any thoughts?

    Regards,
    Praveen

    On Sat, Dec 10, 2011 at 10:35 AM, Jake Mannix
    <jake.man...@gmail.com <mailto:jake.man...@gmail.com>> wrote:



        On Fri, Dec 9, 2011 at 8:16 PM, Praveen Sripati
        <praveensrip...@gmail.com <mailto:praveensrip...@gmail.com>>
        wrote:

            Jake,


            > Let's not crosspost, please, it make the thread of
            conversation totally opaque as to who is talking about what.

            Agree. I got it after the OP.


            > There is only one set of map tasks for the Giraph job -
            those long-running map tasks run possibly many supersteps.

            OK. But, map tasks don't communicate with each other. How
            are messages sent across in the communication phase of a
            super step that happens within a map?


        Giraph maps do communicate: via RPC.  This is done repeatedly
        in every mapper, during the compute phase.  This is something
        that is not normal to MapReduce, it is special to Giraph.

            > In Giraph, vertices can move around workers between
            supersteps.  A vertex will run on the worker that it is
            assigned to.

            Is there any advantage of moving the processing of
            vertices from one worker to another. Can't there be
            affinity between a worker and the vertices it processes?


        Often there will be affinity, but if the graph itself evolves
        during computation (some sort of iterative pruning or
        clustering), then moving around may make sense.  Also: if
        nodes die.
          -jake


            Regards,
            Praveen

            On Fri, Dec 9, 2011 at 11:33 PM, Jake Mannix
            <jake.man...@gmail.com <mailto:jake.man...@gmail.com>> wrote:

                [hama-user to bcc:]

                Let's not crosspost, please, it make the thread of
                conversation totally opaque as to who is talking
                about what.

                On Fri, Dec 9, 2011 at 1:42 AM, Praveen Sripati
                <praveensrip...@gmail.com
                <mailto:praveensrip...@gmail.com>> wrote:

                    Thanks to Thomas and Avery for the response.

                    > For Giraph you are quite correct, all the stuff
                    is submitted as a MR job. But a full map stage is
                    not a superstep, the whole computation is a done
                    in one mapping phase.

                    So a map task in MR corresponds to a computation
                    phase in a superstep. Once the computation phase
                    for a superstep is complete, the vertex output is
                    stored using the defined OutputFormat, the
                    message sent (may be) to another vertex and the
                    map task is stopped. Once the barrier
                    synchronization phase is complete, another set of
                    map tasks are invoked for the vertices which have
                    received a message.


                In Giraph, each superstep does not lead to storage
                into an OutputFormat.  The data lives all in memory
                from the time the first superstep starts to the time
                the final superstep stops (except that for tolerance
                of failures, checkpoints are stored to disk at
                user-specified intervals).  There is only one set of
                map tasks for the Giraph job - those long-running map
                tasks run possibly many supersteps.

                    In a regular MR Job (not Giraph) the number of
                    Map tasks equals to the number of InputSplits.
                    But, in case of Giraph the total number of maps
                    to be launched is usually more than the number of
                    input vertices.


                Number of maps > number of input vertices?  Not at
                all.  That would be insane.  We want to be able to
                run over multi-billion vertex graphs.  We're going to
                launch multiple billions of mappers?   The splitting
                of the data in Giraph is very similar to in a regular
                MR job, divide up your input data among the number of
                mappers you have, and you're off and running.


                    > Where are the incoming, outgoing messages and
                    state stored
                    > Memory

                    What happens if a particular node is lost in case
                    of Hama and Giraph? Are the messages not
                    persisted somewhere to be fetched later.


                If nodes are lost, the system has to back up to the
                most recent checkpoint, where graph state has been
                persisted to HDFS.  Messages are not currently
                persisted, but the state at which the graph was in to
                produce any messages was.

                    > In Giraph, vertices can move around workers
                    between supersteps.  A vertex will run on the
                    worker that it is assigned to.

                    Is data locality considered while moving vertices
                    around workers in Giraph?


                Data is all in memory, and typical graph algorithms
                are basically sending roughly the size of the entire
                graph (number of total edges) out over distributed
                RPC in any given superstep, so shuffling the graph
                around by RPC is not much more to do.


                    > As you can see, you could write a MapReduce
                    Engine with BSP on top of Apache Hama.

                    It's being the done other way, BSP is implemented
                    in Giraph using Hadoop.


                I'll let the Hama people explain to you about how one
                would implement MR on top of Hama.  You are correct
                that in Giraph, the Hadoop JobTracker/TaskTracker and
                HDFS are used as substrate to help implement BSP
                (although I would not say that "MR" is being used to
                implement BSP, as there is no MR going on in Giraph).

                  -jake



                    Praveen

                    On Fri, Dec 9, 2011 at 12:51 PM, Avery Ching
                    <ach...@apache.org <mailto:ach...@apache.org>> wrote:

                        Hi Praveen,

                        Answers inline.  Hope that helps!

                        Avery

                        On 12/8/11 10:16 PM, Praveen Sripati wrote:
                        Hi,

                        I know about MapReduce/Hadoop and trying to
                        get myself around BSP/Hama-Giraph by
                        comparing MR and BSP.

                        - Map Phase in MR is similar to Computation
                        Phase in BSP. BSP allows for process to
                        exchange data in the communication phase,
                        but there is no communication between the
                        mappers in the Map Phase. Though the data
                        flows from Map tasks to Reducer tasks.
                        Please correct me if I am wrong. Any other
                        significant differences?
                        I suppose you can think of it that way.  I
                        like to compare a BSP superstep to a
                        MapReduce job since it's computation and
                        communication.
                        - After going through the documentation for
                        Hama and Giraph, noticed that they both use
                        Hadoop as the underlying framework. In both
                        Hama and Giraph an MR Job is submitted. Does
                        each superstep in BSP correspond to a Job in
                        MR? Where are the incoming, outgoing
                        messages and state stored - HDFS or HBase or
                        Local or pluggable?

                        My understanding of Hama is that they have
                        their own BSP framework.  Giraph can be run
                        on a Hadoop installation, it does not have
                        its own computational framework.  A Giraph
                        job is submitted to a Hadoop installation as
                        a Map-only job.  Hama will have its own BSP
                        lauching framework.

                        In Giraph, the state is stored all in
                        memory.  Graphs are loaded/stored through
                        VertexInputFormat/VertexOutputFormat (very
                        similar to Hadoop).  You could implement your
                        own VertexInputFormat/VertexOutputFormat to
                        use HDFS, HBase, etc. as your graph stable
                        storage.

                        - If a Vertex is deactivated and again
                        activated after receiving a message, does is
                        run on the same node or a different node in
                        the cluster?

                        In Giraph, vertices can move around workers
                        between supersteps.  A vertex will run on the
                        worker that it is assigned to.

                        Regards,
                        Praveen









Reply via email to