Re: [Hama Wiki] Update of "GraphPackage" by HyunsikChoi

Hyunsik Choi Sun, 20 Sep 2009 06:18:03 -0700

Hello,

I'm going to present my thinking about why we need BSP for efficient
processing of graph data. Even though I'll mention about only graph
processing, I think that the essential problem is similar to that of matrix
computation.

In MR, a mapper just reads a sequence of records from some given read-only
partitions. In order to get further data from other partitions, the program
needs to pass a reduce step. However, the reduce is for merging all
intermediate data, and it is inappropriate to exchanging data between
nodes. In addition, the implementation of a partitioner is very difficult if
the relations of data is very complicated. Although assigning only one
reduce is very straightforward method, it will definitely deteriorate
scalability.

In addition, In order to reduce data space the MR program has to pass a map
phase, and the MR program will output a subset of the original data. If such
a step is conducted many times in order to get the final result, the program
incurs much communication overhead between two MR chains. During this step
the data are rearranged and repartitioned by a simple hash partitioner; due
to its inherit characteristics, the hash partitioner cannot
consider locality of graph data. In such an environment, it is very
difficult to preserve the data locality. As you know, data locality plays an
important role for graph problems.

In contrast, as you know, the essential idea of BSP is to communicate only
necessary data among data nodes by using point to point communication. If
original data are preserving data locality in terms of their connectivity,
BSP enables each node to keep partitioned data and to process some
computation with only additional information obtained from other partitions.
Therefore, each node can determine a subset of the final result with
additional information received from other nodes. Therefore, BSP is very
suited for data processing having complex relations like graph data.

Due to these reasons, the graph package should be implemented with BSP.
Because of the same reason, the matrix package implemented with BSP would be
more efficient.

What do you think about my thinking?

Best regards,
--
Hyunsik Choi
Database & Information Systems Group, Korea University
http://diveintodata.org

On Fri, Sep 18, 2009 at 6:30 PM, Edward J. Yoon <[email protected]>wrote:

> Firstly, We need to share our plans and consider about overall
> architecture.
>
> What's the BSP? What's the relationship between matrix and graph?
> What's the plan of matrix and graph packages? What's the our main
> goal?
>
> On Fri, Sep 18, 2009 at 5:52 PM, Apache Wiki <[email protected]> wrote:
> > Dear Wiki user,
> >
> > You have subscribed to a wiki page or wiki category on "Hama Wiki" for
> change notification.
> >
> > The following page has been changed by HyunsikChoi:
> > http://wiki.apache.org/hama/GraphPackage
> >
> > New page:
> > = The Graph Package (Angrapa) =
> > The graph package, called Angrapa, is an large-scale graph data
> management framework for analytical processing. It is still an ongoing
> project. It will employ massive parallelism on Hadoop. It aims to achieve
> the scalability for processing tera bytes or peta bytes graph data. Angrapa
> will be used in a variety of scientific and industrial areas, such as data
> mining, machine learning, information retrieval, bioinformatics, and social
> networks, required to process large-scale graph data.
> >
> > = Description =
> > The graph package is new programming framework for graph processing.
> >
> > = The Main Goal =
> >  * Easy APIs familar to graph features
> >  * Store structure suited to graph data when it comes to considering the
> connectivity of graph data
> >  * Applying data communication method (i.e., BSP) without deterioration
> of graph data locality
> >
>
>
>
> --
> Best Regards, Edward J. Yoon @ NHN, corp.
> [email protected]
> http://blog.udanax.org
>

Re: [Hama Wiki] Update of "GraphPackage" by HyunsikChoi

Reply via email to