Hello, I'm going to present my thinking about why we need BSP for efficient processing of graph data. Even though I'll mention about only graph processing, I think that the essential problem is similar to that of matrix computation.
In MR, a mapper just reads a sequence of records from some given read-only partitions. In order to get further data from other partitions, the program needs to pass a reduce step. However, the reduce is for merging all intermediate data, and it is inappropriate to exchanging data between nodes. In addition, the implementation of a partitioner is very difficult if the relations of data is very complicated. Although assigning only one reduce is very straightforward method, it will definitely deteriorate scalability. In addition, In order to reduce data space the MR program has to pass a map phase, and the MR program will output a subset of the original data. If such a step is conducted many times in order to get the final result, the program incurs much communication overhead between two MR chains. During this step the data are rearranged and repartitioned by a simple hash partitioner; due to its inherit characteristics, the hash partitioner cannot consider locality of graph data. In such an environment, it is very difficult to preserve the data locality. As you know, data locality plays an important role for graph problems. In contrast, as you know, the essential idea of BSP is to communicate only necessary data among data nodes by using point to point communication. If original data are preserving data locality in terms of their connectivity, BSP enables each node to keep partitioned data and to process some computation with only additional information obtained from other partitions. Therefore, each node can determine a subset of the final result with additional information received from other nodes. Therefore, BSP is very suited for data processing having complex relations like graph data. Due to these reasons, the graph package should be implemented with BSP. Because of the same reason, the matrix package implemented with BSP would be more efficient. What do you think about my thinking? Best regards, -- Hyunsik Choi Database & Information Systems Group, Korea University http://diveintodata.org On Fri, Sep 18, 2009 at 6:30 PM, Edward J. Yoon <[email protected]>wrote: > Firstly, We need to share our plans and consider about overall > architecture. > > What's the BSP? What's the relationship between matrix and graph? > What's the plan of matrix and graph packages? What's the our main > goal? > > On Fri, Sep 18, 2009 at 5:52 PM, Apache Wiki <[email protected]> wrote: > > Dear Wiki user, > > > > You have subscribed to a wiki page or wiki category on "Hama Wiki" for > change notification. > > > > The following page has been changed by HyunsikChoi: > > http://wiki.apache.org/hama/GraphPackage > > > > New page: > > = The Graph Package (Angrapa) = > > The graph package, called Angrapa, is an large-scale graph data > management framework for analytical processing. It is still an ongoing > project. It will employ massive parallelism on Hadoop. It aims to achieve > the scalability for processing tera bytes or peta bytes graph data. Angrapa > will be used in a variety of scientific and industrial areas, such as data > mining, machine learning, information retrieval, bioinformatics, and social > networks, required to process large-scale graph data. > > > > = Description = > > The graph package is new programming framework for graph processing. > > > > = The Main Goal = > > * Easy APIs familar to graph features > > * Store structure suited to graph data when it comes to considering the > connectivity of graph data > > * Applying data communication method (i.e., BSP) without deterioration > of graph data locality > > > > > > -- > Best Regards, Edward J. Yoon @ NHN, corp. > [email protected] > http://blog.udanax.org >
