Of course the second problem is how to partition the graph so that the sync is minimized... Because as far as I know, when the data has been partitioned to some machine, it never moves from those machine (other machines get them through sync), am I right?. Or maybe the data may be moved among machines if it leads to lesser number of sync?
Anyway, I'm currently researching on running max-flow algorithm on top of MapReduce. So I was looking whether BSP is a good add on. It turns out that Hama's BSP is quite different from MapReduce. Not exactly in form of map() and reduce() function anymore... I was thinking BSP as an add on to MapReduce: In current MR, the map function is supplied one <key,value> at a time. If the map() function is modified so that it takes in a RANGE of <key,value> pairs of certain range, then in some sense it becomes BSP (as explained in my comment here: http://blog.udanax.org/2009/02/breadth-first-search-mapreduce.html#comments). However, Hama's BSP seemed to take a different approach... Felix Halim On Wed, Feb 3, 2010 at 7:08 PM, Hyunsik Choi <[email protected]> wrote: > Hi Felix Halim, > > Such terrible global sync problem may only occurs in a few of problems > that have global states. We are trying to research better ways. > However, many problems (e.g., finding isomorphic subgraphs, computing > clustering coefficient, and finding cohesive groups) that have global > states don't have such global sync problems. The graph package of Hama > is appropriate to these problems. > > We will think about PathOthersHaveFollowed(). Thank you for your > advice! If you are interested in this project, why don't you > participate in our project? > > Best regards, > -- > Hyunsik Choi > Database & Information Systems Group, Korea Univ. > http://diveintodata.org > > > On Wed, Feb 3, 2010 at 5:54 PM, Felix Halim <[email protected]> wrote: >> >> On Wed, Feb 3, 2010 at 4:41 PM, Edward J. Yoon <[email protected]> wrote: >> >>> I think the needToVisit() function might as well need to communicate >> >>> with other machine: >> > >> > Hmm, You're exactly right. In that example, needToVisit() function >> > checks the IsVisited from some shared-space (e.g., HBase or DBMS, ..., >> > etc). We wrote with intent to simplify it. >> >> From the pseudocode, I see that every Vertex will request IsVisited. >> In a large graph, the HBase or DBMS will be overwhelmed by many tiny >> requests from each Vertex. >> Does the needToVisit() has "bulk query" that aggregates the tiny >> requests into a single request? >> >> >> > Integer is the distance at "Map<Vertex, Integer> input, Map<Vertex, >> > Integer> nextQueue", but it could be replaced as other object. for >> > example, new PathWeHaveFollowed(). Then, perhaps we need not some >> > shared-space. >> >> But we also need to know PathOthersHaveFollowed() don't we? >> That's why we need HBase to store the global "visited" states of each node. >> >> Felix Halim >
