[
https://issues.apache.org/jira/browse/HAMA-359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015451#comment-13015451
]
Thomas Jungblut commented on HAMA-359:
--------------------------------------
For the first naive partitioning idea it is necessary to call this method:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#getEndKeys%28%29
This will return your maximum row key; in this case the IDs are positive, so
the range goes from zero to the max (the return value of the method).
Afterwards you distribute the ranges across the available grooms.
For an adjacency matrix this is going to be quite a good partitioning: every
row has the same length.
In my case this is a list, so you have to watch out for edge hotspots (in my
example there is a fixed number of edges, so this won't be critical at all).
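The naive idea can be sketched as a pure range split: given the maximum row key (as returned by the HTable call above) and the number of grooms, compute contiguous, evenly sized key ranges. This is only an illustrative sketch; the class and method names are hypothetical, not Hama API.

```java
// Hypothetical sketch of the naive range partitioning: split the key
// space [0, maxKey] evenly across the available grooms.
public class NaivePartitioner {

    // Returns the exclusive upper boundary of each groom's key range,
    // so groom i owns keys [bounds[i-1], bounds[i]) with bounds[-1] = 0.
    public static long[] rangeBoundaries(long maxKey, int numGrooms) {
        long[] bounds = new long[numGrooms];
        long span = maxKey + 1; // keys 0..maxKey inclusive
        for (int i = 0; i < numGrooms; i++) {
            bounds[i] = (i + 1) * span / numGrooms;
        }
        return bounds;
    }
}
```

For an adjacency matrix with fixed row length this split is also balanced in bytes, which is exactly why the naive scheme works there.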
*The longer-running partitioning:*
My advanced partitioning idea is that you set up an HTable scanner job and
pass the number of grooms to the job via the configuration.
In the map phase the job determines how many row bytes a specific key chunk
(say 10k vertices, or a size based on the number of grooms) currently
contains.
In the reduce phase you can then determine which chunk is a hotspot, or
whether the data is already evenly distributed. If the distribution is even,
you know you can use the naive partitioning.
But if it is not even, you'll have to chunk the hotspot(s) themselves (say,
divide them in two). Write out the chunked-hotspot boundaries (or put them
into a conf) and run the algorithm again until the data is better distributed.
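The detect-and-split loop above can be sketched in isolation, working on the per-chunk byte counts the scanner job would report. The 2x-of-average threshold and the assumption that a hotspot splits into two equal halves are both illustrative choices, not anything from Hama or HBase.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the iterative hotspot splitting: any chunk whose
// byte count exceeds twice the average is considered a hotspot and is
// replaced by two halves; repeat until the distribution is even.
public class HotspotSplitter {

    // "Even" here means: no chunk holds more than twice the average.
    public static boolean isEven(List<Long> chunkBytes) {
        long total = 0;
        for (long b : chunkBytes) total += b;
        double avg = (double) total / chunkBytes.size();
        for (long b : chunkBytes)
            if (b > 2 * avg) return false;
        return true;
    }

    // One round of splitting: hotspots are divided in two, assuming
    // rows are spread roughly evenly inside each chunk.
    public static List<Long> splitHotspots(List<Long> chunkBytes) {
        long total = 0;
        for (long b : chunkBytes) total += b;
        double avg = (double) total / chunkBytes.size();
        List<Long> out = new ArrayList<>();
        for (long b : chunkBytes) {
            if (b > 2 * avg) {
                out.add(b / 2);
                out.add(b - b / 2);
            } else {
                out.add(b);
            }
        }
        return out;
    }
}
```

In the real job the output would be new key boundaries rather than byte counts, but the termination logic is the same.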
So this is going to take a while, since the real partitioning problem is
NP-complete. I read about the Kernighan–Lin algorithm, which is a heuristic:
http://en.wikipedia.org/wiki/Kernighan%E2%80%93Lin_algorithm
This is actually an edge-weight partitioning; we should replace the edge
weight with the size of the row containing the edge.
Maybe you have a better (or at least faster) idea?
The next problem is the repartitioning:
If you are using Dijkstra, you can simply remove the already visited
lowest-weighted vertex. Its row then becomes overhead for the grooms in the
next step. If you are unlucky, every adjacency row in Groom1 will be
dismissed and Groom2 has to do all the work on its own.
So to avoid this, you have to keep track of the boundaries that are passed to
the grooms before each new step.
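The bookkeeping described above amounts to tracking how many unvisited rows each groom still owns and flagging when one groom carries a disproportionate share. A minimal sketch, assuming a 2x-of-average imbalance threshold (an assumption, not anything from Hama):

```java
// Illustrative load bookkeeping: flag when the remaining (unvisited)
// adjacency rows are skewed toward one groom, so a repartitioning of
// the boundaries can be triggered before the next step.
public class LoadTracker {

    // remaining[i] = unvisited adjacency rows still assigned to groom i
    public static boolean needsRepartitioning(int[] remaining) {
        int total = 0, max = 0;
        for (int r : remaining) {
            total += r;
            max = Math.max(max, r);
        }
        if (total == 0) return false;   // nothing left to do
        double avg = (double) total / remaining.length;
        return max > 2 * avg;           // one groom is overloaded
    }
}
```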
A big advantage of Google Pregel is the caching (I didn't actually read the
paper, but I read about it somewhere else): every "Pregel groom" has a really
large amount of memory to cache the chunk it is working on.
So we should not repartition too often, but this is currently just a thought,
so skip that caching stuff.
This is a cool page too:
http://www.eecs.berkeley.edu/~demmel/cs267/lecture18/lecture18.html#link_4.2
> Development of Shortest Path Finding Algorithm
> ----------------------------------------------
>
> Key: HAMA-359
> URL: https://issues.apache.org/jira/browse/HAMA-359
> Project: Hama
> Issue Type: New Feature
> Components: examples
> Affects Versions: 0.2.0
> Reporter: Edward J. Yoon
> Assignee: Edward J. Yoon
> Labels: gsoc, gsoc2011, mentor
> Fix For: 0.3.0
>
> Original Estimate: 2016h
> Remaining Estimate: 2016h
>
> The goal of this project is development of parallel algorithm for finding a
> Shortest Path using Hama BSP.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira