[ 
https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098098#comment-13098098
 ] 

Avery Ching commented on GIRAPH-11:
-----------------------------------

I'm going to assume you're asking about the current partitioning.  If I'm 
wrong, I'll address what we plan to do in the future.  The current partitioning 
assumes that the input splits are globally sorted (e.g. two input splits, 
{A, B, C} and {D, E}).  It breaks the input splits into vertex ranges whose 
boundaries do not change.  These vertex ranges can be moved between workers 
by one of several balancers.  The balancer can be set 
via setVertexRangeBalancerClass() from GiraphJob or with the right 
configuration parameter (giraph.vertexRangeBalancerClass).  We have some 
implementations for a static balancer (no vertex movement, default), and an 
auto balancer (configurable to balance based on vertices or edges).  You're 
free to implement your own as well.  Hope that answers some of the questions, 
let me know if you have more.
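
To make the idea of fixed vertex ranges and pluggable balancers concrete, here 
is a minimal, self-contained sketch.  It is not Giraph code: the class 
VertexRangeSketch, its methods, and the greedy balancing rule are all 
hypothetical, and it assumes one vertex range per sorted input split.  The 
static case corresponds to leaving the initial owners alone; the rebalancing 
method only changes which worker owns a range, never the range boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names are Giraph classes.
class VertexRangeSketch {
    /** A vertex range identified by a fixed upper boundary (its largest vertex id). */
    static final class VertexRange {
        final String maxVertexId; // boundary, never changes after creation
        final int vertexCount;
        int worker;               // current owner, the only thing a balancer changes

        VertexRange(String maxVertexId, int vertexCount, int worker) {
            this.maxVertexId = maxVertexId;
            this.vertexCount = vertexCount;
            this.worker = worker;
        }
    }

    /** Assumes globally sorted splits; builds one range per split, owners assigned round-robin. */
    static List<VertexRange> fromSortedSplits(List<List<String>> splits, int numWorkers) {
        List<VertexRange> ranges = new ArrayList<>();
        for (int i = 0; i < splits.size(); i++) {
            List<String> split = splits.get(i);
            String boundary = split.get(split.size() - 1);
            ranges.add(new VertexRange(boundary, split.size(), i % numWorkers));
        }
        return ranges;
    }

    /** Greedy rebalance by vertex count: largest ranges first, each to the least-loaded worker. */
    static void rebalanceByVertexCount(List<VertexRange> ranges, int numWorkers) {
        int[] load = new int[numWorkers];
        List<VertexRange> bySize = new ArrayList<>(ranges);
        bySize.sort((a, b) -> Integer.compare(b.vertexCount, a.vertexCount));
        for (VertexRange range : bySize) {
            int target = 0;
            for (int w = 1; w < numWorkers; w++) {
                if (load[w] < load[target]) {
                    target = w;
                }
            }
            range.worker = target;
            load[target] += range.vertexCount;
        }
    }

    /** Number of vertices currently owned by each worker. */
    static int[] loadPerWorker(List<VertexRange> ranges, int numWorkers) {
        int[] load = new int[numWorkers];
        for (VertexRange range : ranges) {
            load[range.worker] += range.vertexCount;
        }
        return load;
    }
}
```

A balancer that weighs edges instead of vertices would follow the same shape, 
with an edge count per range in place of vertexCount.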

> Improve the graph distribution of Giraph
> ----------------------------------------
>
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted. 
>  If the user data is not sorted by the vertex id, they must first run a 
> MapReduce or Pig job to generate a sorted dataset.  This is often a bit 
> inconvenient.
> Giraph graph partitioning is currently range based and there are some 
> advantages and disadvantages of this approach.  The proposal of this JIRA 
> would be to allow for both range and hash based partitioning and provide more 
> flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be ordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (e.g. hash or range 
> based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges
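
The trade-off listed above between hash-based and range-based partitioning can 
be sketched in a few lines.  This is an illustration under assumptions, not the 
design proposed in the issue: the class PartitionSchemeSketch and both methods 
are hypothetical, and vertex ids are assumed to be longs.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of Giraph.
class PartitionSchemeSketch {
    /** Hash-based: the owner depends only on the id's hash, so neighboring ids scatter. */
    static int hashPartition(long vertexId, int numPartitions) {
        return Math.floorMod(Long.hashCode(vertexId), numPartitions);
    }

    /**
     * Range-based: upperBounds is sorted ascending and partition i holds the ids
     * greater than upperBounds[i - 1] and at most upperBounds[i].
     * Ids above the last bound fall into the last partition.
     */
    static int rangePartition(long vertexId, long[] upperBounds) {
        int pos = Arrays.binarySearch(upperBounds, vertexId);
        int index = pos >= 0 ? pos : -pos - 1; // decode the insertion point when not found
        return Math.min(index, upperBounds.length - 1);
    }
}
```

With bounds {99, 199, 299}, the consecutive ids 100, 101 and 102 all land in 
range partition 1 (good locality, but a hot range stays on one worker), while 
hashPartition with three partitions spreads them over partitions 1, 2 and 0.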

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
