Avery Ching commented on GIRAPH-11:

Hash partitioning will be based on hashCode() by default, but users can 
supply their own function of the vertex id.  I am designing it to support both 
hash-based and hash-range-based distribution.  A pure hash-based distribution 
should give good load balancing.  A hash-range-based distribution could give 
the user some locality benefits without changing anything from the hash-based 
partitioning.  Finally, there should be a way for the user to do a pure 
range-based split of the id space, but this requires the most work from the 
user, who must specify the division of the id space (which depends on the id 
type).
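
To make the three schemes concrete, here is a rough sketch of how a vertex id 
could be mapped to a partition under each one.  This is only an illustration 
under my own assumptions: the class and method names are hypothetical and not 
part of Giraph, the hash space is assumed to be the 32-bit range of 
hashCode(), and the range-based variant assumes long ids with user-supplied 
sorted split points.

```java
import java.util.Arrays;

/** Hypothetical sketch; none of these names are Giraph API. */
public class PartitionSketch {
    /** Pure hash: spread ids by hashCode() modulo the partition count. */
    static int hashPartition(Object vertexId, int partitionCount) {
        // floorMod keeps the result non-negative for negative hash codes.
        return Math.floorMod(vertexId.hashCode(), partitionCount);
    }

    /** Hash-range: cut the 32-bit hash space into equal contiguous ranges. */
    static int hashRangePartition(Object vertexId, int partitionCount) {
        // Shift the hash into 0 .. 2^32-1, then scale it to 0 .. partitionCount-1.
        long offset = (long) vertexId.hashCode() - Integer.MIN_VALUE;
        return (int) ((offset * partitionCount) >>> 32);
    }

    /** Pure range: the user supplies sorted split points over the id space. */
    static int rangePartition(long vertexId, long[] splitPoints) {
        // Partition i holds ids in [splitPoints[i-1], splitPoints[i]).
        int pos = Arrays.binarySearch(splitPoints, vertexId);
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }
}
```

With Integer ids, whose hashCode() is the value itself, ids 5 and 6 fall into 
different partitions under hashPartition(id, 4) but into the same one under 
hashRangePartition(id, 4), which is the kind of locality benefit I mean.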

The hash-based and hash-range-based schemes will be implemented by default and 
will be selectable by users.  The range-based scheme will be a partial 
implementation, since users must supply the id range partitioning themselves.  
Additionally, we will provide an API for users to implement their own graph 
partitioning scheme.
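
As a rough idea of what a user-defined scheme could look like, here is a 
minimal sketch.  The interface and class names below are hypothetical 
placeholders, not a settled API: one implementation mirrors the hashCode() 
default, and the other is an example user scheme that keeps string ids with 
the same prefix (the part before ':') in the same partition.

```java
/** Hypothetical sketch of a pluggable partitioning API; not Giraph code. */
public class CustomPartitionerSketch {
    /** Assumed plug-in point: maps a vertex id to a partition index. */
    interface VertexPartitioner<I> {
        int partitionFor(I vertexId, int partitionCount);
    }

    /** Mirrors the default behaviour: partition by hashCode(). */
    static final class HashCodePartitioner<I> implements VertexPartitioner<I> {
        @Override
        public int partitionFor(I vertexId, int partitionCount) {
            return Math.floorMod(vertexId.hashCode(), partitionCount);
        }
    }

    /** Example user scheme: hash only the prefix before ':' of a string id. */
    static final class PrefixPartitioner implements VertexPartitioner<String> {
        @Override
        public int partitionFor(String vertexId, int partitionCount) {
            int colon = vertexId.indexOf(':');
            // Ids without a ':' are hashed whole.
            String key = colon < 0 ? vertexId : vertexId.substring(0, colon);
            return Math.floorMod(key.hashCode(), partitionCount);
        }
    }
}
```

Under this sketch, ids such as "us:1" and "us:2" always share a partition, 
which is one way a user could encode a locality hint in the id itself.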

Let me know what you think.

> Improve the graph distribution of Giraph
> ----------------------------------------
>                 Key: GIRAPH-11
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-11
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Avery Ching
>            Assignee: Avery Ching
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted. 
>  If the user data is not sorted by the vertex id, they must first run a 
> MapReduce or Pig job to generate a sorted dataset.  This is often a bit 
> inconvenient.
> Giraph graph partitioning is currently range based and there are some 
> advantages and disadvantages of this approach.  The proposal of this JIRA 
> would be to allow for both range and hash based partitioning and provide more 
> flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be ordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range 
> based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges


