[
https://issues.apache.org/jira/browse/GIRAPH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eli Reisman updated GIRAPH-247:
-------------------------------
Attachment: GIRAPH-247-1.patch
> Introduce edge based partitioning for InputSplits
> -------------------------------------------------
>
> Key: GIRAPH-247
> URL: https://issues.apache.org/jira/browse/GIRAPH-247
> Project: Giraph
> Issue Type: Improvement
> Components: graph
> Affects Versions: 0.2.0
> Reporter: Eli Reisman
> Assignee: Eli Reisman
> Priority: Minor
> Labels: patch
> Fix For: 0.2.0
>
> Attachments: GIRAPH-247-1.patch
>
>
> Experiments on larger data input sets while maintaining low memory profile
> has revealed that typical social graph data is very lumpy and partitioning by
> vertices can easily overload some unlucky worker nodes who end up with
> partitions containing highly-connected vertices while other nodes process
> partitions with the same number of vertices but far fewer out-edges per
> vertex. This often results in cascading failures during data load-in even on
> tiny data sets.
> By partitioning using edges (the default I set in
> GiraphJob.MAX_EDGES_PER_PARTITION_DEFAULT is 200,000 per partition, or the
> old default # of vertices, whichever the user's input format reaches first
> when reading InputSplits) I have seen dramatic "de-lumpification" of data,
> allow the processing of 8x larger data sets before memory problems occur at a
> given configuration setting.
> This needs more tuning, but comes with a -Dgiraph.maxEdgesPerPartition that
> can be set to more edges/partition as your data sets grow or memory
> limitations shrink. This might be considered a first attempt, perhaps simply
> allowing us to default to this type of partitioning or the old version would
> be more compatible with existing users' needs? That would not be a hard
> feature to add to this. But I think this method of partition production has
> merit for typical large-scale graph data that Giraph is designed to process.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira