Ankur Dave created SPARK-1987:
---------------------------------
Summary: More memory-efficient graph construction
Key: SPARK-1987
URL: https://issues.apache.org/jira/browse/SPARK-1987
Project: Spark
Issue Type: Improvement
Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave
A graph's edges are usually the largest component of the graph. GraphX
currently stores edges in parallel primitive arrays, so each edge should only
take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the
current implementation in EdgePartitionBuilder uses an array of Edge objects as
an intermediate representation for sorting, so each edge additionally takes
about 40 bytes during graph construction (srcId (8) + dstId (8) + attr (4) +
uncompressed pointer (8) + object overhead (8) + padding (4)). Peak usage
during construction is therefore about 60 bytes per edge instead of 20, which
unnecessarily increases GraphX's memory requirements by a factor of 3.
To save memory, EdgePartitionBuilder should instead use a custom sort routine
that operates directly on the three parallel arrays.
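As a rough illustration of the idea, here is a minimal sketch of an in-place
sort over three parallel arrays. It is not the GraphX implementation: the class
name ParallelEdgeSort is hypothetical, the sort key (srcId, then dstId) is an
assumption, and it is written in Java rather than Scala only to keep the
example self-contained. The point is that every swap moves the same index in
all three arrays, so no Edge objects are ever allocated.

```java
public class ParallelEdgeSort {
    // Compare edge i against a pivot key, ordering by srcId and then dstId.
    // (The sort key is an assumption made for this sketch.)
    private static int compare(long[] src, long[] dst, int i, long pSrc, long pDst) {
        int c = Long.compare(src[i], pSrc);
        return c != 0 ? c : Long.compare(dst[i], pDst);
    }

    // Swap edges i and j in all three arrays so the columns stay aligned.
    private static void swap(long[] src, long[] dst, int[] attr, int i, int j) {
        long s = src[i]; src[i] = src[j]; src[j] = s;
        long d = dst[i]; dst[i] = dst[j]; dst[j] = d;
        int a = attr[i]; attr[i] = attr[j]; attr[j] = a;
    }

    // Sort the edges in place; the only extra memory is the recursion stack.
    public static void sort(long[] src, long[] dst, int[] attr) {
        if (src.length != dst.length || src.length != attr.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        sort(src, dst, attr, 0, src.length - 1);
    }

    // Quicksort with a Hoare-style partition around the middle element.
    private static void sort(long[] src, long[] dst, int[] attr, int lo, int hi) {
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            // Copy the pivot key, since the pivot edge may move during swaps.
            long pSrc = src[mid];
            long pDst = dst[mid];
            int i = lo;
            int j = hi;
            while (i <= j) {
                while (compare(src, dst, i, pSrc, pDst) < 0) i++;
                while (compare(src, dst, j, pSrc, pDst) > 0) j--;
                if (i <= j) {
                    swap(src, dst, attr, i, j);
                    i++;
                    j--;
                }
            }
            // Recurse into the smaller side and loop on the larger one,
            // which keeps the stack depth logarithmic in the edge count.
            if (j - lo < hi - i) {
                sort(src, dst, attr, lo, j);
                lo = i;
            } else {
                sort(src, dst, attr, i, hi);
                hi = j;
            }
        }
    }
}
```

Calling ParallelEdgeSort.sort(srcIds, dstIds, attrs) reorders all three arrays
together. A real implementation would also need to handle attribute types
other than Int, which this sketch ignores.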