[
https://issues.apache.org/jira/browse/SPARK-13460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159641#comment-15159641
]
Adnan Haider commented on SPARK-13460:
--------------------------------------
I am currently working on submitting a pull request for run-length encoding
local source ids.
> Applying Encoding methods to GraphX's Internal storage structure
> ----------------------------------------------------------------
>
> Key: SPARK-13460
> URL: https://issues.apache.org/jira/browse/SPARK-13460
> Project: Spark
> Issue Type: Improvement
> Components: GraphX
> Reporter: Adnan Haider
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Currently, the memory consumption of graphs is more than 2x the graph's
> on-disk size. For example, loading the Friendster dataset consumes 32.3 GB on
> disk but 85.1 GB after loading it into GraphX. Some of this memory
> consumption can be reduced by using [delta
> encoding|https://en.wikipedia.org/wiki/Delta_encoding] and [run-length
> encoding|https://en.wikipedia.org/wiki/Run-length_encoding].
> For example, we can use run-length encoding on the localSrcIds. Instead, of
> storing a local source id for each edge, we can store a map from each local
> source id to a tuple of offset, length pairs, which index into localDstIds:
> {code:title=EdgePartitionBuilder.scala}
> val index = new GraphXPrimitiveKeyOpenHashMap[Int, (Int, Int)]
> {code}
> This method saves about 17% of space for the [Friendster
> dataset|https://snap.stanford.edu/data/com-Friendster.html] or around 14 GB.
> It also saves 14% for the [orkut
> dataset|https://snap.stanford.edu/data/com-Orkut.html].
> In addition, we could use delta encoding on the localDstIds. After sorting
> the local destination ids for each source vertex and then apply the encoding,
> for Friendster half of the ids can be represented in a single byte.
>
> Also, currently GraphX stores multiple copies of global vertex ids for
> different mapping structures. Depending on the graph partitioning strategy
> and how the edge list file is stored on disk, the global vertex ids can be
> close to each other, meaning they are a good candidate for delta encoding.
> Collectively these features provide around 50% savings for the Friendster
> dataset. These savings come from a prototype version I have worked on. Some
> of these savings requires a great deal of modification to internal GraphX
> code, since algorithms need to be changed to limit the overhead of decoding
> data. For example, below is a snippet from the modified filter operation in
> EdgePartition.scala:
> {code:title=Filter}
> index.iterator.foreach { cluster =>
> val clusterLocalSrcId = cluster._1
> val dstOffset = cluster._2._1
> val dstLen = cluster._2._2
> // The user sees the EdgeTriplet, so we can't reuse it and must create
> one per edge.
> val localSrcId = clusterLocalSrcId
> var i = 0
> while(i < dstLen){
> val localDstId = localDstIds(i + dstOffset)
> {code}
> This can be a long term design change and will help address some recent
> [results|http://www.vldb.org/pvldb/vol8/p161-bu.pdf] that GraphX has trouble
> loading large datasets. I have a discussion here on the [dev
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/Using-Encoding-to-reduce-GraphX-s-static-graph-memory-consumption-tp16373.html]
> where we can discuss what encoding features are worth adding.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]