[ 
https://issues.apache.org/jira/browse/SPARK-13460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159641#comment-15159641
 ] 

Adnan Haider commented on SPARK-13460:
--------------------------------------

I am currently working on submitting a pull request for run-length encoding 
local source ids. 

> Applying Encoding methods to GraphX's Internal storage structure
> ----------------------------------------------------------------
>
>                 Key: SPARK-13460
>                 URL: https://issues.apache.org/jira/browse/SPARK-13460
>             Project: Spark
>          Issue Type: Improvement
>          Components: GraphX
>            Reporter: Adnan Haider
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Currently, the memory consumption of graphs is more than 2x the graph's 
> on-disk size. For example, loading the Friendster dataset consumes 32.3 GB on 
> disk but 85.1 GB after loading it into GraphX. Some of this memory 
> consumption can be reduced by using [delta 
> encoding|https://en.wikipedia.org/wiki/Delta_encoding] and [run-length 
> encoding|https://en.wikipedia.org/wiki/Run-length_encoding]. 
> For example, we can use run-length encoding on the localSrcIds. Instead, of 
> storing a local source id for each edge, we can store a map from each local 
> source id to a tuple of offset, length pairs, which index into localDstIds:
> {code:title=EdgePartitionBuilder.scala}
>     val index = new GraphXPrimitiveKeyOpenHashMap[Int, (Int, Int)]
> {code}
> This method saves about 17% of space for the [Friendster 
> dataset|https://snap.stanford.edu/data/com-Friendster.html] or around 14 GB. 
> It also saves 14% for the [orkut 
> dataset|https://snap.stanford.edu/data/com-Orkut.html].
> In addition, we could use delta encoding on the localDstIds. After sorting 
> the local destination ids for each source vertex and then apply the encoding, 
> for Friendster half of the ids can be represented in a single byte.
>  
> Also, currently GraphX stores multiple copies of global vertex ids for 
> different mapping structures. Depending on the graph partitioning strategy 
> and how the edge list file is stored on disk, the global vertex ids can be 
> close to each other, meaning they are a good candidate for delta encoding. 
> Collectively these features provide around 50% savings for the Friendster 
> dataset. These savings come from a prototype version I have worked on. Some 
> of these savings requires a great deal of modification to internal GraphX 
> code, since algorithms need to be changed to limit the overhead of decoding 
> data. For example, below is a snippet from the modified filter operation in 
> EdgePartition.scala:
> {code:title=Filter}
>     index.iterator.foreach { cluster =>
>        val clusterLocalSrcId = cluster._1
>        val dstOffset = cluster._2._1
>        val dstLen = cluster._2._2
>       // The user sees the EdgeTriplet, so we can't reuse it and must create 
> one per edge.
>       val localSrcId = clusterLocalSrcId
>       var i = 0
>       while(i < dstLen){
>         val localDstId = localDstIds(i + dstOffset)
> {code}
> This can be a long term design change and will help address some recent 
> [results|http://www.vldb.org/pvldb/vol8/p161-bu.pdf] that GraphX has trouble 
> loading large datasets.  I have a discussion here on the [dev 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/Using-Encoding-to-reduce-GraphX-s-static-graph-memory-consumption-tp16373.html]
>  where we can discuss what encoding features are worth adding. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to