Hi, I have been looking through the GraphX source code to understand why its memory consumption is so high compared to the on-disk size of the graph. I have found that there may be room to reduce the memory footprint of the graph structures. I think the biggest savings can come from the localSrcIds and localDstIds arrays in EdgePartitions.
In particular, instead of storing both a source and a destination local id for each edge, we could store only the destination id. For example, after sorting edges by global source id, we can assign local ids to the source vertices first, followed by any global destination ids that have not yet been mapped. This would make localSrcIds sorted, running from 0 to n - 1, where n is the number of distinct global source ids. Then, instead of actually storing the local source id for each edge, we can store an array of size n, with each element storing an index into localDstIds. From my understanding, this would also eliminate the need to store a separate index for indexed scanning, since each element in localSrcIds would be the start of a cluster.

In fairly extensive testing, this change, along with some delta encoding strategies on localDstIds and the mapping structures, reduced the memory consumption of the graph by nearly half. However, I am not entirely sure whether there is a reason for storing both localSrcIds and localDstIds for each edge, for example to support future functionality such as graph mutations. I noticed there was another post similar to this one as well, but it had no replies.

The idea is quite similar to the Netflix graph library <https://github.com/Netflix/netflix-graph>, and I would be happy to open a JIRA on this issue with partial improvements. But I may not be completely correct in my thinking!
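To make the proposed layout concrete, here is a minimal sketch in Python rather than Scala. It is not GraphX code; the names `build_compact_partition`, `src_offsets` and `out_edges` are hypothetical and used only for illustration. It assumes edges arrive as (global source id, global destination id) pairs, and it leaves out the delta encoding step.

```python
from typing import Dict, List, Tuple


def build_compact_partition(edges: List[Tuple[int, int]]):
    """Build a source-clustered layout from (global_src, global_dst) pairs.

    Hypothetical helper for illustration; not part of GraphX.
    """
    # Sort by global source id so each source's edges are contiguous.
    sorted_edges = sorted(edges)

    # Source vertices get local ids first: 0 .. n-1 in sorted order.
    global_to_local: Dict[int, int] = {}
    for src, _ in sorted_edges:
        if src not in global_to_local:
            global_to_local[src] = len(global_to_local)
    num_sources = len(global_to_local)

    # Vertices that appear only as destinations get the remaining ids.
    for _, dst in sorted_edges:
        if dst not in global_to_local:
            global_to_local[dst] = len(global_to_local)

    # src_offsets[i] is the index in local_dst_ids where the edges of
    # local source i begin, so no per-edge source id is stored.
    src_offsets: List[int] = []
    local_dst_ids: List[int] = []
    prev_src = None
    for src, dst in sorted_edges:
        if src != prev_src:
            src_offsets.append(len(local_dst_ids))
            prev_src = src
        local_dst_ids.append(global_to_local[dst])

    assert len(src_offsets) == num_sources
    return global_to_local, src_offsets, local_dst_ids


def out_edges(local_src: int, src_offsets: List[int],
              local_dst_ids: List[int]) -> List[int]:
    """Return the local destination ids of one local source vertex."""
    start = src_offsets[local_src]
    # The last cluster runs to the end of local_dst_ids.
    if local_src + 1 < len(src_offsets):
        end = src_offsets[local_src + 1]
    else:
        end = len(local_dst_ids)
    return local_dst_ids[start:end]


# Small usage example with four edges and three distinct sources.
edges = [(10, 20), (10, 30), (5, 10), (30, 5)]
mapping, src_offsets, local_dst_ids = build_compact_partition(edges)
print(src_offsets)                               # [0, 1, 3]
print(local_dst_ids)                             # [1, 3, 2, 0]
print(out_edges(1, src_offsets, local_dst_ids))  # [3, 2]
```

With this layout the per-edge cost is one destination id plus one offset per distinct source, and a lookup by source is a direct slice of the destination array rather than a scan.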