I’m not sure about 1.2 TB, but I can give it a shot. Is there some way to persist intermediate results to disk? Does the whole graph have to be in memory?
Alex

From: Ankur Dave [mailto:ankurd...@gmail.com]
Sent: Monday, May 26, 2014 12:23 AM
To: user@spark.apache.org
Subject: Re: GraphX partition problem

Once the graph is built, edges are stored in parallel primitive arrays, so each edge should only take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). Unfortunately, the current implementation in EdgePartitionBuilder uses an array of Edge objects as an intermediate representation for sorting, so each edge will additionally take something like 40 bytes during graph construction (srcId (8) + dstId (8) + attr (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). So you'll need about 1.2 TB of memory in total (60 bytes * 20 billion). Does your cluster have this much memory?

If not, I've been meaning to write a custom sort routine that can operate directly on the three parallel arrays. This will reduce the memory requirements to about 400 GB.

Ankur
http://www.ankurdave.com/
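A quick back-of-the-envelope sketch of the arithmetic above. The per-field sizes are the JVM object layout described in the email, and the 20-billion edge count is the figure from this thread; this is just a check of the totals, not a measurement:

```python
# Memory estimate for GraphX edge storage, per the figures in the email above.

EDGES = 20_000_000_000  # edge count discussed in this thread

# Steady state: edges live in three parallel primitive arrays.
steady_per_edge = 8 + 8 + 4  # srcId: Long (8) + dstId: Long (8) + attr: Int (4) = 20 bytes

# During construction, EdgePartitionBuilder also holds an array of Edge
# objects for sorting: fields + uncompressed pointer + object header + padding.
build_extra_per_edge = 8 + 8 + 4 + 8 + 8 + 4  # 40 bytes

peak_per_edge = steady_per_edge + build_extra_per_edge  # 60 bytes

print(f"steady state:      {EDGES * steady_per_edge / 1e12:.1f} TB")  # 0.4 TB
print(f"peak during build: {EDGES * peak_per_edge / 1e12:.1f} TB")    # 1.2 TB
```

This matches the two numbers in the email: roughly 400 GB once the graph is built (and with a sort that works in place on the parallel arrays), and roughly 1.2 TB at peak with the current Edge-object intermediate.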