I’m not sure about 1.2 TB, but I can give it a shot.

Is there some way to persist intermediate results to disk? Does the whole graph
have to be in memory?

Alex

From: Ankur Dave [mailto:ankurd...@gmail.com]
Sent: Monday, May 26, 2014 12:23 AM
To: user@spark.apache.org
Subject: Re: GraphX partition problem

Once the graph is built, edges are stored in parallel primitive arrays, so each
edge should take only 20 bytes to store (srcId: Long, dstId: Long, attr: Int).
Unfortunately, the current implementation in EdgePartitionBuilder uses an array
of Edge objects as an intermediate representation for sorting, so each edge
will additionally take something like 40 bytes during graph construction (srcId
(8) + dstId (8) + attr (4) + uncompressed pointer (8) + object overhead (8) +
padding (4)). So you'll need about 1.2 TB of memory in total (60 bytes * 20
billion edges). Does your cluster have this much memory?
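For concreteness, here's roughly what the two layouts and the arithmetic look
like. This is a simplified Scala sketch of my own; the class names are made up
and this is not GraphX's actual EdgePartition / EdgePartitionBuilder code:

  // Steady state: three parallel primitive arrays, ~20 bytes per edge.
  class CompactEdges(n: Int) {
    val srcIds = new Array[Long](n) // 8 bytes per edge
    val dstIds = new Array[Long](n) // 8 bytes per edge
    val attrs  = new Array[Int](n)  // 4 bytes per edge
  }

  // During construction: one boxed object per edge for sorting, which
  // adds the fields again plus pointer, object header, and padding.
  case class BoxedEdge(srcId: Long, dstId: Long, attr: Int)

  object MemoryEstimate extends App {
    val numEdges   = 20e9
    val stored     = 20.0 // srcId (8) + dstId (8) + attr (4)
    val buildExtra = 40.0 // fields (20) + pointer (8) + header (8) + padding (4)
    println(f"after build:  ${numEdges * stored / 1e12}%.1f TB")                // 0.4 TB
    println(f"during build: ${numEdges * (stored + buildExtra) / 1e12}%.1f TB") // 1.2 TB
  }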

If not, I've been meaning to write a custom sort routine that can operate
directly on the three parallel arrays. This would reduce the memory
requirements during construction to about 400 GB (20 bytes * 20 billion edges).
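
The idea would be roughly the following: compare and swap directly in the
primitive arrays, so no per-edge objects are ever allocated. A hypothetical
sketch of my own (not code that exists in GraphX), sorting edges in place by
(srcId, dstId):

  object EdgeSort {
    def sort(src: Array[Long], dst: Array[Long], attr: Array[Int]): Unit =
      quicksort(src, dst, attr, 0, src.length - 1)

    private def quicksort(src: Array[Long], dst: Array[Long], attr: Array[Int],
                          lo: Int, hi: Int): Unit = {
      if (lo >= hi) return
      // Capture the pivot *values*, since swaps move elements around.
      val mid = (lo + hi) >>> 1
      val pSrc = src(mid); val pDst = dst(mid)
      var i = lo; var j = hi
      while (i <= j) {
        while (less(src(i), dst(i), pSrc, pDst)) i += 1
        while (less(pSrc, pDst, src(j), dst(j))) j -= 1
        if (i <= j) { swap(src, dst, attr, i, j); i += 1; j -= 1 }
      }
      quicksort(src, dst, attr, lo, j)
      quicksort(src, dst, attr, i, hi)
    }

    // Lexicographic comparison on (srcId, dstId), no boxing.
    private def less(s1: Long, d1: Long, s2: Long, d2: Long): Boolean =
      s1 < s2 || (s1 == s2 && d1 < d2)

    // Swap edge i and edge j across all three parallel arrays.
    private def swap(src: Array[Long], dst: Array[Long], attr: Array[Int],
                     i: Int, j: Int): Unit = {
      val s = src(i); src(i) = src(j); src(j) = s
      val d = dst(i); dst(i) = dst(j); dst(j) = d
      val a = attr(i); attr(i) = attr(j); attr(j) = a
    }
  }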

Ankur <http://www.ankurdave.com/>
