Steven Ruppert created SPARK-19098:
--------------------------------------

             Summary: Shuffled data leak/size doubling in ConnectedComponents/Pregel iterations
                 Key: SPARK-19098
                 URL: https://issues.apache.org/jira/browse/SPARK-19098
             Project: Spark
          Issue Type: Bug
          Components: GraphX
    Affects Versions: 2.1.0
         Environment: Linux x64
Cloudera CDH 5.8.0 Hadoop (roughly Hadoop 2.7.0)
Spark on YARN, dynamic allocation with shuffle service
Input/output data on HDFS
Kryo serialization turned on
Checkpointing directory set on HDFS
            Reporter: Steven Ruppert
            Priority: Critical


I'm seeing a strange memory-leak-but-not-really problem in a fairly vanilla 
use of ConnectedComponents. Notably, identical code works fine on Spark 2.0.1 
but not on 2.1.0.

Unfortunately, I haven't narrowed this down to a test case yet, nor do I have 
access to the original logs, so this initial report will be a little vague. 
However, the behavior as described might ring a bell for somebody.

Roughly: 

{noformat}
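import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.graphx.lib.ConnectedComponents
import org.apache.spark.rdd.RDD

val edges: RDD[Edge[Int]] = ??? // loaded from file
val vertices: RDD[(VertexId, Int)] = ??? // loaded from file
val graph = Graph(vertices, edges)

// Each vertex ends up labelled with the lowest VertexId in its component.
val components: RDD[(VertexId, VertexId)] = ConnectedComponents
  .run(graph, 10) // at most 10 Pregel iterations
  .vertices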
{noformat}

Running this against my input of ~5B edges and ~3B vertices leads to shuffle 
traffic doubling in each round of Pregel (inside ConnectedComponents). It grows 
from the actual data size of ~50 GB to 100 GB, then 200 GB, and so on, all the 
way to around 40 TB before I killed the job. The data being shuffled was 
apparently an RDD of ShippableVertexPartition.
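
As a back-of-the-envelope check on that growth, here is a minimal Scala sketch. It assumes the shuffle volume exactly doubles every round starting from ~50 GB, which is an idealization, since the real per-round figures were not recorded. The names ShuffleGrowth and sizeAfter are made up for illustration. Under that assumption, nine to ten doublings land between 25,600 GB and 51,200 GB, the same order of magnitude as the ~40 TB observed.

```scala
// Idealized model: shuffle volume starts at ~50 GB and doubles each Pregel round.
// ShuffleGrowth and sizeAfter are hypothetical names used only for this sketch.
object ShuffleGrowth {
  /** Shuffle size in GB after `round` doublings of `initialGb`. */
  def sizeAfter(initialGb: Double, round: Int): Double =
    initialGb * math.pow(2.0, round)

  def main(args: Array[String]): Unit = {
    val initialGb = 50.0 // approximate real data size from the report
    (0 to 10).foreach { round =>
      println(f"round $round%2d: ${sizeAfter(initialGb, round)}%.0f GB")
    }
  }
}
```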

Oddly enough, only the Kryo-serialized shuffle data doubled in size. The heap 
usage of the executors themselves remained stable, or at least did not grow 
one-to-one with the 40 TB of shuffled data, since I definitely do not have 
40 TB of RAM. Furthermore, I still have Kryo reference tracking turned on, so 
whatever is leaking somehow gets around that.
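
For anyone trying to reproduce this, the setup described in the Environment field and above corresponds roughly to the sketch below. The property names are standard Spark settings; the object name, app name, and checkpoint path are hypothetical placeholders, and the actual job may have set other options not shown here.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Rough sketch of the reported setup, not the actual job configuration.
object ReproSetup {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("connected-components-repro") // hypothetical name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.referenceTracking", "true") // Spark's default, left on
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true") // needed for dynamic allocation

    // The master (YARN) is expected to be supplied via spark-submit.
    val sc = new SparkContext(conf)
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints") // hypothetical HDFS path

    sc.stop()
  }
}
```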

I'll update this ticket once I have more details, unless somebody else with the 
same problem reports back first.