Steven Ruppert created SPARK-19098:
--------------------------------------
Summary: Shuffled data leak/size doubling in
ConnectedComponents/Pregel iterations
Key: SPARK-19098
URL: https://issues.apache.org/jira/browse/SPARK-19098
Project: Spark
Issue Type: Bug
Components: GraphX
Affects Versions: 2.1.0
Environment: Linux x64
Cloudera CDH 5.8.0 hadoop (roughly hadoop 2.7.0)
Spark on YARN, dynamic allocation with shuffle service
Input/Output data on HDFS
kryo serialization turned on
checkpointing directory set on HDFS
Reporter: Steven Ruppert
Priority: Critical
I'm seeing a strange memory-leak-but-not-really problem in a pretty vanilla
use of ConnectedComponents. Notably, identical code works fine on Spark 2.0.1
but not on 2.1.0.
I unfortunately haven't narrowed this down to a test case yet, nor do I have
access to the original logs, so this initial report will be a little vague.
However, this behavior as described might ring a bell to somebody.
Roughly:
{noformat}
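import org.apache.spark.graphx._
import org.apache.spark.graphx.lib.ConnectedComponents
import org.apache.spark.rdd.RDD

val edges: RDD[Edge[Int]] = ??? // loaded from file
val vertices: RDD[(VertexId, Int)] = ??? // loaded from file
val graph = Graph(vertices, edges)
// Each vertex ends up labeled with the lowest VertexId in its component.
val components: RDD[(VertexId, VertexId)] = ConnectedComponents
  .run(graph, 10) // at most 10 Pregel iterations
  .vertices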
{noformat}
Running this against my input of ~5B edges and ~3B vertices leads to a strange
doubling of shuffle traffic in each round of Pregel (inside
ConnectedComponents), increasing from the actual data size of ~50 GB to 100 GB,
to 200 GB, all the way to around 40 TB before I killed the job. The data being
shuffled was apparently an RDD of ShippableVertexPartition.
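As a rough sanity check on those numbers, here is a small sketch of the doubling arithmetic. It assumes a clean factor of two per round and decimal units (1 TB = 1000 GB); the object and function names are purely illustrative and not part of Spark. Under those assumptions, ~40 TB falls between 9 and 10 doublings of 50 GB, which is in the same range as the 10 iterations passed to run.

```scala
object ShuffleDoubling {
  // Shuffle size in GB after `round` doublings of the initial size.
  def sizeAfter(initialGb: Double, round: Int): Double =
    initialGb * math.pow(2.0, round)

  // Smallest number of doublings needed to reach at least targetGb.
  def roundsToReach(initialGb: Double, targetGb: Double): Int =
    math.ceil(math.log(targetGb / initialGb) / math.log(2.0)).toInt

  def main(args: Array[String]): Unit = {
    // Print the hypothetical size for each of the first 10 rounds.
    (0 to 10).foreach { r =>
      println(f"round $r%2d: ${sizeAfter(50.0, r)}%.0f GB")
    }
    // Doublings needed to go from ~50 GB to ~40 TB (40000 GB).
    println(roundsToReach(50.0, 40000.0))
  }
}
```

Nine doublings give 25600 GB and ten give 51200 GB, so the observed ~40 TB is consistent with the job having been killed partway through a late round.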
Oddly enough, only the Kryo-serialized shuffled data doubled in size. The heap
usage of the executors themselves remained stable, or at least did not grow
one-to-one with the 40 TB of shuffled data, since I definitely do not have
40 TB of RAM. Furthermore, I still have Kryo reference tracking turned on, so
whatever is leaking somehow gets around that.
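For anyone trying to reproduce this, the setup described in the Environment field corresponds roughly to the standard Spark properties below. This is a sketch reconstructed from that description, not a copy of the actual job configuration; the object name is illustrative, and any other settings the job used are unknown.

```scala
object ReportedConf {
  // Standard Spark property names; values inferred from the Environment field.
  val settings: Map[String, String] = Map(
    "spark.master" -> "yarn",
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    // true is also Spark's default for this property.
    "spark.kryo.referenceTracking" -> "true",
    "spark.dynamicAllocation.enabled" -> "true",
    // Dynamic allocation on YARN relies on the external shuffle service.
    "spark.shuffle.service.enabled" -> "true"
  )

  def main(args: Array[String]): Unit =
    settings.toSeq.sortBy(_._1).foreach { case (k, v) => println(s"$k=$v") }
}
```

These could be applied with `new SparkConf().setAll(ReportedConf.settings)`. The checkpoint directory is not a property in this list; it would be set separately with `sc.setCheckpointDir(...)` on an HDFS path.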
I'll update this ticket once I have more details, unless somebody else with the
same problem reports back first.