[
https://issues.apache.org/jira/browse/SPARK-11161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961501#comment-14961501
]
Ryan Williams commented on SPARK-11161:
---------------------------------------
Digging into the code a bit, I determined that this likely has to do with the
timing of driver GC cycles, and is specific to RDDs that I am not retaining a
reference to in my {{spark-shell}}.
Is it intended that cached RDDs be unpersisted when user code no longer retains
a reference to them? That appears to be a necessary condition for what I'm
observing.
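To make the distinction concrete, here is a minimal sketch of the two cases as
I understand them. It is meant to be pasted into a {{spark-shell}}, where
{{sc}} is the context the shell provides. The {{retained}} val and the
"foo-retained" name are hypothetical and exist only for illustration, and the
comments reflect my reading of the {{ContextCleaner}} log lines in the report,
not confirmed behavior:
{code}
// Case 1: no reference retained. Only the Long returned by count is bound
// (to a resN variable), so the RDD object itself becomes unreachable on the
// driver once the job finishes. A later driver GC can then let
// ContextCleaner clean it up, which would unpersist the cached blocks.
sc.parallelize(1 to 100000000, 100).map(_ % 100 -> 1).reduceByKey(_ + _, 100).setName("foo").cache.count

// Case 2: reference retained. Binding the RDD to a val keeps it strongly
// reachable from the shell, so a driver GC should not collect it.
val retained = sc.parallelize(1 to 100000000, 100).map(_ % 100 -> 1).reduceByKey(_ + _, 100).setName("foo-retained").cache
retained.count

// Assumption: requesting a GC on the driver may reproduce the cleanup of the
// first RDD without loading the web UI. System.gc() is only a request, and
// the JVM is free to ignore it.
System.gc()
{code}
If this reading is right, I would expect only "foo" to disappear from the
"Storage" tab after a driver GC. Launching the shell with
{{--conf spark.cleaner.referenceTracking=false}} should, as far as I can tell,
disable this cleanup entirely, which might help confirm or rule out the theory.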
> Viewing the web UI for the first time unpersists a cached RDD
> -------------------------------------------------------------
>
> Key: SPARK-11161
> URL: https://issues.apache.org/jira/browse/SPARK-11161
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, Web UI
> Affects Versions: 1.5.1
> Reporter: Ryan Williams
> Priority: Minor
>
> This one is a real head-scratcher. [Here's a
> screencast|http://f.cl.ly/items/0P0N413t1V3j2B0A3V1a/Screen%20Recording%202015-10-16%20at%2005.43%20PM.gif]:
> !http://f.cl.ly/items/0P0N413t1V3j2B0A3V1a/Screen%20Recording%202015-10-16%20at%2005.43%20PM.gif!
> The three windows, left-to-right, are:
> * a {{spark-shell}} on YARN with dynamic allocation enabled, at rest with one
> executor. [Here's an example app's
> environment|https://gist.github.com/ryan-williams/6dd3502d5d0de2f030ac].
> * [Spree|https://github.com/hammerlab/spree], opened to the above app's
> "Storage" tab.
> * my YARN resource manager, showing a link to the web UI running on the
> driver.
> At the start, nothing has been run in the shell, and I've not visited the web
> UI.
> I run a simple job in the shell and cache a small RDD that it computes:
> {code}
> sc.parallelize(1 to 100000000, 100).map(_ % 100 -> 1).reduceByKey(_ + _, 100).setName("foo").cache.count
> {code}
> As the second stage runs, you can see the partitions show up as cached in
> Spree.
> After the job finishes, a few requested executors continue to fill in, which
> you can see in the console at left or the nav bar of Spree in the middle.
> Once that has finished, everything is at rest with the RDD "foo" 100% cached.
> Then, I click the YARN RM's "ApplicationMaster" link which loads the web UI
> on the driver for the first time.
> Immediately, the console prints some activity, including that RDD 2 has been
> removed:
> {code}
> 15/10/16 21:43:12 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on 172.29.46.15:33156 in memory (size: 1517.0 B, free: 7.2 GB)
> 15/10/16 21:43:12 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on demeter-csmaz10-17.demeter.hpc.mssm.edu:56997 in memory (size: 1517.0 B, free: 12.2 GB)
> 15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned accumulator 2
> 15/10/16 21:43:13 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on 172.29.46.15:33156 in memory (size: 1666.0 B, free: 7.2 GB)
> 15/10/16 21:43:13 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on demeter-csmaz10-17.demeter.hpc.mssm.edu:56997 in memory (size: 1666.0 B, free: 12.2 GB)
> 15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned accumulator 1
> 15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned shuffle 0
> 15/10/16 21:43:13 INFO storage.BlockManager: Removing RDD 2
> 15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned RDD 2
> {code}
> Accordingly, Spree shows that the RDD has been unpersisted, and I can see in
> the event log (not pictured in the screencast) that an Unpersist event has
> made its way through the various SparkListeners:
> {code}
> {"Event":"SparkListenerUnpersistRDD","RDD ID":2}
> {code}
> Simply loading the web UI causes an RDD unpersist event to fire!
> I can't nail down exactly what's causing this, and I've seen evidence that
> there are other sequences of events that can also cause it:
> * I've reproduced the above steps ~20 times. The RDD is always unpersisted
> when I don't visit the web UI until after the RDD is cached and the app is
> dynamically allocating executors.
> * Once, I observed the unpersist fire without my visiting the web UI at all.
> In other runs I waited a long time before visiting the web UI, to make it
> clear that loading the web UI is what triggers the unpersist, and it always
> did. Apparently there is also another, seemingly rare, way for the unpersist
> to happen without visiting the web UI.
> * I tried a couple of times without dynamic allocation and could not
> reproduce it.
> * I've tried a couple of times with dynamic allocation but with a minimum
> number of executors higher than 1, and have been unable to reproduce it.