[ 
https://issues.apache.org/jira/browse/SPARK-11161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963556#comment-14963556
 ] 

Ryan Williams commented on SPARK-11161:
---------------------------------------

bq. I expect RDDs to behave like JVM objects in this regard. I would not expect 
something to be hanging on to references to all my objects …

I understand this perspective; on the other hand, the JVM (afaik) doesn't 
expose an API analogous to {{RDD.cache}} that says "explicitly keep this object 
around / in memory". Before looking at how this part of the code is actually 
implemented, I took {{cache}} to be a heavier-weight operation that would, at 
the very least, mark an RDD as not subject to normal GC.
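
To be concrete about the model I had in mind, here's a rough spark-shell sketch 
(purely illustrative; the names and numbers don't matter):

{code}
// Sketch of the mental model I had (illustrative only): I assumed that after
// cache(), an RDD's blocks stay in memory until I explicitly unpersist it,
// regardless of whether my driver code still holds a reference to it.
val foo = sc.parallelize(1 to 1000, 10).map(_ * 2).setName("foo").cache()
foo.count()       // materializes and caches the partitions

// ...much later, even if the `foo` reference has gone out of scope, I didn't
// expect the cached blocks to be dropped until an explicit:
foo.unpersist()
{code}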

It's interesting to me that you seem to imply that you sometimes rely on cached 
RDDs being {{unpersist}}'ed as a side effect of being GC'd, rather than 
manually {{unpersist}}'ing them yourself (cf. "the GC that I want"). It just 
feels to me that if we expose a separate {{persist}}/{{unpersist}} API for 
RDDs, then also relying on nondeterministically-timed GC to trump manual 
management via those APIs violates the principle of least surprise.

I guess my main question here is: how intentional/desirable is it that users 
"leaking" {{persist}}'ed RDDs (by not {{unpersist}}'ing them) get rescued, 
eventually, by GC?
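
For concreteness, here's my rough understanding of the mechanism being 
described, as a simplified sketch of the weak-reference pattern (this is not 
the actual ContextCleaner code, just the idea):

{code}
import java.lang.ref.{ReferenceQueue, WeakReference}

// Simplified sketch (not Spark's actual implementation): the driver tracks
// persisted RDDs only via *weak* references, so once user code drops its last
// strong reference, GC enqueues the weak ref and a cleaner thread reacts by
// dropping the RDD's cached blocks.
class CleanupTask(val rddId: Int)

val refQueue = new ReferenceQueue[CleanupTask]
val tracked  = new java.util.concurrent.ConcurrentHashMap[WeakReference[CleanupTask], Int]

def register(task: CleanupTask): Unit = {
  tracked.put(new WeakReference(task, refQueue), task.rddId)
}

// Cleaner loop body: blocks until GC collects a registered task, then "cleans".
def cleanOnce(): Unit = {
  val ref   = refQueue.remove().asInstanceOf[WeakReference[CleanupTask]]
  val rddId = tracked.remove(ref)
  println(s"Cleaned RDD $rddId")  // a real cleaner would drop the cached blocks here
}
{code}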

bq. You can't unpersist RDDs from the web UI, though that would make sense as a 
feature. That's something different.

Sure, but it's closely related to what we're talking about: wouldn't we need 
the driver to keep references to cached RDDs around to enable this? Would that 
prevent "the GC that you want"? Would that trade-off be worth it for such a 
feature?

Anyway, I can file the RDD-unpersist-via-web-UI idea as a separate JIRA if you 
prefer.
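
For what it's worth, I imagine the driver side of such a feature would only 
need something like the following (hypothetical handler, sketched against the 
existing {{SparkContext.getPersistentRDDs}} map; none of the UI wiring exists 
today):

{code}
// Hypothetical "unpersist from the Storage tab" handler (sketch only).
def handleUnpersistRequest(sc: org.apache.spark.SparkContext, rddId: Int): Boolean =
  sc.getPersistentRDDs.get(rddId) match {
    case Some(rdd) =>
      rdd.unpersist(blocking = false)  // drop cached blocks asynchronously
      true
    case None =>
      false                            // unknown or already-unpersisted RDD id
  }
{code}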

> Viewing the web UI for the first time unpersists a cached RDD
> -------------------------------------------------------------
>
>                 Key: SPARK-11161
>                 URL: https://issues.apache.org/jira/browse/SPARK-11161
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Web UI
>    Affects Versions: 1.5.1
>            Reporter: Ryan Williams
>            Priority: Minor
>
> This one is a real head-scratcher. [Here's a 
> screencast|http://f.cl.ly/items/0P0N413t1V3j2B0A3V1a/Screen%20Recording%202015-10-16%20at%2005.43%20PM.gif]:
> !http://f.cl.ly/items/0P0N413t1V3j2B0A3V1a/Screen%20Recording%202015-10-16%20at%2005.43%20PM.gif!
> The three windows, left-to-right, are: 
> * a {{spark-shell}} on YARN with dynamic allocation enabled, at rest with one 
> executor. [Here's an example app's 
> environment|https://gist.github.com/ryan-williams/6dd3502d5d0de2f030ac].
> * [Spree|https://github.com/hammerlab/spree], opened to the above app's 
> "Storage" tab.
> * my YARN resource manager, showing a link to the web UI running on the 
> driver.
> At the start, nothing has been run in the shell, and I've not visited the web 
> UI.
> I run a simple job in the shell and cache a small RDD that it computes:
> {code}
> sc.parallelize(1 to 100000000, 100).map(_ % 100 -> 1).reduceByKey(_+_, 100).setName("foo").cache.count
> {code}
> As the second stage runs, you can see the partitions show up as cached in 
> Spree.
> After the job finishes, a few requested executors continue to fill in, which 
> you can see in the console at left or the nav bar of Spree in the middle.
> Once that has finished, everything is at rest with the RDD "foo" 100% cached.
> Then, I click the YARN RM's "ApplicationMaster" link which loads the web UI 
> on the driver for the first time.
> Immediately, the console prints some activity, including that RDD 2 has been 
> removed:
> {code}
> 15/10/16 21:43:12 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on 172.29.46.15:33156 in memory (size: 1517.0 B, free: 7.2 GB)
> 15/10/16 21:43:12 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on demeter-csmaz10-17.demeter.hpc.mssm.edu:56997 in memory (size: 1517.0 B, free: 12.2 GB)
> 15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned accumulator 2
> 15/10/16 21:43:13 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on 172.29.46.15:33156 in memory (size: 1666.0 B, free: 7.2 GB)
> 15/10/16 21:43:13 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on demeter-csmaz10-17.demeter.hpc.mssm.edu:56997 in memory (size: 1666.0 B, free: 12.2 GB)
> 15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned accumulator 1
> 15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned shuffle 0
> 15/10/16 21:43:13 INFO storage.BlockManager: Removing RDD 2
> 15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned RDD 2
> {code}
> Accordingly, Spree shows that the RDD has been unpersisted, and I can see in 
> the event log (not pictured in the screencast) that an Unpersist event has 
> made its way through the various SparkListeners:
> {code}
> {"Event":"SparkListenerUnpersistRDD","RDD ID":2}
> {code}
> Simply loading the web UI causes an RDD unpersist event to fire!
> I can't nail down exactly what's causing this, and I've seen evidence that 
> there are other sequences of events that can also cause it:
> * I've repro'd the above steps ~20 times. The RDD is always unpersisted when 
> I don't visit the web UI until after the RDD is cached and the app is 
> dynamically allocating executors.
> * One time, I observed the unpersist fire without my visiting the web UI at 
> all. Other times I waited a long time before visiting the web UI, so it's 
> clear that loading the web UI is causal (and it always is), but apparently 
> there's another, seemingly rare, way for the unpersist to happen without 
> visiting the web UI.
> * I tried a couple of times without dynamic allocation and could not 
> reproduce it.
> * I've tried a couple of times with dynamic allocation and a minimum number 
> of executors higher than 1, and have been unable to reproduce it.


