[
https://issues.apache.org/jira/browse/SPARK-16921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17925432#comment-17925432
]
Nicholas Chammas commented on SPARK-16921:
------------------------------------------
I'd like to revisit this feature idea and contribute a solution.
[~holden] and [~gurwls223] (since you both reviewed the associated PR): Do you
still like the solution [proposed
here|https://github.com/apache/spark/pull/14579#discussion_r74811199]?
> RDD/DataFrame persist() and cache() should return Python context managers
> -------------------------------------------------------------------------
>
> Key: SPARK-16921
> URL: https://issues.apache.org/jira/browse/SPARK-16921
> Project: Spark
> Issue Type: New Feature
> Components: PySpark, Spark Core, SQL
> Reporter: Nicholas Chammas
> Priority: Minor
> Labels: bulk-closed
>
> [Context
> managers|https://docs.python.org/3/reference/datamodel.html#context-managers]
> are a natural way to capture closely related setup and teardown code in
> Python.
> For example, they are commonly used when doing file I/O:
> {code}
> with open('/path/to/file') as f:
>     contents = f.read()
> ...
> {code}
> Once the program exits the {{with}} block, {{f}} is automatically closed.
> I think it makes sense to apply this pattern to persisting and unpersisting
> DataFrames and RDDs. There are many cases when you want to persist a
> DataFrame for a specific set of operations and then unpersist it immediately
> afterwards.
> For example, take model training. Today, you might do something like this:
> {code}
> labeled_data.persist()
> model = pipeline.fit(labeled_data)
> labeled_data.unpersist()
> {code}
> If {{persist()}} returned a context manager, you could rewrite this as
> follows:
> {code}
> with labeled_data.persist():
>     model = pipeline.fit(labeled_data)
> {code}
> Upon exiting the {{with}} block, {{labeled_data}} would automatically be
> unpersisted.
> This can be done in a backwards-compatible way: {{persist()}} would
> still return the parent DataFrame or RDD as it does today, and that object
> would gain two methods, {{\_\_enter\_\_()}} and {{\_\_exit\_\_()}}.
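As a rough illustration of the mechanism described in the quoted issue, here is a minimal sketch using a toy class. {{ToyDataFrame}} and its {{is_cached}} flag are hypothetical stand-ins invented for this example; this is not PySpark's actual implementation, only the general shape of "return self from persist(), unpersist in \_\_exit\_\_()".

```python
class ToyDataFrame:
    """Hypothetical stand-in for a DataFrame/RDD; not a real PySpark class."""

    def __init__(self):
        self.is_cached = False

    def persist(self):
        self.is_cached = True
        # Return the same object, as persist() does today, so existing
        # call sites that ignore or chain the result keep working.
        return self

    def unpersist(self):
        self.is_cached = False
        return self

    def __enter__(self):
        # The object is already persisted by the time `with` calls this.
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Unpersist on the way out, whether or not the block raised.
        self.unpersist()
        return False  # do not suppress exceptions from the block


labeled_data = ToyDataFrame()

with labeled_data.persist() as cached:
    cached_inside_block = cached.is_cached

cached_after_block = labeled_data.is_cached
```

Because {{\_\_exit\_\_()}} also runs when the block raises, this shape would unpersist even if the body fails partway through, which the explicit {{persist()}} / {{unpersist()}} pair only does if wrapped in {{try}} / {{finally}}.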