Good point.

Do you think it's sufficient to note this somewhere in the documentation
(or simply assume that users who understand transformations vs. actions
will already know this), or are there other implications that need to be
considered?
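
To make the concern concrete, here is a minimal sketch of the pitfall. It
does not use PySpark: FakeFrame is a hypothetical stand-in I made up to
mimic lazy transformations, actions, and a persist() that doubles as a
context manager, so treat it as an illustration of the behavior under
discussion rather than how Spark would implement it.

```python
class FakeFrame:
    """Hypothetical stand-in for a DataFrame/RDD (not a PySpark class)."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument callable that yields rows
        self._cache = None
        self.persisted = False
        self.computations = 0     # how many times the rows were recomputed

    def persist(self):
        # Only marks the data for caching; the cache fills on the first action.
        self.persisted = True
        return self

    def unpersist(self):
        self.persisted = False
        self._cache = None
        return self

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.unpersist()
        return False              # do not swallow exceptions

    def _materialize(self):
        if self.persisted and self._cache is not None:
            return self._cache
        self.computations += 1
        rows = list(self._compute())
        if self.persisted:
            self._cache = rows
        return rows

    def map(self, fn):
        # Transformation: lazy, nothing is computed here.
        return FakeFrame(lambda: [fn(x) for x in self._materialize()])

    def collect(self):
        # Action: forces the computation.
        return self._materialize()


# Case 1: the actions run inside the with block, so the cache is used.
base = FakeFrame(lambda: [1, 2, 3])
with base.persist():
    doubled = base.map(lambda x: x * 2)
    first = doubled.collect()
    second = doubled.collect()
inside = base.computations

# Case 2: only the transformation is inside the block. The actions run
# after unpersist, so the base data is recomputed on every action.
base2 = FakeFrame(lambda: [1, 2, 3])
with base2.persist():
    doubled2 = base2.map(lambda x: x * 2)
doubled2.collect()
doubled2.collect()
outside = base2.computations
```

In the first case the base data is computed once; in the second it is
computed twice, because nothing was ever cached before the block exited.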

On Fri, Aug 5, 2016 at 6:50 PM Koert Kuipers <ko...@tresata.com> wrote:

> The tricky part is that the action needs to be inside the with block, not
> just the transformation that uses the persisted data.
>
> On Aug 5, 2016 1:44 PM, "Nicholas Chammas" <nicholas.cham...@gmail.com>
> wrote:
>
> Okie doke, I've filed a JIRA for this here:
> https://issues.apache.org/jira/browse/SPARK-16921
>
> On Fri, Aug 5, 2016 at 2:08 AM Reynold Xin <r...@databricks.com> wrote:
>
>> Sounds like a great idea!
>>
>> On Friday, August 5, 2016, Nicholas Chammas <nicholas.cham...@gmail.com>
>> wrote:
>>
>>> Context managers
>>> <https://docs.python.org/3/reference/datamodel.html#context-managers>
>>> are a natural way to capture closely related setup and teardown code in
>>> Python.
>>>
>>> For example, they are commonly used when doing file I/O:
>>>
>>> with open('/path/to/file') as f:
>>>     contents = f.read()
>>>     ...
>>>
>>> Once the program exits the with block, f is automatically closed.
>>>
>>> Does it make sense to apply this pattern to persisting and unpersisting
>>> DataFrames and RDDs? I feel like there are many cases when you want to
>>> persist a DataFrame for a specific set of operations and then unpersist it
>>> immediately afterwards.
>>>
>>> For example, take model training. Today, you might do something like
>>> this:
>>>
>>> labeled_data.persist()
>>> model = pipeline.fit(labeled_data)
>>> labeled_data.unpersist()
>>>
>>> If persist() returned a context manager, you could rewrite this as
>>> follows:
>>>
>>> with labeled_data.persist():
>>>     model = pipeline.fit(labeled_data)
>>>
>>> Upon exiting the with block, labeled_data would automatically be
>>> unpersisted.
>>>
>>> This can be done in a backwards-compatible way, since persist() would
>>> still return the parent DataFrame or RDD as it does today, but would add
>>> two methods to the object: __enter__() and __exit__().
>>>
>>> Does this make sense? Is it attractive?
>>>
>>> Nick
>>>
>>
>
