Sounds like a great idea! On Friday, August 5, 2016, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> Context managers
> <https://docs.python.org/3/reference/datamodel.html#context-managers> are
> a natural way to capture closely related setup and teardown code in Python.
>
> For example, they are commonly used when doing file I/O:
>
>     with open('/path/to/file') as f:
>         contents = f.read()
>         ...
>
> Once the program exits the with block, f is automatically closed.
>
> Does it make sense to apply this pattern to persisting and unpersisting
> DataFrames and RDDs? I feel like there are many cases when you want to
> persist a DataFrame for a specific set of operations and then unpersist it
> immediately afterwards.
>
> For example, take model training. Today, you might do something like this:
>
>     labeled_data.persist()
>     model = pipeline.fit(labeled_data)
>     labeled_data.unpersist()
>
> If persist() returned a context manager, you could rewrite this as
> follows:
>
>     with labeled_data.persist():
>         model = pipeline.fit(labeled_data)
>
> Upon exiting the with block, labeled_data would automatically be
> unpersisted.
>
> This can be done in a backwards-compatible way since persist() would
> still return the parent DataFrame or RDD as it does today, but add two
> methods to the object: __enter__() and __exit__().
>
> Does this make sense? Is it attractive?
>
> Nick
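
For anyone curious what the change would amount to, here is a minimal sketch of the __enter__()/__exit__() additions. It uses a stub class in place of a real DataFrame/RDD (the stub and its persisted flag are illustrative, not PySpark API); on the real classes, only the two dunder methods would be new:

```python
class PersistableStub:
    """Stand-in for a DataFrame/RDD; tracks persisted state only."""

    def __init__(self):
        self.persisted = False

    def persist(self):
        self.persisted = True
        return self  # backwards-compatible: still returns the object itself

    def unpersist(self):
        self.persisted = False
        return self

    # The two methods the proposal would add:
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.unpersist()
        return False  # don't suppress exceptions raised in the block


labeled_data = PersistableStub()
with labeled_data.persist():
    assert labeled_data.persisted      # persisted inside the block
assert not labeled_data.persisted      # automatically unpersisted on exit
```

Because __exit__() runs even when the block raises, this would also unpersist reliably in the failure path, which the manual persist()/unpersist() pattern does not guarantee.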