[
https://issues.apache.org/jira/browse/SPARK-38353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yihong He updated SPARK-38353:
------------------------------
Description:
For example, for the following code:
{code:python}
pdf = pd.DataFrame(
    [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
)
psdf = ps.from_pandas(pdf)
with psdf.spark.cache() as cached_df:
    self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
    self.assert_eq(
        repr(cached_df.spark.storage_level),
        repr(StorageLevel(True, True, False, True)),
    )
{code}
The pandas usage logger records the internal call
[self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
because the __enter__ and __exit__ methods of
[CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
are not instrumented.
Instrumenting the __enter__ and __exit__ magic methods for the pandas module would
therefore improve the accuracy of the usage data.
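
The effect can be reproduced without Spark. The following is a minimal sketch, not the actual pyspark.pandas usage-logging code: the instrument decorator, the recorded list, and the PlainCached and InstrumentedCached classes are all hypothetical names. It assumes a logger that records a call only when that call is not nested inside another instrumented call.

```python
import functools
import threading

# Hypothetical stand-ins for a usage logger; none of these names come from PySpark.
_local = threading.local()
recorded = []  # qualified names of the calls the logger attributes to the user


def instrument(func):
    """Record a call unless it runs inside another instrumented call."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if getattr(_local, "in_call", False):
            # Nested call: treated as internal, so it is not recorded.
            return func(*args, **kwargs)
        _local.in_call = True
        try:
            recorded.append(func.__qualname__)
            return func(*args, **kwargs)
        finally:
            _local.in_call = False

    return wrapper


class PlainCached:
    """Context manager whose magic methods are NOT instrumented."""

    @instrument
    def unpersist(self):
        pass

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.unpersist()  # looks like a top-level user call to the logger


class InstrumentedCached:
    """Same context manager with the magic methods instrumented."""

    @instrument
    def unpersist(self):
        pass

    @instrument
    def __enter__(self):
        return self

    @instrument
    def __exit__(self, exc_type, exc_value, traceback):
        self.unpersist()  # nested inside __exit__, so it is not recorded


def run(cls):
    """Use cls in a with block and return what the logger recorded."""
    recorded.clear()
    with cls():
        pass
    return list(recorded)


print(run(PlainCached))
print(run(InstrumentedCached))
```

In the first case the logger sees only unpersist, a call the user never wrote. In the second it sees __enter__ and __exit__, which correspond to the with statement the user did write.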
was:
For example, for the following code:
{code:python}
pdf = pd.DataFrame(
    [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
)
psdf = ps.from_pandas(pdf)
with psdf.spark.cache() as cached_df:
    self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
    self.assert_eq(
        repr(cached_df.spark.storage_level),
        repr(StorageLevel(True, True, False, True)),
    )
{code}
pandas usage logger records
[self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
since __enter__ and __exit__ methods of
[CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
are not instrumented.
> Instrument __enter__ and __exit__ magic methods for Pandas module
> -----------------------------------------------------------------
>
> Key: SPARK-38353
> URL: https://issues.apache.org/jira/browse/SPARK-38353
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.2.1
> Reporter: Yihong He
> Priority: Minor