[spark] branch master updated: [SPARK-40598][PS] Fix plotting features work properly with pandas 1.5.0

xinrong Thu, 06 Oct 2022 19:27:39 -0700

This is an automated email from the ASF dual-hosted git repository.

xinrong pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new 6865309604a [SPARK-40598][PS] Fix plotting features work properly with 
pandas 1.5.0
6865309604a is described below

commit 6865309604a9986d902a6ff145f0855ee3fb7f8f
Author: itholic <haejoon....@databricks.com>
AuthorDate: Thu Oct 6 19:27:14 2022 -0700

    [SPARK-40598][PS] Fix plotting features work properly with pandas 1.5.0
    
    ### What changes were proposed in this pull request?
    
    This PR fixes the plotting functions so that they work properly with pandas 1.5.0.
    
    This includes two fixes:
    - Fix the `PandasOnSpark*Plot` classes so that each one exposes its plot kind as a plain string.
    - Change the default value of the `subplots` parameter of `plot_frame` to match the latest pandas. (`None` -> `False`)
    
    ### Why are the changes needed?
    
    #### 1. Getting `_kind` from the pandas class is no longer possible.
    
    We leverage the pandas plotting classes for the `matplotlib` implementation, and get the plot kind from the pandas class like:
    ```python
    >>> from pandas.plotting._matplotlib.core import AreaPlot
    >>> AreaPlot._kind
    'area'
    ```
    However, since pandas 1.5.0, the class attribute `_kind` has been converted into a `property`, so we can no longer get the plot kind from the pandas class, as shown below:
    ```python
    >>> from pandas.plotting._matplotlib.core import AreaPlot
    >>> AreaPlot._kind
    <property object at 0x7fe520d749a0>
    ```
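    
    A minimal sketch of why class-level access breaks and why a plain class attribute restores it. The classes `UpstreamAreaPlot` and `FixedAreaPlot` and both registries are hypothetical stand-ins, not the real pandas or pyspark classes, and it assumes the plot classes are looked up through a dict keyed on `_kind`, which the `KeyError: 'bar'` traceback further below suggests:

```python
# Hypothetical stand-ins for illustration; these are not the real pandas or
# pyspark classes.
class UpstreamAreaPlot:
    # Mimics pandas >= 1.5.0, where `_kind` is exposed as a property.
    @property
    def _kind(self):
        return "area"


class FixedAreaPlot(UpstreamAreaPlot):
    # A plain class attribute shadows the inherited property, so class-level
    # access returns the string again.
    _kind = "area"


# Class-level access to a property returns the property object itself,
# so the first registry ends up keyed on a property object, not on "area".
broken_registry = {UpstreamAreaPlot._kind: UpstreamAreaPlot}
fixed_registry = {FixedAreaPlot._kind: FixedAreaPlot}

print("area" in broken_registry)  # False
print("area" in fixed_registry)   # True
```

    Redefining `_kind` on each subclass keeps the lookup independent of how pandas chooses to expose the attribute.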
    
    #### 2. The `subplots` parameter no longer allows types other than `Iterable` or `bool`.
    
    We internally set the default value of `subplots` to `None`, but pandas 1.5.0 and later only allow an `Iterable` or a `bool`, so the plotting function fails as shown below:
    
    ```python
    >>> psdf.plot(kind="bar")
    Traceback (most recent call last):
    ...
    ValueError: subplots should be a bool or an iterable
    ```
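    
    The check behind this error can be sketched roughly as follows. The function `validate_subplots` is a hypothetical illustration that only reuses the error message from the traceback above; it is not the actual pandas implementation:

```python
from collections.abc import Iterable


def validate_subplots(subplots):
    """Hypothetical validator: accept a bool or an iterable, reject the rest."""
    if isinstance(subplots, bool):
        return subplots
    if not isinstance(subplots, Iterable):
        # `None` is neither a bool nor an iterable, so it lands here.
        raise ValueError("subplots should be a bool or an iterable")
    return list(subplots)


print(validate_subplots(False))  # False

try:
    validate_subplots(None)
except ValueError as exc:
    print(exc)  # subplots should be a bool or an iterable
```

    Under a check of this shape, a default of `False` passes where the old default of `None` is rejected.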
    
    With these fixes, plotting works properly with pandas 1.5.0, as shown below:
    
    **<For Series and DataFrame plot>**
    
    **Before**:
    ```python
    >>> from pyspark.pandas.config import set_option
    >>> set_option("plotting.backend", "matplotlib")
    >>> import pyspark.pandas as ps
    >>> psdf = ps.range(10)
    >>> psdf.plot(kind="bar")
    Traceback (most recent call last):
    ...
    KeyError: 'bar'
    ```
    
    **After**:
    ```python
    >>> from pyspark.pandas.config import set_option
    >>> set_option("plotting.backend", "matplotlib")
    >>> import pyspark.pandas as ps
    >>> psdf = ps.range(10)
    >>> psdf.plot(kind="bar")
    <AxesSubplot:>
    ```
    
    **<For DataFrame plot>**
    
    **Before**:
    ```python
    >>> from pyspark.pandas.config import set_option
    >>> set_option("plotting.backend", "matplotlib")
    >>> import pyspark.pandas as ps
    >>> psdf = ps.range(10)
    >>> psdf.plot(kind="bar")
    Traceback (most recent call last):
    ...
    ValueError: subplots should be a bool or an iterable
    ```
    
    **After**:
    ```python
    >>> from pyspark.pandas.config import set_option
    >>> set_option("plotting.backend", "matplotlib")
    >>> import pyspark.pandas as ps
    >>> psdf = ps.range(10)
    >>> psdf.plot(kind="bar")
    <AxesSubplot:>
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Manually tested with pandas 1.5.0.
    
    Closes #38033 from itholic/fix_plot_test.
    
    Authored-by: itholic <haejoon....@databricks.com>
    Signed-off-by: Xinrong Meng <xinr...@apache.org>
---
 python/pyspark/pandas/plot/matplotlib.py | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/pandas/plot/matplotlib.py 
b/python/pyspark/pandas/plot/matplotlib.py
index 6f542061297..02938c7f292 100644
--- a/python/pyspark/pandas/plot/matplotlib.py
+++ b/python/pyspark/pandas/plot/matplotlib.py
@@ -50,6 +50,8 @@ _all_kinds = PlotAccessor._all_kinds  # type: 
ignore[attr-defined]
 
 
 class PandasOnSparkBarPlot(PandasBarPlot, TopNPlotBase):
+    _kind = "bar"
+
     def __init__(self, data, **kwargs):
         super().__init__(self.get_top_n(data), **kwargs)
 
@@ -59,6 +61,8 @@ class PandasOnSparkBarPlot(PandasBarPlot, TopNPlotBase):
 
 
 class PandasOnSparkBoxPlot(PandasBoxPlot, BoxPlotBase):
+    _kind = "box"
+
     def boxplot(
         self,
         ax,
@@ -354,6 +358,8 @@ class PandasOnSparkBoxPlot(PandasBoxPlot, BoxPlotBase):
 
 
 class PandasOnSparkHistPlot(PandasHistPlot, HistogramPlotBase):
+    _kind = "hist"
+
     def _args_adjust(self):
         if is_list_like(self.bottom):
             self.bottom = np.array(self.bottom)
@@ -413,6 +419,8 @@ class PandasOnSparkHistPlot(PandasHistPlot, 
HistogramPlotBase):
 
 
 class PandasOnSparkPiePlot(PandasPiePlot, TopNPlotBase):
+    _kind = "pie"
+
     def __init__(self, data, **kwargs):
         super().__init__(self.get_top_n(data), **kwargs)
 
@@ -422,6 +430,8 @@ class PandasOnSparkPiePlot(PandasPiePlot, TopNPlotBase):
 
 
 class PandasOnSparkAreaPlot(PandasAreaPlot, SampledPlotBase):
+    _kind = "area"
+
     def __init__(self, data, **kwargs):
         super().__init__(self.get_sampled(data), **kwargs)
 
@@ -431,6 +441,8 @@ class PandasOnSparkAreaPlot(PandasAreaPlot, 
SampledPlotBase):
 
 
 class PandasOnSparkLinePlot(PandasLinePlot, SampledPlotBase):
+    _kind = "line"
+
     def __init__(self, data, **kwargs):
         super().__init__(self.get_sampled(data), **kwargs)
 
@@ -440,6 +452,8 @@ class PandasOnSparkLinePlot(PandasLinePlot, 
SampledPlotBase):
 
 
 class PandasOnSparkBarhPlot(PandasBarhPlot, TopNPlotBase):
+    _kind = "barh"
+
     def __init__(self, data, **kwargs):
         super().__init__(self.get_top_n(data), **kwargs)
 
@@ -449,6 +463,8 @@ class PandasOnSparkBarhPlot(PandasBarhPlot, TopNPlotBase):
 
 
 class PandasOnSparkScatterPlot(PandasScatterPlot, TopNPlotBase):
+    _kind = "scatter"
+
     def __init__(self, data, x, y, **kwargs):
         super().__init__(self.get_top_n(data), x, y, **kwargs)
 
@@ -458,6 +474,8 @@ class PandasOnSparkScatterPlot(PandasScatterPlot, 
TopNPlotBase):
 
 
 class PandasOnSparkKdePlot(PandasKdePlot, KdePlotBase):
+    _kind = "kde"
+
     def _compute_plot_data(self):
         self.data = KdePlotBase.prepare_kde_data(self.data)
 
@@ -707,7 +725,7 @@ def plot_frame(
     y=None,
     kind="line",
     ax=None,
-    subplots=None,
+    subplots=False,
     sharex=None,
     sharey=False,
     layout=None,


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
