zhengruifeng opened a new pull request, #46911:
URL: https://github.com/apache/spark/pull/46911

   ### What changes were proposed in this pull request?
   Throw `PandasNotImplementedError` for unsupported plotting functions:
   - {Frame, Series}.plot.hist
   - {Frame, Series}.plot.kde
   - {Frame, Series}.plot.density
   - {Frame, Series}.plot(kind="hist", ...)
   - {Frame, Series}.plot(kind="kde", ...)
   - {Frame, Series}.plot(kind="density", ...)
   
   
   ### Why are the changes needed?
   The previous error message is confusing:
   ```
   In [3]: psdf.plot.hist()
   /Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:1017: 
PandasAPIOnSparkAdviceWarning: The config 'spark.sql.ansi.enabled' is set to 
True. This can cause unexpected behavior from pandas API on Spark since pandas 
API on Spark follows the behavior of pandas, not SQL.
     warnings.warn(message, PandasAPIOnSparkAdviceWarning)
   
[*********************************************-----------------------------------]
 57.14% Complete (0 Tasks running, 1s, 
Scanned[*********************************************-----------------------------------]
 57.14% Complete (0 Tasks running, 1s, 
Scanned[*********************************************-----------------------------------]
 57.14% Complete (0 Tasks running, 1s, Scanned                                  
                                                                                
              
---------------------------------------------------------------------------
   PySparkAttributeError                     Traceback (most recent call last)
   Cell In[3], line 1
   ----> 1 psdf.plot.hist()
   
   File ~/Dev/spark/python/pyspark/pandas/plot/core.py:951, in 
PandasOnSparkPlotAccessor.hist(self, bins, **kwds)
       903 def hist(self, bins=10, **kwds):
       904     """
       905     Draw one histogram of the DataFrame’s columns.
       906     A `histogram`_ is a representation of the distribution of data.
      (...)
       949         >>> df.plot.hist(bins=12, alpha=0.5)  # doctest: +SKIP
       950     """
   --> 951     return self(kind="hist", bins=bins, **kwds)
   
   File ~/Dev/spark/python/pyspark/pandas/plot/core.py:580, in 
PandasOnSparkPlotAccessor.__call__(self, kind, backend, **kwargs)
       577 kind = {"density": "kde"}.get(kind, kind)
       578 if hasattr(plot_backend, "plot_pandas_on_spark"):
       579     # use if there's pandas-on-Spark specific method.
   --> 580     return plot_backend.plot_pandas_on_spark(plot_data, kind=kind, 
**kwargs)
       581 else:
       582     # fallback to use pandas'
       583     if not PandasOnSparkPlotAccessor.pandas_plot_data_map[kind]:
   
   File ~/Dev/spark/python/pyspark/pandas/plot/plotly.py:41, in 
plot_pandas_on_spark(data, kind, **kwargs)
        39     return plot_pie(data, **kwargs)
        40 if kind == "hist":
   ---> 41     return plot_histogram(data, **kwargs)
        42 if kind == "box":
        43     return plot_box(data, **kwargs)
   
   File ~/Dev/spark/python/pyspark/pandas/plot/plotly.py:87, in 
plot_histogram(data, **kwargs)
        85 psdf, bins = HistogramPlotBase.prepare_hist_data(data, bins)
        86 assert len(bins) > 2, "the number of buckets must be higher than 2."
   ---> 87 output_series = HistogramPlotBase.compute_hist(psdf, bins)
        88 prev = float("%.9f" % bins[0])  # to make it prettier, truncate.
        89 text_bins = []
   
   File ~/Dev/spark/python/pyspark/pandas/plot/core.py:189, in 
HistogramPlotBase.compute_hist(psdf, bins)
       183 for group_id, (colname, bucket_name) in enumerate(zip(colnames, 
bucket_names)):
       184     # creates a Bucketizer to get corresponding bin of each value
       185     bucketizer = Bucketizer(
       186         splits=bins, inputCol=colname, outputCol=bucket_name, 
handleInvalid="skip"
       187     )
   --> 189     bucket_df = bucketizer.transform(sdf)
       191     if output_df is None:
       192         output_df = bucket_df.select(
       193             F.lit(group_id).alias("__group_id"), 
F.col(bucket_name).alias("__bucket")
       194         )
   
   File ~/Dev/spark/python/pyspark/ml/base.py:260, in 
Transformer.transform(self, dataset, params)
       258         return self.copy(params)._transform(dataset)
       259     else:
   --> 260         return self._transform(dataset)
       261 else:
       262     raise TypeError("Params must be a param map but got %s." % 
type(params))
   
   File ~/Dev/spark/python/pyspark/ml/wrapper.py:412, in 
JavaTransformer._transform(self, dataset)
       409 assert self._java_obj is not None
       411 self._transfer_params_to_java()
   --> 412 return DataFrame(self._java_obj.transform(dataset._jdf), 
dataset.sparkSession)
   
   File ~/Dev/spark/python/pyspark/sql/connect/dataframe.py:1696, in 
DataFrame.__getattr__(self, name)
      1694 def __getattr__(self, name: str) -> "Column":
      1695     if name in ["_jseq", "_jdf", "_jmap", "_jcols", "rdd", "toJSON"]:
   -> 1696         raise PySparkAttributeError(
      1697             error_class="JVM_ATTRIBUTE_NOT_SUPPORTED", 
message_parameters={"attr_name": name}
      1698         )
      1700     if name not in self.columns:
      1701         raise PySparkAttributeError(
      1702             error_class="ATTRIBUTE_NOT_SUPPORTED", 
message_parameters={"attr_name": name}
      1703         )
   
   PySparkAttributeError: [JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jdf` is not 
supported in Spark Connect as it depends on the JVM. If you need to use this 
attribute, do not use Spark Connect when creating your session. Visit 
https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession
 for creating regular Spark Session in detail.
   ```
   
   after this PR:
   ```
   In [3]: psdf.plot.hist()
   ---------------------------------------------------------------------------
   PandasNotImplementedError                 Traceback (most recent call last)
   Cell In[3], line 1
   ----> 1 psdf.plot.hist()
   
   File ~/Dev/spark/python/pyspark/pandas/plot/core.py:957, in 
PandasOnSparkPlotAccessor.hist(self, bins, **kwds)
       909 """
       910 Draw one histogram of the DataFrame’s columns.
       911 A `histogram`_ is a representation of the distribution of data.
      (...)
       954     >>> df.plot.hist(bins=12, alpha=0.5)  # doctest: +SKIP
       955 """
       956 if is_remote():
   --> 957     return unsupported_function(class_name="pd.DataFrame", 
method_name="hist")()
       959 return self(kind="hist", bins=bins, **kwds)
   
   File ~/Dev/spark/python/pyspark/pandas/missing/__init__.py:23, in 
unsupported_function.<locals>.unsupported_function(*args, **kwargs)
        22 def unsupported_function(*args, **kwargs):
   ---> 23     raise PandasNotImplementedError(
        24         class_name=class_name, method_name=method_name, reason=reason
        25     )
   
   PandasNotImplementedError: The method `pd.DataFrame.hist()` is not 
implemented yet.
   ```
   
   
   ### Does this PR introduce _any_ user-facing change?
   yes, error message improvement
   
   
   ### How was this patch tested?
   CI
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to