itholic opened a new pull request #34389:
URL: https://github.com/apache/spark/pull/34389


   ### What changes were proposed in this pull request?
   
   This PR proposes adding a utility function that raises advice warnings for 
the pandas API on Spark.
   
   These warnings differ from the existing warnings that general Python, 
PySpark, and pandas users already recognize: they flag things that need special 
attention in the pandas API on Spark, so I think it is better to manage them 
separately.
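
A minimal sketch of what such a helper could look like, assuming the dedicated category simply subclasses `Warning`. The helper name `log_advice` and the toy `to_list` function are hypothetical stand-ins for illustration; only the `warnings.warn(message, PandasAPIOnSparkAdviceWarning)` call is taken from the output shown in this PR.

```python
import warnings


class PandasAPIOnSparkAdviceWarning(Warning):
    """Stand-in for the dedicated advice category (assumed to subclass Warning)."""


def log_advice(message: str) -> None:
    # Hypothetical helper name; the PR output only shows this warnings.warn call.
    warnings.warn(message, PandasAPIOnSparkAdviceWarning)


def to_list(values):
    # Toy stand-in for an API that collects all data into the driver's memory.
    log_advice(
        "`to_list` loads all the data into the driver's memory. "
        "It should only be used if the resulting list is expected to be small."
    )
    return list(values)
```

Keeping the advice in its own category means users could silence only these messages with the standard library, for example `warnings.simplefilter("ignore", PandasAPIOnSparkAdviceWarning)`, without hiding other warnings.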
   
   ### Why are the changes needed?
   
   The pandas API on Spark has functions that existing pandas users who are 
not familiar with distributed environments should be aware of, both to avoid 
confusion and because they can cause serious performance degradation.
   
   For example:
   - `sort_index`, `len`, `sort_values`: such functions can degrade 
performance because they go through the entire data set
   - `to_xxx`, `read_xxx`: if `index_col` is not specified for some I/O 
functions, a default index is attached, which is expensive (and the existing 
index is lost)
   - `to_list`, `to_pandas`, `to_markdown`: such functions load all the data 
into the driver's memory, so they can potentially cause an OOM error.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, pandas-on-Spark users will see a warning when they use inefficient or 
potentially dangerous functions.
   
   
   ### How was this patch tested?
   
   Manually checked the behavior of each function one by one.
   
   ```python
   >>> import pyspark.pandas as ps
   >>> psser = ps.Series([1, 2, 3, 4])
   >>> psser.to_list()
   .../spark/python/pyspark/pandas/utils.py:968: PandasAPIOnSparkAdviceWarning: 
`to_list` loads the all data into the driver's memory. It should only be used 
if the resulting list is expected to be small.
     warnings.warn(message, PandasAPIOnSparkAdviceWarning)
   [1, 2, 3, 4]
   ```
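
The manual check above could also be automated with the standard `warnings.catch_warnings` context manager. The sketch below is an assumption-laden illustration, not the PR's test: it uses a stand-in warning class and a hypothetical `to_list_with_advice` function so that it runs without a Spark installation.

```python
import warnings


class PandasAPIOnSparkAdviceWarning(Warning):
    """Stand-in so the sketch runs without a Spark installation."""


def to_list_with_advice(values):
    # Hypothetical stand-in for psser.to_list(): emit the advice, then collect.
    warnings.warn(
        "`to_list` loads all the data into the driver's memory.",
        PandasAPIOnSparkAdviceWarning,
    )
    return list(values)


def check_advice_is_raised():
    # Record every warning raised inside the block instead of printing it.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = to_list_with_advice([1, 2, 3, 4])
    categories = [w.category for w in caught]
    return result, categories
```

Against the real API, the same pattern would wrap the actual `psser.to_list()` call and assert that the recorded category is the advice warning.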


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


