[ https://issues.apache.org/jira/browse/SPARK-38183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-38183.
----------------------------------
    Fix Version/s: 3.3.0
       Resolution: Fixed

Issue resolved by pull request 35488
[https://github.com/apache/spark/pull/35488]

> Show warning when creating pandas-on-Spark session under ANSI mode.
> -------------------------------------------------------------------
>
>                 Key: SPARK-38183
>                 URL: https://issues.apache.org/jira/browse/SPARK-38183
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Haejoon Lee
>            Assignee: Haejoon Lee
>            Priority: Major
>             Fix For: 3.3.0
>
>
> Since pandas API on Spark follows the behavior of pandas, not SQL, some 
> unexpected behaviors can occur when "spark.sql.ansi.enabled" is set to true.
> For example,
>  * It raises an exception when {{div}}- and {{mod}}-related methods return 
> null (e.g. {{DataFrame.rmod}})
> {code:java}
> >>> df
>    angels  degress
> 0       0      360
> 1       3      180
> 2       4      360
> >>> df.rmod(2)
> Traceback (most recent call last):
> ...
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 32.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 32.0 (TID 165) (172.30.1.44 executor driver): 
> org.apache.spark.SparkArithmeticException: divide by zero. To return NULL 
> instead, use 'try_divide'. If necessary set spark.sql.ansi.enabled to false 
> (except for ANSI interval type) to bypass this error.{code}
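The contrast the error message describes can be sketched in plain Python. This is a hypothetical illustration of the two semantics, not Spark's or pandas' implementation: ANSI mode raises on modulo-by-zero where pandas would return null.

```python
import math

# Hypothetical helpers contrasting the two behaviors; they are
# illustrations, not part of Spark or pandas.
def ansi_mod(a, b):
    # ANSI-SQL-like semantics: modulo by zero is an error
    if b == 0:
        raise ZeroDivisionError("divide by zero")
    return a % b

def pandas_like_mod(a, b):
    # pandas-like semantics: null (NaN) instead of an error,
    # which is what 'try_divide'-style functions emulate
    return math.nan if b == 0 else a % b

print(pandas_like_mod(2, 0))  # nan, where ANSI mode would raise
print(pandas_like_mod(2, 3))  # 2
```

Because pandas API on Spark promises the pandas-like column, hitting the ANSI-like path at runtime surfaces as the SparkArithmeticException above.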
>  * It raises an exception when the columns of the DataFrame passed to 
> {{ps.melt}} do not all have the same type.
>  
> {code:java}
> >>> df
>    A  B  C
> 0  a  1  2
> 1  b  3  4
> 2  c  5  6
> >>> ps.melt(df)
> Traceback (most recent call last):
> ...
> pyspark.sql.utils.AnalysisException: cannot resolve 'array(struct('A', A), 
> struct('B', B), struct('C', C))' due to data type mismatch: input to function 
> array should all be the same type, but it's 
> [struct<variable:string,value:string>, struct<variable:string,value:bigint>, 
> struct<variable:string,value:bigint>]
> To fix the error, you might need to add explicit type casts. If necessary set 
> spark.sql.ansi.enabled to false to bypass this error.;
> 'Project [__index_level_0__#223L, A#224, B#225L, C#226L, 
> __natural_order__#231L, explode(array(struct(variable, A, value, A#224), 
> struct(variable, B, value, B#225L), struct(variable, C, value, C#226L))) AS 
> pairs#269]
> +- Project [__index_level_0__#223L, A#224, B#225L, C#226L, 
> monotonically_increasing_id() AS __natural_order__#231L]
>    +- LogicalRDD [__index_level_0__#223L, A#224, B#225L, C#226L], false{code}
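Why the struct types disagree follows from what melt does: it stacks every value column into a single {{value}} column, so a string column A next to integer columns B and C yields mixed value types, which ANSI mode refuses to cast implicitly. A rough stdlib sketch of the unpivot (hypothetical, not the actual implementation):

```python
# Hypothetical sketch of the unpivot ps.melt performs: each (column, cell)
# pair becomes one {variable, value} record, so mixed column dtypes produce
# a mixed-type 'value' column.
def melt(rows, columns):
    pairs = []
    for row in rows:
        for variable, value in zip(columns, row):
            pairs.append({"variable": variable, "value": value})
    return pairs

pairs = melt([["a", 1, 2], ["b", 3, 4]], ["A", "B", "C"])
print({type(p["value"]).__name__ for p in pairs})  # mixed: {'str', 'int'}
```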
>  * It raises an exception when {{CategoricalIndex.remove_categories}} 
> doesn't remove the entire index
> {code:java}
> >>> idx
> CategoricalIndex(['a', 'b', 'b', 'c', 'c', 'c'], categories=['a', 'b', 'c'], 
> ordered=False, dtype='category')
> >>> idx.remove_categories('b')
> 22/02/14 09:16:14 ERROR Executor: Exception in task 2.0 in stage 41.0 (TID 
> 215)
> org.apache.spark.SparkNoSuchElementException: Key b does not exist. If 
> necessary set spark.sql.ansi.strictIndexOperator to false to bypass this 
> error.
> ...
> ...{code}
>  * It raises an exception when {{CategoricalIndex.set_categories}} doesn't 
> set the entire index
> {code:java}
> >>> idx.set_categories(['b', 'c'])
> 22/02/14 09:16:14 ERROR Executor: Exception in task 2.0 in stage 41.0 (TID 
> 215)
> org.apache.spark.SparkNoSuchElementException: Key a does not exist. If 
> necessary set spark.sql.ansi.strictIndexOperator to false to bypass this 
> error.
> ...
> ...{code}
>  * It raises an exception when {{ps.to_numeric}} gets a non-numeric type
> {code:java}
> >>> psser
> 0    apple
> 1      1.0
> 2        2
> 3       -3
> dtype: object
> >>> ps.to_numeric(psser)
> 22/02/14 09:22:36 ERROR Executor: Exception in task 2.0 in stage 63.0 (TID 
> 328)
> org.apache.spark.SparkNumberFormatException: invalid input syntax for type 
> numeric: apple. To return NULL instead, use 'try_cast'. If necessary set 
> spark.sql.ansi.enabled to false to bypass this error.
> ...{code}
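The error message points at the underlying difference: a strict (ANSI-like) cast errors on a non-numeric string, while a {{try_cast}}-style cast returns null. A stdlib sketch of the two (hypothetical helpers, not Spark's implementation):

```python
# Hypothetical sketch of the strict (ANSI-like) vs lenient ('try_cast'-like)
# string-to-number casts behind ps.to_numeric.
def strict_cast(value):
    return float(value)  # raises ValueError for 'apple', as ANSI mode does

def try_cast(value):
    try:
        return float(value)
    except (TypeError, ValueError):
        return None  # null instead of an error
```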
>  * It raises an exception when {{strings.StringMethods.rsplit}} (and also 
> {{strings.StringMethods.split}}) with {{expand=True}} returns null columns
> {code:java}
> >>> s
> 0                       this is a regular sentence
> 1    https://docs.python.org/3/tutorial/index.html
> 2                                             None
> dtype: object
> >>> s.str.split(n=4, expand=True)
> 22/02/14 09:26:23 ERROR Executor: Exception in task 5.0 in stage 69.0 (TID 
> 356)
> org.apache.spark.SparkArrayIndexOutOfBoundsException: Invalid index: 1, 
> numElements: 1. If necessary set spark.sql.ansi.strictIndexOperator to false 
> to bypass this error.{code}
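The invalid-index error arises because {{expand=True}} indexes each row's token list up to the width of the widest row, and short rows overrun unless out-of-range indexing yields null. A hypothetical stdlib sketch of that expansion (not Spark's implementation):

```python
# Hypothetical sketch of split(expand=True): split every string, then index
# each token list up to the widest row; under ANSI's strict index operator
# the short rows error instead of filling with null.
def expand_split(strings, n, strict=True):
    rows = [s.split(None, n) if s is not None else [] for s in strings]
    width = max(len(r) for r in rows)
    table = []
    for tokens in rows:
        row = []
        for i in range(width):
            if i < len(tokens):
                row.append(tokens[i])
            elif strict:
                raise IndexError(f"Invalid index: {i}, numElements: {len(tokens)}")
            else:
                row.append(None)  # non-strict: null instead of an error
        table.append(row)
    return table
```

In the example above, the URL row splits into a single token, so indexing position 1 is what triggers the SparkArrayIndexOutOfBoundsException.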
>  * It raises an exception when calling {{astype}} with a 
> {{CategoricalDtype}} whose categories do not match the data.
> {code:java}
> >>> psser
> 0    1994-01-31
> 1    1994-02-01
> 2    1994-02-02
> dtype: object
> >>> cat_type
> CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
> >>> psser.astype(cat_type)
> 22/02/14 09:34:56 ERROR Executor: Exception in task 5.0 in stage 90.0 (TID 
> 468)
> org.apache.spark.SparkNoSuchElementException: Key 1994-02-01 does not exist. 
> If necessary set spark.sql.ansi.strictIndexOperator to false to bypass this 
> error.{code}
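This failure (like the Key-does-not-exist errors in the {{remove_categories}} and {{set_categories}} examples above) comes from mapping each value to its category's position: under ANSI's strict index operator, a value absent from the categories is an error rather than a null/-1 code. A hypothetical sketch of that lookup (not Spark's implementation):

```python
# Hypothetical sketch of the category lookup behind astype(CategoricalDtype):
# map each value to the position of its category; a value outside the
# categories either errors (strict, ANSI-like) or becomes -1 (pandas-like).
def to_category_codes(values, categories, strict=True):
    positions = {c: i for i, c in enumerate(categories)}
    codes = []
    for v in values:
        if v in positions:
            codes.append(positions[v])
        elif strict:
            raise KeyError(f"Key {v} does not exist")
        else:
            codes.append(-1)  # pandas marks values outside the categories as -1
    return codes
```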
> Not only in these example cases: whenever an internal SQL function used to 
> implement a pandas-on-Spark function behaves differently depending on the 
> ANSI options, an unexpected error may occur.
> So we might need to show a proper warning message when creating a 
> pandas-on-Spark session.
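A minimal sketch of the proposed warning. The helper name, its argument, and the message wording are assumptions for illustration; the actual change is in the linked pull request.

```python
import warnings

# Hypothetical sketch of the warning this issue proposes; the function name
# and message are assumptions, not the code merged in the pull request.
def warn_if_ansi_enabled(ansi_enabled_conf):
    """Warn when 'spark.sql.ansi.enabled' (passed as its string value) is true."""
    enabled = str(ansi_enabled_conf).strip().lower() == "true"
    if enabled:
        warnings.warn(
            "'spark.sql.ansi.enabled' is set to true. Pandas API on Spark "
            "follows the behavior of pandas, not SQL; some operations may "
            "raise unexpected errors under ANSI mode.",
            UserWarning,
        )
    return enabled
```

In practice the string value would come from the session configuration at the time the pandas-on-Spark session is created.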



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
