[ 
https://issues.apache.org/jira/browse/SPARK-35480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christopher Bryant updated SPARK-35480:
---------------------------------------
    Description: 
The percentile_approx PySpark function does not appear to treat the "accuracy" 
parameter correctly when pivoting on a column, causing the query below to fail 
(this also fails if the accuracy parameter is left unspecified):
----
{{import pyspark.sql.functions as F}}

{{df = sc.parallelize([}}
{{    ["a", -1.0],}}
{{    ["a", 5.5],}}
{{    ["a", 2.5],}}
{{    ["b", 3.0],}}
{{    ["b", 5]}}
{{]).toDF(["type", "value"]) \}}
{{    .groupBy() \}}
{{    .pivot("type", ["a", "b"]) \}}
{{    .agg(F.percentile_approx("value", [0.5], 10000).alias("percentiles"))}}
----
Error message: 

{{AnalysisException: cannot resolve 'percentile_approx((IF((`type` <=> CAST('a' 
AS STRING)), `value`, CAST(NULL AS DOUBLE))), (IF((`type` <=> CAST('a' AS 
STRING)), array(0.5D), NULL)), (IF((`type` <=> CAST('a' AS STRING)), 10000, 
CAST(NULL AS INT))))' due to data type mismatch: The accuracy or percentage 
provided must be a constant literal; 'Aggregate [percentile_approx(if 
((type#242 <=> cast(a as string))) value#243 else cast(null as double), if 
((type#242 <=> cast(a as string))) array(0.5) else cast(null as array<double>), 
if ((type#242 <=> cast(a as string))) 10000 else cast(null as int), 0, 0) AS 
a#251, percentile_approx(if ((type#242 <=> cast(b as string))) value#243 else 
cast(null as double), if ((type#242 <=> cast(b as string))) array(0.5) else 
cast(null as array<double>), if ((type#242 <=> cast(b as string))) 10000 else 
cast(null as int), 0, 0) AS b#253] +- LogicalRDD [type#242, value#243], false}}

 

  was:
The percentile_approx PySpark function does not appear to treat the "accuracy" 
parameter correctly when pivoting on a column, causing the query below to fail 
(this also fails if the accuracy parameter is left unspecified):
----
{{import pyspark.sql.functions as F}}

{{df = sc.parallelize([}}
{{    ["a", -1.0],}}
{{    ["a", 5.5],}}
{{    ["a", 2.5],}}
{{    ["b", 3.0],}}
{{    ["b", 5]}}
{{]).toDF(["type", "value"]) \}}
{{    .groupBy() \}}
{{    .pivot("type", ["a", "b"]) \}}
{{    .agg(F.percentile_approx("value", [0.5], 10000).alias("percentiles"))}}
----
Error message: 

{{AnalysisException: cannot resolve 'percentile_approx((IF((`type` <=> CAST('a' 
AS STRING)), `value`, CAST(NULL AS DOUBLE))), (IF((`type` <=> CAST('a' AS 
STRING)), array(0.5D), NULL)), (IF((`type` <=> CAST('a' AS STRING)), 10000, 
CAST(NULL AS INT))))' due to data type mismatch: The accuracy or percentage 
provided must be a constant literal; 'Aggregate [percentile_approx(if 
((type#242 <=> cast(a as string))) value#243 else cast(null as double), if 
((type#242 <=> cast(a as string))) array(0.5) else cast(null as array<double>), 
if ((type#242 <=> cast(a as string))) 10000 else cast(null as int), 0, 0) AS 
a#251, percentile_approx(if ((type#242 <=> cast(b as string))) value#243 else 
cast(null as double), if ((type#242 <=> cast(b as string))) array(0.5) else 
cast(null as array<double>), if ((type#242 <=> cast(b as string))) 10000 else 
cast(null as int), 0, 0) AS b#253] +- LogicalRDD [type#242, value#243], false}}

 


> percentile_approx function doesn't work with pivot
> --------------------------------------------------
>
>                 Key: SPARK-35480
>                 URL: https://issues.apache.org/jira/browse/SPARK-35480
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.1.1
>            Reporter: Christopher Bryant
>            Priority: Major
>
> The percentile_approx PySpark function does not appear to treat the 
> "accuracy" parameter correctly when pivoting on a column, causing the query 
> below to fail (this also fails if the accuracy parameter is left unspecified):
> ----
> {{import pyspark.sql.functions as F}}
> {{df = sc.parallelize([}}
> {{    ["a", -1.0],}}
> {{    ["a", 5.5],}}
> {{    ["a", 2.5],}}
> {{    ["b", 3.0],}}
> {{    ["b", 5]}}
> {{]).toDF(["type", "value"]) \}}
> {{    .groupBy() \}}
> {{    .pivot("type", ["a", "b"]) \}}
> {{    .agg(F.percentile_approx("value", [0.5], 10000).alias("percentiles"))}}
> ----
> Error message: 
> {{AnalysisException: cannot resolve 'percentile_approx((IF((`type` <=> 
> CAST('a' AS STRING)), `value`, CAST(NULL AS DOUBLE))), (IF((`type` <=> 
> CAST('a' AS STRING)), array(0.5D), NULL)), (IF((`type` <=> CAST('a' AS 
> STRING)), 10000, CAST(NULL AS INT))))' due to data type mismatch: The 
> accuracy or percentage provided must be a constant literal; 'Aggregate 
> [percentile_approx(if ((type#242 <=> cast(a as string))) value#243 else 
> cast(null as double), if ((type#242 <=> cast(a as string))) array(0.5) else 
> cast(null as array<double>), if ((type#242 <=> cast(a as string))) 10000 else 
> cast(null as int), 0, 0) AS a#251, percentile_approx(if ((type#242 <=> cast(b 
> as string))) value#243 else cast(null as double), if ((type#242 <=> cast(b as 
> string))) array(0.5) else cast(null as array<double>), if ((type#242 <=> 
> cast(b as string))) 10000 else cast(null as int), 0, 0) AS b#253] +- 
> LogicalRDD [type#242, value#243], false}}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
