[
https://issues.apache.org/jira/browse/SPARK-39070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Felix Altenberg updated SPARK-39070:
------------------------------------
Description:
The following code raises an exception starting in Spark version 3.2.0:
{code:python}
import pyspark.sql.functions as sf
from pyspark.sql import Row

# Assumes an existing SparkSession named "spark" (e.g. from the pyspark shell).
data = [
    Row(
        other_value=10,
        ships=[
            Row(
                fields=[
                    Row(name="field1", value=1),
                    Row(name="field2", value=2),
                ]
            ),
            Row(
                fields=[
                    Row(name="field1", value=3),
                    Row(name="field3", value=4),
                ]
            ),
        ],
    )
]

df = spark.createDataFrame(data)

# First explode: one row per ship.
df = df.withColumn("ships", sf.explode("ships")).select(
    sf.col("other_value"), sf.col("ships.*")
)

# Second explode: one row per (name, value) field.
df = df.withColumn("fields", sf.explode("fields")).select(
    "other_value", "fields.*"
)

df = df.groupBy("other_value").pivot("name")  # ERROR THROWN HERE
{code}
The "pivot("name")" is what causes the error to be thrown.
With even one less explode (by manually constructing the input DataFrame in the
shape it would have after the first explode), the error does not appear (see the
sketch below).
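For reference, a minimal sketch of that single-explode variant (the rows below are hand-written to match the shape df has after the first explode; they are illustrative, not taken verbatim from our data):
{code:python}
import pyspark.sql.functions as sf
from pyspark.sql import Row

# Hand-built rows in the shape the repro's df has after the first
# explode: one row per ship, each still carrying its "fields" array.
pre_exploded = [
    Row(
        other_value=10,
        fields=[
            Row(name="field1", value=1),
            Row(name="field2", value=2),
        ],
    ),
    Row(
        other_value=10,
        fields=[
            Row(name="field1", value=3),
            Row(name="field3", value=4),
        ],
    ),
]

df2 = spark.createDataFrame(pre_exploded)
df2 = df2.withColumn("fields", sf.explode("fields")).select(
    "other_value", "fields.*"
)
df2.groupBy("other_value").pivot("name")  # no error with a single explode
{code}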
Adding a .cache() before the groupBy and pivot also makes this error disappear,
as sketched below.
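A minimal sketch, continuing from the repro's df right before the failing line:
{code:python}
# Workaround reported above: calling cache() on the twice-exploded
# DataFrame before the pivot makes the CompileException disappear.
df = df.cache()
df = df.groupBy("other_value").pivot("name")  # no error after cache()
{code}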
Strangely, this error does not appear when running the above code on Databricks
using Databricks Runtime 10.4, even though that runtime is also based on Spark 3.2.1.
I have attached the full stack trace of the error below.
I will try to investigate more after work today and will add anything else I
find.
was:
The following code raises an exception starting in Spark version 3.2.0:
{code:python}
import pyspark.sql.functions as sf
from pyspark.sql import Row

data = [
    Row(
        other_value=10,
        ships=[
            Row(
                fields=[
                    Row(name="field1", value=1),
                    Row(name="field2", value=2),
                ]
            ),
            Row(
                fields=[
                    Row(name="field1", value=3),
                    Row(name="field3", value=4),
                ]
            ),
        ],
    )
]

df = spark.createDataFrame(data)
df = df.withColumn("ships", sf.explode("ships")).select(
    sf.col("other_value"), sf.col("ships.*")
)
df = df.withColumn("fields", sf.explode("fields")).select(
    "other_value", "fields.*"
)
df = df.groupBy("other_value").pivot("name")  # ERROR THROWN HERE
{code}
The "pivot("name")" is what causes the error to be thrown.
With even one less explode (by manually constructing the input DataFrame in the
shape it would have after the first explode), the error does not appear.
Adding a .cache() before the groupBy and pivot also makes this error disappear.
I have attached the full stack trace of the error below.
I will try to investigate more after work today and will add anything else I
find.
> Pivoting a dataframe after two explodes raises an
> org.codehaus.commons.compiler.CompileException
> ------------------------------------------------------------------------------------------------
>
> Key: SPARK-39070
> URL: https://issues.apache.org/jira/browse/SPARK-39070
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.0, 3.2.1
> Environment: Tested on:
> macOS 12.3.1, Python versions 3.8.10 and 3.9.12
> Ubuntu 20.04, Python version 3.8.10
> Tested Spark versions 3.1.2, 3.1.3, 3.2.0 and 3.2.1, installed through conda
> and pip
> Reporter: Felix Altenberg
> Priority: Major
> Attachments: spark_pivot_error.txt
>
>
> The following code raises an exception starting in Spark version 3.2.0:
> {code:python}
> import pyspark.sql.functions as sf
> from pyspark.sql import Row
>
> data = [
>     Row(
>         other_value=10,
>         ships=[
>             Row(
>                 fields=[
>                     Row(name="field1", value=1),
>                     Row(name="field2", value=2),
>                 ]
>             ),
>             Row(
>                 fields=[
>                     Row(name="field1", value=3),
>                     Row(name="field3", value=4),
>                 ]
>             ),
>         ],
>     )
> ]
>
> df = spark.createDataFrame(data)
> df = df.withColumn("ships", sf.explode("ships")).select(
>     sf.col("other_value"), sf.col("ships.*")
> )
> df = df.withColumn("fields", sf.explode("fields")).select(
>     "other_value", "fields.*"
> )
> df = df.groupBy("other_value").pivot("name")  # ERROR THROWN HERE
> {code}
> The "pivot("name")" is what causes the error to be thrown.
> With even one less explode (by manually constructing the input DataFrame in
> the shape it would have after the first explode), the error does not appear.
> Adding a .cache() before the groupBy and pivot also makes this error
> disappear.
> Strangely, this error does not appear when running the above code on
> Databricks using Databricks Runtime 10.4, even though that runtime is also
> based on Spark 3.2.1.
> I have attached the full stack trace of the error below.
> I will try to investigate more after work today and will add anything else I
> find.
>
>
--
This message was sent by Atlassian Jira
(v8.20.7#820007)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]