[
https://issues.apache.org/jira/browse/SPARK-35876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wenchen Fan updated SPARK-35876:
--------------------------------
Fix Version/s: 3.0.4, 3.1.3
> array_zip unexpected column names
> ---------------------------------
>
> Key: SPARK-35876
> URL: https://issues.apache.org/jira/browse/SPARK-35876
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.2
> Reporter: Derk Crezee
> Assignee: Kousuke Saruta
> Priority: Major
> Fix For: 3.2.0, 3.1.3, 3.0.4
>
>
> When I use the arrays_zip function in combination with renamed columns,
> I get an unexpected schema written to disk.
> {code:python}
> from pyspark.sql import Row, SparkSession
> from pyspark.sql.functions import arrays_zip, col
>
> spark = SparkSession.builder.getOrCreate()
> data = [
>     Row(a1=["a", "a"], b1=["b", "b"]),
> ]
> df = (
>     spark.sparkContext.parallelize(data).toDF()
>     .withColumnRenamed("a1", "a2")
>     .withColumnRenamed("b1", "b2")
>     .withColumn("zipped", arrays_zip(col("a2"), col("b2")))
> )
> # In-memory schema: the struct fields use the renamed columns (a2, b2).
> df.printSchema()
> # root
> #  |-- a2: array (nullable = true)
> #  |    |-- element: string (containsNull = true)
> #  |-- b2: array (nullable = true)
> #  |    |-- element: string (containsNull = true)
> #  |-- zipped: array (nullable = true)
> #  |    |-- element: struct (containsNull = false)
> #  |    |    |-- a2: string (nullable = true)
> #  |    |    |-- b2: string (nullable = true)
> df.write.save("test.parquet")
> # Schema read back from disk: the struct fields use the old names (a1, b1).
> spark.read.load("test.parquet").printSchema()
> # root
> #  |-- a2: array (nullable = true)
> #  |    |-- element: string (containsNull = true)
> #  |-- b2: array (nullable = true)
> #  |    |-- element: string (containsNull = true)
> #  |-- zipped: array (nullable = true)
> #  |    |-- element: struct (containsNull = true)
> #  |    |    |-- a1: string (nullable = true)
> #  |    |    |-- b1: string (nullable = true)
> {code}
> I would expect the schema of the DataFrame written to disk to be the same as
> the one printed out. Instead, the struct fields inside zipped use the
> original column names (a1, b1) rather than the renamed ones (a2, b2).
>
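The mismatch described above can be located mechanically by comparing field paths in the two schemas. The following is only a minimal sketch, not part of the report: field_paths and report_schema are hypothetical helpers, and the dictionaries are hand-built to mirror the JSON form that DataFrame.schema.jsonValue() returns for the two schemas printed in the description. Only struct and array types are walked; map types are ignored.

```python
def field_paths(dtype, prefix=""):
    """Return dotted paths of all named fields in a schema given as JSON-style dicts."""
    paths = []
    if isinstance(dtype, dict):
        kind = dtype.get("type")
        if kind == "struct":
            for field in dtype["fields"]:
                path = f"{prefix}.{field['name']}" if prefix else field["name"]
                paths.append(path)
                paths.extend(field_paths(field["type"], path))
        elif kind == "array":
            # Array elements have no name of their own; descend into the element type.
            paths.extend(field_paths(dtype["elementType"], prefix))
    return paths


def report_schema(zipped_names, contains_null):
    """Hand-built stand-in for the schemas shown in the report (an assumption)."""
    def field(name, dtype):
        return {"name": name, "type": dtype, "nullable": True, "metadata": {}}

    strings = {"type": "array", "elementType": "string", "containsNull": True}
    element = {"type": "struct", "fields": [field(n, "string") for n in zipped_names]}
    zipped = {"type": "array", "elementType": element, "containsNull": contains_null}
    return {
        "type": "struct",
        "fields": [field("a2", strings), field("b2", strings), field("zipped", zipped)],
    }


in_memory = report_schema(["a2", "b2"], contains_null=False)  # df.printSchema()
read_back = report_schema(["a1", "b1"], contains_null=True)   # schema after reload

only_in_memory = sorted(set(field_paths(in_memory)) - set(field_paths(read_back)))
only_on_disk = sorted(set(field_paths(read_back)) - set(field_paths(in_memory)))
print(only_in_memory)  # ['zipped.a2', 'zipped.b2']
print(only_on_disk)    # ['zipped.a1', 'zipped.b1']
```

With a live session, the same comparison should work by passing df.schema.jsonValue() and spark.read.load("test.parquet").schema.jsonValue() to field_paths in place of the hand-built dictionaries.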
--
This message was sent by Atlassian Jira
(v8.3.4#803005)