[jira] [Updated] (SPARK-35876) array_zip unexpected column names

Wenchen Fan (Jira) Tue, 24 Aug 2021 08:35:05 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-35876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Wenchen Fan updated SPARK-35876:
--------------------------------
    Fix Version/s: 3.0.4
                   3.1.3

> array_zip unexpected column names
> ---------------------------------
>
>                 Key: SPARK-35876
>                 URL: https://issues.apache.org/jira/browse/SPARK-35876
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.1.2
>            Reporter: Derk Crezee
>            Assignee: Kousuke Saruta
>            Priority: Major
>             Fix For: 3.2.0, 3.1.3, 3.0.4
>
>
> When I use the {{arrays_zip}} function in combination with renamed columns, 
> I get an unexpected schema written to disk.
> {code:python}
> from pyspark.sql import Row, SparkSession
> from pyspark.sql.functions import arrays_zip, col
>
> spark = SparkSession.builder.getOrCreate()
> data = [
>   Row(a1=["a", "a"], b1=["b", "b"]),
> ]
> # Rename both columns, then zip the renamed arrays into an array of structs.
> df = (
>   spark.sparkContext.parallelize(data).toDF()
>     .withColumnRenamed("a1", "a2")
>     .withColumnRenamed("b1", "b2")
>     .withColumn("zipped", arrays_zip(col("a2"), col("b2")))
> )
> # In-memory schema: the struct fields carry the renamed columns (a2, b2).
> df.printSchema()
> # root
> #  |-- a2: array (nullable = true)
> #  |    |-- element: string (containsNull = true)
> #  |-- b2: array (nullable = true)
> #  |    |-- element: string (containsNull = true)
> #  |-- zipped: array (nullable = true)
> #  |    |-- element: struct (containsNull = false)
> #  |    |    |-- a2: string (nullable = true)
> #  |    |    |-- b2: string (nullable = true)
> # Schema after a round trip through disk: the struct fields revert to a1, b1.
> df.write.save("test.parquet")
> spark.read.load("test.parquet").printSchema()
> # root
> #  |-- a2: array (nullable = true)
> #  |    |-- element: string (containsNull = true)
> #  |-- b2: array (nullable = true)
> #  |    |-- element: string (containsNull = true)
> #  |-- zipped: array (nullable = true)
> #  |    |-- element: struct (containsNull = true)
> #  |    |    |-- a1: string (nullable = true)
> #  |    |    |-- b1: string (nullable = true){code}
> I would expect the schema written to disk to match the one printed by 
> df.printSchema(). Instead, the struct fields of the zipped column are written 
> with the original column names (a1, b1), not the renamed ones (a2, b2).
>  
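A possible workaround on affected versions, offered as an assumption and not taken from the ticket, is to cast the zipped column to an explicit type so that the struct field names come from the cast and not from the column references. The sketch below only builds the DDL type string; the helper name `zipped_ddl` is hypothetical, and whether the cast avoids the problem on a particular Spark version has not been verified here.

```python
def zipped_ddl(fields):
    """Build a DDL type string for an array of structs.

    `fields` is a list of (name, spark_sql_type) pairs, e.g. [("a2", "string")].
    Names are inserted as-is, so this sketch assumes plain identifiers that
    need no backtick quoting.
    """
    inner = ",".join(f"{name}:{dtype}" for name, dtype in fields)
    return f"array<struct<{inner}>>"


# Type string matching the schema the reporter expected to find on disk.
ddl = zipped_ddl([("a2", "string"), ("b2", "string")])
print(ddl)  # array<struct<a2:string,b2:string>>

# Hypothetical usage with the DataFrame from the report (requires PySpark):
# df = df.withColumn("zipped", col("zipped").cast(ddl))
```

Pinning the names through a cast keeps the workaround local to the one column; upgrading to a release listed under "Fix For" removes the need for it.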



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
