guykhazma commented on pull request #28826:
URL: https://github.com/apache/spark/pull/28826#issuecomment-647571832


   @viirya `derivedFromAtt` is set to `false` when the expression requests 
a nested field.
   The metadata for a nested column is not preserved (this is also the case in 
Spark 2.4), so I am not sure what the expected behaviour should be here. 
   Note that the metadata for a nested field is also not preserved when using:
   ```Scala
   df.select("col_a.name").schema
   ```
   where `df` is the DataFrame that was created locally with the specified 
schema.
   If this is considered a bug, we can fix it as well (it will require 
some more changes).
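   
   A minimal sketch for checking this locally, assuming a local `SparkSession` 
and the same schema as in the test below. The object name 
`NestedMetadataCheck` and the field names `declared` and `selected` are 
illustrative only. It prints the metadata declared on `col_a.name` next to the 
metadata reported after `select("col_a.name")`; the result may differ between 
Spark versions, so no expected output is given.
   
   ```Scala
   import org.apache.spark.sql.{Row, SparkSession}
   import org.apache.spark.sql.types._

   object NestedMetadataCheck {
     // Metadata attached to the nested field col_a.name
     val nameMeta: Metadata = new MetadataBuilder().putString("check", "b").build()

     val schema: StructType = StructType(Seq(
       StructField("col_a", StructType(Seq(
         StructField("name", StringType, nullable = true, nameMeta),
         StructField("age", IntegerType, nullable = true)
       )), nullable = true),
       StructField("col_b", StringType, nullable = true)
     ))

     // Metadata as declared on the nested field in the schema itself
     val declared: Metadata =
       schema("col_a").dataType.asInstanceOf[StructType]("name").metadata

     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .master("local[1]") // assumption: local run, no cluster
         .appName("nested-metadata-check")
         .getOrCreate()
       try {
         val df = spark.createDataFrame(
           spark.sparkContext.parallelize(Seq(Row(Row("a", 45), "b"))),
           schema
         )
         // Metadata reported for the column after selecting the nested field
         val selected = df.select("col_a.name").schema("name").metadata
         println(s"declared:  $declared")
         println(s"selected:  $selected")
         println(s"preserved: ${declared == selected}")
       } finally {
         spark.stop()
       }
     }
   }
   ```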
   
   For example, this test triggers a code path where `derivedFromAtt` is 
`false`, but it currently passes because metadata is not preserved for nested 
columns:
   
   ```Scala
     test("SPARK-31988 - make sure schema metadata is preserved - nested schema") {
       withSQLConf((SQLConf.USE_V1_SOURCE_LIST.key,
         "avro,csv,json,kafka,orc,text,parquet")) {
         withTempPath { f =>
           // create custom dataset with schema metadata
           val data = Seq(
             Row(Row("a", 45), "b")
           )
           val schema = List(
             StructField("col_a", StructType(
               List(
                 StructField("name", StringType, true,
                   new MetadataBuilder().putString("check", "b").build()),
                 StructField("age", IntegerType, true)
               )
             ), true,
               new MetadataBuilder().putString("key", "value").build()),
             StructField("col_b", StringType, true)
           )
   
           val df = spark.createDataFrame(
             spark.sparkContext.parallelize(data),
             StructType(schema)
           )
           df.write.parquet(f.getAbsolutePath)
   
           // read from storage
           val readDF = spark.read.parquet(f.getAbsolutePath)
           // write again
           withTempPath { f =>
             readDF.select("col_a.name").write.parquet(f.getAbsolutePath)
             // read again and verify the schema is equal
             // (including the metadata)
             val readDF2 = spark.read.parquet(f.getAbsolutePath)
             assert(readDF.select("col_a.name").schema == readDF2.schema)
           }
         }
       }
     }
   ```
   


