guykhazma opened a new pull request #28826:
URL: https://github.com/apache/spark/pull/28826
### What changes were proposed in this pull request?
Fixing the `getRootFields` function to preserve attribute metadata
### Why are the changes needed?
Without this fix, an attribute's metadata can be silently lost in some code
paths.
For example, when a Parquet file whose schema carries metadata is read and
then written back, the Parquet footer of the new file no longer contains the
metadata that was originally there.
Simple code to reproduce (assuming DataSource V2):
```Scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// create a dataset whose schema carries metadata on col_a
val data = Seq(Row("a", "b"))
val schema = List(
  StructField("col_a", StringType, true,
    new MetadataBuilder().putString("key", "value").build()),
  StructField("col_b", StringType, true)
)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)
// write
df.write.parquet("/tmp/check")
// read back and verify the metadata is still present
val readDF = spark.read.parquet("/tmp/check")
readDF.schema.foreach(s => println(s.metadata))
// write again
readDF.write.parquet("/tmp/check2")
// read again and observe that the metadata is gone
val readDF2 = spark.read.parquet("/tmp/check2")
readDF2.schema.foreach(s => println(s.metadata))
```
Note that this doesn't happen in datasource v1.
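While the patch itself is not shown here, the underlying pitfall can be sketched with plain Spark SQL types: `StructField`'s `metadata` parameter defaults to `Metadata.empty`, so any code that rebuilds a field without explicitly passing the metadata through silently drops it. A minimal sketch (the field `original` is illustrative, not code taken from `getRootFields`):

```scala
import org.apache.spark.sql.types._

val meta = new MetadataBuilder().putString("key", "value").build()
val original = StructField("col_a", StringType, nullable = true, meta)

// Rebuilding the field without passing metadata silently drops it,
// because the `metadata` parameter defaults to Metadata.empty:
val dropped = StructField(original.name, original.dataType, original.nullable)

// Carrying the metadata through explicitly preserves it:
val preserved =
  StructField(original.name, original.dataType, original.nullable,
    original.metadata)

println(dropped.metadata)    // empty metadata
println(preserved.metadata)  // contains "key" -> "value"
```

The fix amounts to the second form: whenever an attribute is reconstructed, its metadata must be threaded through rather than left to the default.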
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No tests were added, as this is a minor change to a private function that
currently has no test coverage.