comphead commented on issue #1789:
URL:
https://github.com/apache/datafusion-comet/issues/1789#issuecomment-2920846368
So the issue comes down to the reader: Spark can adapt to the Parquet schema on
the fly, whereas DataFusion's schema handling is still static. To reproduce the
problem, create some data:
```
val q = """
  | select map(str0, str1) c0 from
  | (
  |   select named_struct('a', cast(3 as long), 'b', cast(4 as long), 'c', cast(5 as long)) str0,
  |          named_struct('x', cast(6 as long), 'y', 'abc', 'z', cast(8 as long)) str1 union all
  |   select named_struct('a', cast(31 as long), 'b', cast(41 as long), 'c', cast(51 as long)), null
  | )
  |""".stripMargin
spark.sql(q).repartition(1).write.parquet("/tmp/t1")
```
Check the Parquet file metadata. It shows that the fields `x`, `y`, `z` are all
required, but the enclosing group `value` is optional. This is quite odd. The
Spark schema stored in the file properties also marks the `value` fields as not
nullable, although the data contains a null value:
```
parquet meta
/tmp/t1/part-00000-340a4bdf-2e2c-42a8-a38a-01b47ab7d3c0-c000.snappy.parquet
File path:
/tmp/t1/part-00000-340a4bdf-2e2c-42a8-a38a-01b47ab7d3c0-c000.snappy.parquet
Created by: parquet-mr version 1.13.1 (build
db4183109d5b734ec5930d870cdae161e408ddba)
Properties:
org.apache.spark.version: 3.5.5
org.apache.spark.sql.parquet.row.metadata:
{"type":"struct","fields":[{"name":"c0","type":{"type":"map","keyType":{"type":"struct","fields":[{"name":"a","type":"long","nullable":false,"metadata":{}},{"name":"b","type":"long","nullable":false,"metadata":{}},{"name":"c","type":"long","nullable":false,"metadata":{}}]},"valueType":{"type":"struct","fields":[{"name":"x","type":"long","nullable":false,"metadata":{}},{"name":"y","type":"string","nullable":false,"metadata":{}},{"name":"z","type":"long","nullable":false,"metadata":{}}]},"valueContainsNull":true},"nullable":false,"metadata":{}}]}
Schema:
message spark_schema {
  required group c0 (MAP) {
    repeated group key_value {
      required group key {
        required int64 a;
        required int64 b;
        required int64 c;
      }
      optional group value {
        required int64 x;
        required binary y (STRING);
        required int64 z;
      }
    }
  }
}
```
Now read the file back through Spark and print the schema. Note that `x`, `y`,
`z` are nullable now, although they are required in the Parquet file.
```
scala> spark.read.parquet("/tmp/t1").printSchema
root
|-- c0: map (nullable = true)
| |-- key: struct
| | |-- a: long (nullable = true)
| | |-- b: long (nullable = true)
| | |-- c: long (nullable = true)
| |-- value: struct (valueContainsNull = true)
| | |-- x: long (nullable = true)
| | |-- y: string (nullable = true)
| | |-- z: long (nullable = true)
```
I assume Spark is smart enough to infer the nullability of `x`, `y`, `z` from
the fact that the `value` group is optional in the Parquet file, and relaxes the
schema accordingly. As far as I know, DataFusion cannot do that.
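A minimal sketch of the relaxation described above, in plain Scala with no Spark dependency (the `Field`/`Leaf`/`Group` types and the `relax` helper are hypothetical illustrations, not Spark's actual API): when a parent group is optional, a reader that materializes missing groups as nulls must treat every descendant field as nullable, even if the Parquet schema marks it required.

```scala
// Hypothetical, simplified model of a nested schema (not Spark's API).
sealed trait Field { def name: String; def nullable: Boolean }
case class Leaf(name: String, nullable: Boolean) extends Field
case class Group(name: String, nullable: Boolean, children: List[Field]) extends Field

// If a group is optional (nullable), its children may surface as null
// whenever the group itself is null, so relax their nullability too.
def relax(f: Field, parentNullable: Boolean = false): Field = f match {
  case Leaf(n, nu) => Leaf(n, nu || parentNullable)
  case Group(n, nu, cs) =>
    val nullable = nu || parentNullable
    Group(n, nullable, cs.map(relax(_, nullable)))
}

// Mirrors the file above: `value` is optional but x, y, z are required.
val value = Group("value", nullable = true, List(
  Leaf("x", nullable = false),
  Leaf("y", nullable = false),
  Leaf("z", nullable = false)))

val relaxed = relax(value)
// All fields of the optional group end up nullable, matching what
// spark.read.parquet(...).printSchema reports.
```

A fix on the DataFusion/Comet side would presumably need an equivalent pass when converting the Parquet schema to the reader's schema, rather than taking the required/optional flags at face value.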
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]