Hello Devs,
I was measuring struct-read performance between the V1 and V2
datasources and found that although the Iceberg Reader implements
`SupportsPushDownRequiredColumns`, it doesn't seem to prune nested column
projections. I'd like to be able to prune on nested fields. Does the V2
datasource have any provision to let Iceberg decide this? The
`SupportsPushDownRequiredColumns` mix-in hands over the entire struct field
even when only a sub-field is requested.
*Here's an illustration:*
scala> spark.sql("select location.lat from iceberg_people_struct").show()
+-------+
| lat|
+-------+
| null|
|101.123|
|175.926|
+-------+
The pruning callback receives the entire struct instead of just `location.lat`:
*public void pruneColumns(StructType newRequestedSchema)*
19/08/30 16:25:38 WARN Reader: => Prune columns : {
  "type" : "struct",
  "fields" : [ {
    "name" : "location",
    "type" : {
      "type" : "struct",
      "fields" : [ {
        "name" : "lat",
        "type" : "double",
        "nullable" : true,
        "metadata" : { }
      }, {
        "name" : "lon",
        "type" : "double",
        "nullable" : true,
        "metadata" : { }
      } ]
    },
    "nullable" : true,
    "metadata" : { }
  } ]
}
Is there information available in the IcebergSource (or could some be added)
that could be used to prune the exact sub-field here? What's a good way to
approach this? For dense/wide struct fields this significantly affects
performance.
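For what it's worth, the pruning behavior I'm after could be sketched roughly
like this. This is plain Java with no Spark dependency; the nested-map schema
representation and the `prune` helper are hypothetical stand-ins for Spark's
`StructType` and the `pruneColumns` hook, just to illustrate intersecting the
requested dotted paths against the full schema instead of keeping whole
structs:

```java
import java.util.*;

public class NestedPrune {
    // A field maps to either a leaf type name (String) or a nested struct
    // (LinkedHashMap of sub-fields). LinkedHashMap preserves field order.
    @SuppressWarnings("unchecked")
    static LinkedHashMap<String, Object> prune(LinkedHashMap<String, Object> schema,
                                               List<String> paths) {
        LinkedHashMap<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : schema.entrySet()) {
            List<String> tails = new ArrayList<>();
            boolean requested = false, wholeField = false;
            for (String p : paths) {
                // Split "location.lat" into ["location", "lat"].
                String[] parts = p.split("\\.", 2);
                if (!parts[0].equals(field.getKey())) continue;
                requested = true;
                if (parts.length == 1) wholeField = true;
                else tails.add(parts[1]);
            }
            if (!requested) continue; // field not referenced by any path: drop it
            if (wholeField || !(field.getValue() instanceof LinkedHashMap)) {
                out.put(field.getKey(), field.getValue()); // keep leaf / whole struct
            } else {
                // Only sub-fields were requested: recurse with the path tails.
                out.put(field.getKey(),
                        prune((LinkedHashMap<String, Object>) field.getValue(), tails));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        LinkedHashMap<String, Object> location = new LinkedHashMap<>();
        location.put("lat", "double");
        location.put("lon", "double");
        LinkedHashMap<String, Object> schema = new LinkedHashMap<>();
        schema.put("location", location);
        schema.put("name", "string");

        System.out.println(prune(schema, List.of("location.lat")));
        // {location={lat=double}}
    }
}
```

With this kind of intersection, `select location.lat` would push down only
`location.lat` rather than the whole `location` struct shown in the log above.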
Sample gist:
https://gist.github.com/prodeezy/001cf155ff0675be7d307e9f842e1dac
thanks and regards,
-Gautam.