Hello Devs,
I was measuring struct-read performance between the V1 and V2
datasources and found that although the Iceberg Reader implements
`SupportsPushDownRequiredColumns`, it doesn't seem to prune nested column
projections. I'd like to be able to prune on nested fields. Does the V2
datasource have any provision to let Iceberg decide this? The
`SupportsPushDownRequiredColumns` mix-in hands over the entire struct field
even when only a sub-field is requested.
*Here's an illustration:*
scala> spark.sql("select location.lat from iceberg_people_struct").show()
+-------+
| lat|
+-------+
| null|
|101.123|
|175.926|
+-------+
The pruning callback receives the entire struct instead of just `location.lat`:
*public void pruneColumns(StructType newRequestedSchema)*
19/08/30 16:25:38 WARN Reader: => Prune columns : {
  "type" : "struct",
  "fields" : [ {
    "name" : "location",
    "type" : {
      "type" : "struct",
      "fields" : [ {
        "name" : "lat",
        "type" : "double",
        "nullable" : true,
        "metadata" : { }
      }, {
        "name" : "lon",
        "type" : "double",
        "nullable" : true,
        "metadata" : { }
      } ]
    },
    "nullable" : true,
    "metadata" : { }
  } ]
}
Is there information available in the IcebergSource (or could some be added)
that could be used to prune the exact sub-field here? What's a good way to
approach this? For dense/wide struct fields this significantly affects
performance.
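For what it's worth, the pruning behavior I'm after could be sketched roughly
like this. This is plain Java with no Spark dependency; the nested-map schema
representation and the `prune` helper are hypothetical stand-ins for Spark's
`StructType` and the `pruneColumns` hook, just to illustrate intersecting the
requested dotted paths against the full schema instead of keeping whole
structs:

```java
import java.util.*;

public class NestedPrune {
    // A field maps to either a leaf type name (String) or a nested struct
    // (LinkedHashMap of sub-fields). LinkedHashMap preserves field order.
    @SuppressWarnings("unchecked")
    static LinkedHashMap<String, Object> prune(LinkedHashMap<String, Object> schema,
                                               List<String> paths) {
        LinkedHashMap<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : schema.entrySet()) {
            List<String> tails = new ArrayList<>();
            boolean requested = false, wholeField = false;
            for (String p : paths) {
                // Split "location.lat" into ["location", "lat"].
                String[] parts = p.split("\\.", 2);
                if (!parts[0].equals(field.getKey())) continue;
                requested = true;
                if (parts.length == 1) wholeField = true;
                else tails.add(parts[1]);
            }
            if (!requested) continue; // field not referenced by any path: drop it
            if (wholeField || !(field.getValue() instanceof LinkedHashMap)) {
                out.put(field.getKey(), field.getValue()); // keep leaf / whole struct
            } else {
                // Only sub-fields were requested: recurse with the path tails.
                out.put(field.getKey(),
                        prune((LinkedHashMap<String, Object>) field.getValue(), tails));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        LinkedHashMap<String, Object> location = new LinkedHashMap<>();
        location.put("lat", "double");
        location.put("lon", "double");
        LinkedHashMap<String, Object> schema = new LinkedHashMap<>();
        schema.put("location", location);
        schema.put("name", "string");

        System.out.println(prune(schema, List.of("location.lat")));
        // {location={lat=double}}
    }
}
```

With this kind of intersection, `select location.lat` would push down only
`location.lat` rather than the whole `location` struct shown in the log above.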
Sample gist:
https://gist.github.com/prodeezy/001cf155ff0675be7d307e9f842e1dac
thanks and regards,
-Gautam.