prodeezy edited a comment on issue #136: Collect lower/upper bounds for nested struct fields in ParquetMetrics
URL: https://github.com/apache/incubator-iceberg/pull/136#issuecomment-474692281

Thanks for this PR @aokolnychyi, ran an end to end test with this patch applied on latest code in master ..

1) Used a local spark instance that contains [feature to pushdown struct filters](https://github.com/apache/spark/pull/22573) and verified that filter is pushed down to data source

```
scala> spark.sql("select * from iceberg_people_struct_metrics where location.lat = 101.123 ").explain()
== Physical Plan ==
*(1) Project [age#0, name#1, friends#2, location#3]
+- *(1) Filter (isnotnull(location#3) && (location#3.lat = 101.123))
   +- *(1) ScanV2 iceberg[age#0, name#1, friends#2, location#3] (Filters: [isnotnull(location#3), (location#3.lat = 101.123)], Options: [path=iceberg-people-struct-metrics,paths=[]])
```

2) Created Parquet data using these metrics. Verified that the struct's leaf field metrics are stored now ..

```
avro-tools tojson iceberg-people-struct-metrics/metadata/e4f66767-8baa-4dee-8b3d-56a0c1d99464-m0.avro | jq

"lower_bounds": {
  "array": [
    { "key": 1, "value": "\u0013\u0000\u0000\u0000" },
    { "key": 2, "value": "Andy" },
    { "key": 7, "value": "\u001dZd;ßGY@" },
    { "key": 8, "value": " \u001a/Ý$4@" }
  ]
},
"upper_bounds": {
  "array": [
    { "key": 1, "value": "\u001e\u0000\u0000\u0000" },
    { "key": 2, "value": "Michael" },
    { "key": 7, "value": "\u0012ÀÊ¡ýe@" },
    { "key": 8, "value": "¶óýÔx)I@" }
```

3) Applied my [struct filtering patch](https://github.com/apache/incubator-iceberg/pull/123) and ran filters on above table ..
```
scala> spark.sql("select * from iceberg_people_struct_metrics where location.lat = 101.123 ").show()
+---+----+--------------------+-----------------+
|age|name|             friends|         location|
+---+----+--------------------+-----------------+
| 30|Andy|[Josh -> 10, Bisw...|[101.123, 50.324]|
+---+----+--------------------+-----------------+

scala> spark.sql("select * from iceberg_people_struct_metrics where location.lat < 101.123 ").show()
+---+----+-------+--------+
|age|name|friends|location|
+---+----+-------+--------+
+---+----+-------+--------+

scala> spark.sql("select * from iceberg_people_struct_metrics where location.lat > 200 ").show()
+---+----+-------+--------+
|age|name|friends|location|
+---+----+-------+--------+
+---+----+-------+--------+
```

- Verified that struct filters don't fail and the expression evaluation in InclusiveMetricsEvaluator uses lower/upper bounds (using debug breakpoints)
- Verified that out of bounds cases for eq, lt, gt, etc skip files / row groups appropriately
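For anyone reading the bounds dump in step 2, here is a rough sketch of how those values can be decoded and how min/max bounds allow a file to be skipped. It is not the actual `InclusiveMetricsEvaluator` code, which also has to deal with nulls, missing bounds and other cases. Assumptions on my side: each character in the jq output stands for one byte (so `ß` is `0xDF`), key 7 is the field id of `location.lat`, and the `BoundsSketch` object and its `mightMatch*` helpers are hypothetical names. The little-endian layout for int and double bounds follows the single-value serialization in the Iceberg spec.

```scala
import java.nio.{ByteBuffer, ByteOrder}

object BoundsSketch {
  // Helper so that values above 0x7f (e.g. 0xdf) can be written as plain ints.
  def bytes(xs: Int*): Array[Byte] = xs.map(_.toByte).toArray

  // Int bounds are serialized as 4-byte little-endian values.
  def decodeInt(b: Array[Byte]): Int =
    ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN).getInt()

  // Double bounds are serialized as 8-byte little-endian IEEE 754 values.
  def decodeDouble(b: Array[Byte]): Double =
    ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN).getDouble()

  // Inclusive checks: true means "the file might contain a match, read it",
  // false means "no row can match, the file can be skipped".
  def mightMatchEq(lower: Double, upper: Double, v: Double): Boolean =
    lower <= v && v <= upper
  def mightMatchLt(lower: Double, v: Double): Boolean = lower < v
  def mightMatchGt(upper: Double, v: Double): Boolean = upper > v

  // Key 1 (age): "\u0013\u0000\u0000\u0000" and "\u001e\u0000\u0000\u0000".
  val ageLower: Int = decodeInt(bytes(0x13, 0x00, 0x00, 0x00)) // 19
  val ageUpper: Int = decodeInt(bytes(0x1e, 0x00, 0x00, 0x00)) // 30

  // Key 7 lower bound: "\u001dZd;ßGY@", read as one byte per character (assumption).
  val latLower: Double =
    decodeDouble(bytes(0x1d, 0x5a, 0x64, 0x3b, 0xdf, 0x47, 0x59, 0x40)) // 101.123
}
```

Under these assumptions, `BoundsSketch.mightMatchLt(BoundsSketch.latLower, 101.123)` is `false`, which is consistent with the empty result for `location.lat < 101.123` in step 3. The upper bound for key 7 is not decoded here because its rendered string has fewer than 8 visible characters.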
