[GitHub] [incubator-iceberg] prodeezy edited a comment on issue #136: Collect lower/upper bounds for nested struct fields in ParquetMetrics

GitBox Tue, 19 Mar 2019 22:44:56 -0700

prodeezy edited a comment on issue #136: Collect lower/upper bounds for nested 
struct fields in ParquetMetrics
URL: https://github.com/apache/incubator-iceberg/pull/136#issuecomment-474692281
 
 
   Thanks for this PR, @aokolnychyi. I ran an end-to-end test with this patch 
applied on top of the latest code in master.
   
   1) Used a local Spark instance that contains the [feature to push down 
struct filters](https://github.com/apache/spark/pull/22573) and verified that 
the filter is pushed down to the data source:
   
   ```
   scala> spark.sql("select * from iceberg_people_struct_metrics where location.lat = 101.123 ").explain()
   == Physical Plan ==
   *(1) Project [age#0, name#1, friends#2, location#3]
   +- *(1) Filter (isnotnull(location#3) && (location#3.lat = 101.123))
      +- *(1) ScanV2 iceberg[age#0, name#1, friends#2, location#3] (Filters: [isnotnull(location#3), (location#3.lat = 101.123)], Options: [path=iceberg-people-struct-metrics,paths=[]])
   
   ```
   
   2) Created Parquet data with this patch collecting the metrics. Verified 
that the metrics for the struct's leaf fields are now stored:
   
   ```
   avro-tools tojson iceberg-people-struct-metrics/metadata/e4f66767-8baa-4dee-8b3d-56a0c1d99464-m0.avro | jq
   
   
   
       "lower_bounds": {
         "array": [
           {
             "key": 1,
             "value": "\u0013\u0000\u0000\u0000"
           },
           {
             "key": 2,
             "value": "Andy"
           },
           {
             "key": 7,
             "value": "\u001dZd;ßGY@"
           },
           {
             "key": 8,
             "value": " \u001a/Ý$4@"
           }
         ]
       },
       "upper_bounds": {
         "array": [
           {
             "key": 1,
             "value": "\u001e\u0000\u0000\u0000"
           },
           {
             "key": 2,
             "value": "Michael"
           },
           {
             "key": 7,
             "value": "\u0012ÀÊ¡ýe@"
           },
           {
             "key": 8,
             "value": "¶óýÔx)I@"
           }
   
   ```
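For anyone reading the dump above, here is a minimal sketch of how the raw bound values can be decoded. It assumes two things: that `avro-tools tojson` renders each byte of a `bytes` field as the character with the same code point, and that Iceberg serializes int and double bounds as little-endian values. The helper names are made up for this example, and mapping keys 7 and 8 to `location.lat` and `location.lon` is my inference from the values rather than something read from the schema. Only the values that are fully legible in the dump are decoded.

```python
import struct


def raw_bytes(json_value: str) -> bytes:
    # Assumption: each character in the JSON string stands for one byte
    # (code points 0-255), so Latin-1 encoding recovers the original bytes.
    return json_value.encode("latin-1")


def decode_int(json_value: str) -> int:
    # Assumption: 4-byte little-endian signed integer.
    return struct.unpack("<i", raw_bytes(json_value))[0]


def decode_double(json_value: str) -> float:
    # Assumption: 8-byte little-endian IEEE 754 double.
    return struct.unpack("<d", raw_bytes(json_value))[0]


# Key 1 (age): lower and upper bounds copied from the dump.
age_lower = decode_int("\u0013\u0000\u0000\u0000")
age_upper = decode_int("\u001e\u0000\u0000\u0000")

# Key 7 lower bound; "\u00df" is the "ß" character shown in the dump.
lat_lower = decode_double("\u001dZd;\u00dfGY@")

# Key 8 upper bound; the escapes spell out "¶óýÔ" from the dump.
lon_upper = decode_double("\u00b6\u00f3\u00fd\u00d4x)I@")

print(age_lower, age_upper, lat_lower, lon_upper)
```

The decoded age bounds are 19 and 30, and the two doubles come out at roughly 101.123 and 50.324, which lines up with the `location` values of the row returned by the queries in step 3.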
   
   
   3) Applied my [struct filtering 
patch](https://github.com/apache/incubator-iceberg/pull/123) and ran filters 
on the above table:
   
   ```
   
   
   scala> spark.sql("select * from iceberg_people_struct_metrics where location.lat = 101.123 ").show()
   +---+----+--------------------+-----------------+
   |age|name|             friends|         location|
   +---+----+--------------------+-----------------+
   | 30|Andy|[Josh -> 10, Bisw...|[101.123, 50.324]|
   +---+----+--------------------+-----------------+
   
   scala> spark.sql("select * from iceberg_people_struct_metrics where location.lat < 101.123 ").show()
   +---+----+-------+--------+
   |age|name|friends|location|
   +---+----+-------+--------+
   +---+----+-------+--------+
   
   scala> spark.sql("select * from iceberg_people_struct_metrics where location.lat > 200 ").show()
   +---+----+-------+--------+
   |age|name|friends|location|
   +---+----+-------+--------+
   +---+----+-------+--------+
   ```
   
   - Verified that struct filters don't fail and that expression evaluation in 
InclusiveMetricsEvaluator uses the lower/upper bounds (checked with debug 
breakpoints).
   - Verified that out-of-bounds cases for eq, lt, gt, etc. skip files / row 
groups appropriately.
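
To make the skipping behaviour concrete, here is a rough sketch of the kind of inclusive bounds check involved. This is not the code in InclusiveMetricsEvaluator; `might_match` is a hypothetical helper, and the upper bound of 175.0 is a placeholder picked for illustration, not a value read from the manifest. The idea is that a file can be skipped only when the bounds prove that no row can satisfy the predicate.

```python
def might_match(op: str, value: float, lower: float, upper: float) -> bool:
    """Return False only when the [lower, upper] bounds rule out every row."""
    if op == "eq":
        return lower <= value <= upper
    if op == "lt":
        # Some row may be < value only if the smallest value is < value.
        return lower < value
    if op == "lt_eq":
        return lower <= value
    if op == "gt":
        # Some row may be > value only if the largest value is > value.
        return upper > value
    if op == "gt_eq":
        return upper >= value
    raise ValueError(f"unsupported operator: {op}")


# Lower bound as used in the queries above; upper bound is a placeholder.
LAT_LOWER, LAT_UPPER = 101.123, 175.0

results = {
    "lat = 101.123": might_match("eq", 101.123, LAT_LOWER, LAT_UPPER),
    "lat < 101.123": might_match("lt", 101.123, LAT_LOWER, LAT_UPPER),
    "lat > 200": might_match("gt", 200.0, LAT_LOWER, LAT_UPPER),
}
for predicate, keep in results.items():
    print(predicate, "-> read file" if keep else "-> skip file")
```

Under these assumed bounds, the equality filter keeps the file while the `<` and `>` filters skip it, which is consistent with the empty results shown in step 3.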


