rdblue edited a comment on issue #123: Add support for struct field based 
filtering
URL: https://github.com/apache/incubator-iceberg/pull/123#issuecomment-475823022
 
 
   @prodeezy, this is looking good. I think you have the right approach, but 
some things are missing.
   
   With this patch, `BoundReference` has the correct type and field ID. That 
works for `InclusiveMetricsEvaluator` and the Parquet filters that look up data 
based on ID. But it isn't enough for other evaluators that require binding:
   
   * `Evaluator` calls `BoundReference.get` to get a value from a `StructLike`. 
That method needs to be updated to handle struct nesting, preferably using an 
[`Accessor`](https://github.com/rdblue/iceberg/blob/support-predicates-on-nested-fields/api/src/main/java/com/netflix/iceberg/expressions/BoundReference.java#L61-L217)
 (that's an unfinished branch of mine working on this problem). As it is now, 
evaluators would fail because `get` will try to access position -1 for any 
nested field.
   * `InclusiveManifestEvaluator` uses `BoundReference.pos` to get the correct 
`PartitionFieldSummary` from a `ManifestFile` instance. The partition field 
summaries are stored in the order of partition fields, so binding to the 
partition struct gives the correct position in the field summaries list. I 
think this would actually work because all fields are top-level, but I think 
there should be a better signal than returning -1.
   
   This should also bind differently. We never parse identifiers in Iceberg. 
Instead of parsing, we build indexes using the possible field names. That way, 
we never have to worry about quoting.
   
   For example, if a user passes in an expression for "a.b.c", that could 
multiple paths: "a" / "b" / "c" or "a.b" / "c" or "a" / "b.c" or "a.b.c". But 
the important thing is that there can be only once column that flattens to 
"a.b.c" because the columns are ambiguous otherwise. Rather than searching, we 
keep a map in the schema from "a.b.c" to the right field ID.
   
   Expression binding should use an index like this and then build accessors 
using the actual path of field names.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to