rdblue edited a comment on issue #123: Add support for struct field based filtering URL: https://github.com/apache/incubator-iceberg/pull/123#issuecomment-475823022 @prodeezy, this is looking good. I think you have the right approach, but some things are missing. With this patch, `BoundReference` has the correct type and field ID. That works for `InclusiveMetricsEvaluator` and the Parquet filters that look up data based on ID. But it isn't enough for other evaluators that require binding: * `Evaluator` calls `BoundReference.get` to get a value from a `StructLike`. That method needs to be updated to handle struct nesting, preferably using an [`Accessor`](https://github.com/rdblue/iceberg/blob/support-predicates-on-nested-fields/api/src/main/java/com/netflix/iceberg/expressions/BoundReference.java#L61-L217) (that's an unfinished branch of mine working on this problem). As it is now, evaluators would fail because `get` will try to access position -1 for any nested field. * `InclusiveManifestEvaluator` uses `BoundReference.pos` to get the correct `PartitionFieldSummary` from a `ManifestFile` instance. The partition field summaries are stored in the order of partition fields, so binding to the partition struct gives the correct position in the field summaries list. I think this would actually work because all fields are top-level, but I think there should be a better signal than returning -1. This should also bind differently. We never parse identifiers in Iceberg. Instead of parsing, we build indexes using the possible field names. That way, we never have to worry about quoting. For example, if a user passes in an expression for "a.b.c", that could multiple paths: "a" / "b" / "c" or "a.b" / "c" or "a" / "b.c" or "a.b.c". But the important thing is that there can be only once column that flattens to "a.b.c" because the columns are ambiguous otherwise. Rather than searching, we keep a map in the schema from "a.b.c" to the right field ID. Expression binding should use an index like this and then build accessors using the actual path of field names.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
