paul-rogers commented on issue #1923: DRILL-7479: Partial fixes for metadata parameterized type issues URL: https://github.com/apache/drill/pull/1923#issuecomment-565248380 @vvysotskyi, thanks for the explanation on the `FilterExpression.Visitor` class. I think your idea is right. The question is, how do we achieve extensibility? This is a very complex subject to discuss via written form, but I'll try. I'm sure I'll miss something obvious. Any metastore can add its own metadata fields (`min`, `max`, `myOwnMetric`), etc. When that happens, the rest of Drill should *not* change. So, adding a class to a class hierarchy that Drill must understand may not be the way to go. Instead, we have to identify the parts that are fixed by Drill and the parts that the specific metastore can extend. Let's take our list of metadata "fields." Drill defines `min` and `max`: I as a metastore author need to honor Drill's definition. (Or, I need to tell that I can't report `min` and `max`, but I can report, say, `NDV` because I'm Hive.) On the other hand, `myOwnMetric` is something I define for use, perhaps, in my own WizzBang store plugin. We are talking about a predicate visitor. For the code to compile in Drill, the visitor has to know about all possible predicates that can be visited; that's the only way to create the `visit(Foo)` methods in the base class. I like how you've categorized the predicates. (MUCH easier than using internal function names the way many plugins currently do.) (I might quibble and say that there are only four predicate types: Unary (IS NULL, IS TRUE, etc.), Binary (=, !=, etc.), Boolean (AND, OR) and List (IN).) Yes, Boolean predicates are binary, but they are "special" because, by definition, each of their two arguments is, itself, a predicate (or a Boolean literal.) So, since Drill defines the predicates (they come from the SQL query), the types should be known to Drill. For predicates derived from a SQL query, the argument types are also defined: Drill must have been able to parse them out of the SQL text and apply some type (at least Number, String, Boolean). And, predicates can be of mixed types: `10 = '10'` say. So, the predicate *result* is binary, but the arguments are of any type (even if they don't make sense as in my example.) What, then, can be extended? The underlying type of a column. (Maybe `INT96` in Parquet, or `IPAddress` in some security system.) My custom plugin has to convert between my values (`INT96`) and Drill's values. So, if we see `X = 123456`, our custom code might know that `X` is a column in our DB that is of type `INT96`, so we might then try to convert the Drill-type (`INT`, say) of `123456` into the `INT96` type for our push-down filter. In stats, I might define `min` to be of my own specific `INT96` type. (That is, the `min` value is of the type of the actual data, not of the type that the reader will convert it to in Drill.) Drill's description of the min value must be generic, however. (Drill does not know about my type.) One way to do that is to add "meta-meta-data" that says that the value of this metadata field is my private `INT96` type. Only I know how to read those values; Drill treats them as `Object`s. If something wanted to do a query on my metadata, my "meta-meta-data" could provide a `toString` method to convert the `INT96` `min` value to something that Drill can display. Oh, one other topic to cover. Different systems provide different stats. Generally, a system provides NDV or a histogram (or nothing). This info exists, primarily, to compute selectivity for a filter. (How many values will `X = 10` match?) Where should this logic be? Generically in Drill? For equality: ``` if (metadata == null) { sel = 0.15; } else if (metadata.get("histogram") != null) { sel = metadata.getHistorgramProbability(arg); } else if (metadata.get("ndv") != null) { sel = 1 / metadata.get("ndv"); } ... ``` Or, we could make this more generic: require that the metadata provide a `computeSelectivity(arg)` method that "does the right thing" for the system in question with the metadata available. (A special "dummy" metadata provider could provide Calcite's default 0.15 guess.) In short, this is a tricky topic. We have to think about what Drill does and what the extension does. Maybe we can move the discussion to another forum (maybe direct e-mail) so we don't spam the dev list. Then we can work through specific use cases so I can perhaps make more specific suggestions.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
