[
https://issues.apache.org/jira/browse/DRILL-7479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995235#comment-16995235
]
ASF GitHub Bot commented on DRILL-7479:
---------------------------------------
paul-rogers commented on issue #1923: DRILL-7479: Partial fixes for metadata
parameterized type issues
URL: https://github.com/apache/drill/pull/1923#issuecomment-565248380
@vvysotskyi, thanks for the explanation on the `FilterExpression.Visitor`
class. I think your idea is right. The question is, how do we achieve
extensibility? This is a very complex subject to discuss via written form, but
I'll try. I'm sure I'll miss something obvious.
Any metastore can add its own metadata fields (`min`, `max`, `myOwnMetric`),
etc. When that happens, the rest of Drill should *not* change. So, adding a
class to a class hierarchy that Drill must understand may not be the way to go.
Instead, we have to identify the parts that are fixed by Drill and the parts
that the specific metastore can extend.
Let's take our list of metadata "fields." Drill defines `min` and `max`: I
as a metastore author need to honor Drill's definition. (Or, I need to tell
that I can't report `min` and `max`, but I can report, say, `NDV` because I'm
Hive.) On the other hand, `myOwnMetric` is something I define for use, perhaps,
in my own WizzBang store plugin.
We are talking about a predicate visitor. For the code to compile in Drill,
the visitor has to know about all possible predicates that can be visited;
that's the only way to create the `visit(Foo)` methods in the base class.
I like how you've categorized the predicates. (MUCH easier than using
internal function names the way many plugins currently do.) (I might quibble
and say that there are only four predicate types: Unary (IS NULL, IS TRUE,
etc.), Binary (=, !=, etc.), Boolean (AND, OR) and List (IN).) Yes, Boolean
predicates are binary, but they are "special" because, by definition, each of
their two arguments is, itself, a predicate (or a Boolean literal.)
So, since Drill defines the predicates (they come from the SQL query), the
types should be known to Drill.
For predicates derived from a SQL query, the argument types are also
defined: Drill must have been able to parse them out of the SQL text and apply
some type (at least Number, String, Boolean). And, predicates can be of mixed
types: `10 = '10'` say. So, the predicate *result* is binary, but the arguments
are of any type (even if they don't make sense as in my example.)
What, then, can be extended? The underlying type of a column. (Maybe `INT96`
in Parquet, or `IPAddress` in some security system.) My custom plugin has to
convert between my values (`INT96`) and Drill's values. So, if we see `X =
123456`, our custom code might know that `X` is a column in our DB that is of
type `INT96`, so we might then try to convert the Drill-type (`INT`, say) of
`123456` into the `INT96` type for our push-down filter.
In stats, I might define `min` to be of my own specific `INT96` type. (That
is, the `min` value is of the type of the actual data, not of the type that the
reader will convert it to in Drill.)
Drill's description of the min value must be generic, however. (Drill does
not know about my type.) One way to do that is to add "meta-meta-data" that
says that the value of this metadata field is my private `INT96` type. Only I
know how to read those values; Drill treats them as `Object`s.
If something wanted to do a query on my metadata, my "meta-meta-data" could
provide a `toString` method to convert the `INT96` `min` value to something
that Drill can display.
Oh, one other topic to cover. Different systems provide different stats.
Generally, a system provides NDV or a histogram (or nothing). This info exists,
primarily, to compute selectivity for a filter. (How many values will `X = 10`
match?) Where should this logic be? Generically in Drill? For equality:
```
if (metadata == null) {
sel = 0.15;
} else if (metadata.get("histogram") != null) {
sel = metadata.getHistorgramProbability(arg);
} else if (metadata.get("ndv") != null) {
sel = 1 / metadata.get("ndv");
} ...
```
Or, we could make this more generic: require that the metadata provide a
`computeSelectivity(arg)` method that "does the right thing" for the system in
question with the metadata available. (A special "dummy" metadata provider
could provide Calcite's default 0.15 guess.)
In short, this is a tricky topic. We have to think about what Drill does and
what the extension does. Maybe we can move the discussion to another forum
(maybe direct e-mail) so we don't spam the dev list. Then we can work through
specific use cases so I can perhaps make more specific suggestions.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Short-term fixes for metadata API parameterized type issues
> -----------------------------------------------------------
>
> Key: DRILL-7479
> URL: https://issues.apache.org/jira/browse/DRILL-7479
> Project: Apache Drill
> Issue Type: Task
> Affects Versions: 1.17.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Blocker
> Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> See DRILL-7480 for a discussion of the issues with how we currently use
> parameterized types in the metadata API.
> This ticket is for short-term fixes that convert unsafe generic types of the
> form {{StatisticsHolder}} to the form {{StatisticsHolder<?>}} so that the
> compiler does not complain with many warnings (and a few Eclipse-only errors.)
> The topic should be revisited later in the context of DRILL-7480.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)