[ 
https://issues.apache.org/jira/browse/DRILL-7479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995235#comment-16995235
 ] 

ASF GitHub Bot commented on DRILL-7479:
---------------------------------------

paul-rogers commented on issue #1923: DRILL-7479: Partial fixes for metadata 
parameterized type issues
URL: https://github.com/apache/drill/pull/1923#issuecomment-565248380
 
 
   @vvysotskyi, thanks for the explanation on the `FilterExpression.Visitor` 
class. I think your idea is right. The question is, how do we achieve 
extensibility? This is a very complex subject to discuss via written form, but 
I'll try. I'm sure I'll miss something obvious.
   
   Any metastore can add its own metadata fields (`min`, `max`, `myOwnMetric`), 
etc. When that happens, the rest of Drill should *not* change. So, adding a 
class to a class hierarchy that Drill must understand may not be the way to go. 
Instead, we have to identify the parts that are fixed by Drill and the parts 
that the specific metastore can extend.
   
   Let's take our list of metadata "fields." Drill defines `min` and `max`: I 
as a metastore author need to honor Drill's definition. (Or, I need to tell 
that I can't report `min` and `max`, but I can report, say, `NDV` because I'm 
Hive.) On the other hand, `myOwnMetric` is something I define for use, perhaps, 
in my own WizzBang store plugin.
   
   We are talking about a predicate visitor. For the code to compile in Drill, 
the visitor has to know about all possible predicates that can be visited; 
that's the only way to create the `visit(Foo)` methods in the base class.
   
   I like how you've categorized the predicates. (MUCH easier than using 
internal function names the way many plugins currently do.) (I might quibble 
and say that there are only four predicate types: Unary (IS NULL, IS TRUE, 
etc.), Binary (=, !=, etc.), Boolean (AND, OR) and List (IN).) Yes, Boolean 
predicates are binary, but they are "special" because, by definition, each of 
their two arguments is, itself, a predicate (or a Boolean literal.)
   
   So, since Drill defines the predicates (they come from the SQL query), the 
types should be known to Drill.
   
   For predicates derived from a SQL query, the argument types are also 
defined: Drill must have been able to parse them out of the SQL text and apply 
some type (at least Number, String, Boolean). And, predicates can be of mixed 
types: `10 = '10'` say. So, the predicate *result* is binary, but the arguments 
are of any type (even if they don't make sense as in my example.)
   
   What, then, can be extended? The underlying type of a column. (Maybe `INT96` 
in Parquet, or `IPAddress` in some security system.) My custom plugin has to 
convert between my values (`INT96`) and Drill's values. So, if we see `X = 
123456`, our custom code might know that `X` is a column in our DB that is of 
type `INT96`, so we might then try to convert the Drill-type (`INT`, say) of 
`123456` into the `INT96` type for our push-down filter.
   
   In stats, I might define `min` to be of my own specific `INT96` type. (That 
is, the `min` value is of the type of the actual data, not of the type that the 
reader will convert it to in Drill.)
   
   Drill's description of the min value must be generic, however. (Drill does 
not know about my type.) One way to do that is to add "meta-meta-data" that 
says that the value of this metadata field is my private `INT96` type. Only I 
know how to read those values; Drill treats them as `Object`s.
   
   If something wanted to do a query on my metadata, my "meta-meta-data" could 
provide a `toString` method to convert the `INT96` `min` value to something 
that Drill can display.
   
   Oh, one other topic to cover. Different systems provide different stats. 
Generally, a system provides NDV or a histogram (or nothing). This info exists, 
primarily, to compute selectivity for a filter. (How many values will `X = 10` 
match?) Where should this logic be? Generically in Drill? For equality:
   
   ```
     if (metadata == null) {
       sel = 0.15;
     } else if (metadata.get("histogram") != null) {
       sel = metadata.getHistorgramProbability(arg);
      } else if (metadata.get("ndv") != null) {
       sel = 1 / metadata.get("ndv");
     } ...
   ```
   
   Or, we could make this more generic: require that the metadata provide a 
`computeSelectivity(arg)` method that "does the right thing" for the system in 
question with the metadata available. (A special "dummy" metadata provider 
could provide Calcite's default 0.15 guess.)
   
   In short, this is a tricky topic. We have to think about what Drill does and 
what the extension does. Maybe we can move the discussion to another forum 
(maybe direct e-mail) so we don't spam the dev list. Then we can work through 
specific use cases so I can perhaps make more specific suggestions.
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Short-term fixes for metadata API parameterized type issues
> -----------------------------------------------------------
>
>                 Key: DRILL-7479
>                 URL: https://issues.apache.org/jira/browse/DRILL-7479
>             Project: Apache Drill
>          Issue Type: Task
>    Affects Versions: 1.17.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Blocker
>              Labels: ready-to-commit
>             Fix For: 1.17.0
>
>
> See DRILL-7480 for a discussion of the issues with how we currently use 
> parameterized types in the metadata API.
> This ticket is for short-term fixes that convert unsafe generic types of the 
> form {{StatisticsHolder}} to the form {{StatisticsHolder<?>}} so that the 
> compiler does not complain with many warnings (and a few Eclipse-only errors.)
> The topic should be revisited later in the context of DRILL-7480.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to