[ 
https://issues.apache.org/jira/browse/DRILL-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997237#comment-16997237
 ] 

Vova Vysotskyi commented on DRILL-7480:
---------------------------------------

By [~Paul.Rogers]:
{quote}
Another commit to fix the remaining warnings in metastore.

Just to summarize based on what I've seen, there are two use cases:

- Generic: for example, take two columns, get their min values generically, 
combine them generically. Compile-time typing is a hindrance here; we need 
run-time typing.
- Specific: compute or obtain a specific value of a specific type. The NDV, 
say, to be used in estimating selectivity.
It is easy to go from specific to generic:

  int foo = ...;
  metadata.setValue(foo);
...
  public void setValue(Object value) {
...
It is harder to go the other way. If we know the value is an int:

  int foo = metadata.getInt();
  // Or, less ideally:
  assert metadata.getType() == MinorType.INT;
  int foo = (Integer) metadata.getValue();
Here getInt() just does in a method what the second example does inline.

To be type-aware across all types:

  switch(metadata.getType()) {
    case INT:
      int intFoo = (Integer) metadata.getValue();
      ...
    case VARCHAR:
      String strFoo = (String) metadata.getValue();
    ...
The above remains true if we introduce, say, a ValueHolder<T> that holds a value of 
type T. Instead of casting the value, we'd cast the value holder. There is no 
escaping the issue.
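To make the generic/specific duality concrete, here is a minimal sketch of a 
non-parameterized holder with a run-time type tag. The names (TypedValue, Kind) 
are hypothetical, not existing Drill classes:

{code:java}
// Hypothetical sketch: a non-parameterized holder with a run-time type tag.
// Generic callers use getValue(); specific callers use the typed getters.
public class TypedValue {
  public enum Kind { INT, VARCHAR }

  private final Kind kind;
  private final Object value;

  public TypedValue(Kind kind, Object value) {
    this.kind = kind;
    this.value = value;
  }

  public Kind kind() { return kind; }

  // Generic access: the caller treats the value as an opaque Object.
  public Object getValue() { return value; }

  // Specific access: the caller asserts the expected type.
  public int getInt() {
    if (kind != Kind.INT) {
      throw new IllegalStateException("Not an INT: " + kind);
    }
    return (Integer) value;
  }

  public String getString() {
    if (kind != Kind.VARCHAR) {
      throw new IllegalStateException("Not a VARCHAR: " + kind);
    }
    return (String) value;
  }
}
{code}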

Point is: we need to understand the use cases so we can more precisely refine 
the metadata data model.

Some examples:

- The pending filter push-down PR: uses a value holder with a (type, value) pair.
- The Impala solution: has nodes that represent literals, with subclasses for 
various kinds of literal: IntLiteral, StringLiteral, etc.
- The JDBC approach: a separate schema that tells us that when foo appears, it is 
an int, so we should use getInt(colIndex).
- The ColumnMetadata (and HOCON config) approach: store values as strings, parse 
to the desired type on demand.
Perhaps one of these (or, more likely, some combination) might help us here.
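The Impala-style option above might look roughly like the following sketch, with 
a subclass per literal type; class names here are illustrative, not Impala's 
actual classes:

{code:java}
// Hypothetical sketch of literal nodes with one subclass per type.
public abstract class Literal {
  public abstract String typeName();

  public static class IntLiteral extends Literal {
    public final int value;
    public IntLiteral(int value) { this.value = value; }
    @Override public String typeName() { return "INT"; }
  }

  public static class StringLiteral extends Literal {
    public final String value;
    public StringLiteral(String value) { this.value = value; }
    @Override public String typeName() { return "VARCHAR"; }
  }
}
{code}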
{quote}

{quote}
Any metastore can add its own metadata fields (min, max, myOwnMetric), etc. 
When that happens, the rest of Drill should not change. So, adding a class to a 
class hierarchy that Drill must understand may not be the way to go. Instead, 
we have to identify the parts that are fixed by Drill and the parts that the 
specific metastore can extend.

Let's take our list of metadata "fields." Drill defines min and max: I as a 
metastore author need to honor Drill's definition. (Or, I need to say that I 
can't report min and max, but I can report, say, NDV because I'm Hive.) On the 
other hand, myOwnMetric is something I define for use, perhaps, in my own 
WizzBang store plugin.

We are talking about a predicate visitor. For the code to compile in Drill, the 
visitor has to know about all possible predicates that can be visited; that's 
the only way to create the visit(Foo) methods in the base class.

I like how you've categorized the predicates. (MUCH easier than using internal 
function names the way many plugins currently do.) (I might quibble and say 
that there are only four predicate types: Unary (IS NULL, IS TRUE, etc.), 
Binary (=, !=, etc.), Boolean (AND, OR) and List (IN).) Yes, Boolean predicates 
are binary, but they are "special" because, by definition, each of their two 
arguments is, itself, a predicate (or a Boolean literal.)

So, since Drill defines the predicates (they come from the SQL query), the 
types should be known to Drill.
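Since the predicate set is closed, the visitor can enumerate it. A minimal 
sketch of the four categories (Unary, Binary, Boolean, List) with a classic 
visitor; all class names are illustrative, not Drill's actual classes:

{code:java}
// Hypothetical sketch: Drill defines the predicates, so the visitor
// interface can name every visit method at compile time.
public abstract class Predicate {
  public abstract <R> R accept(Visitor<R> v);

  public interface Visitor<R> {
    R visitUnary(Unary p);      // IS NULL, IS TRUE, ...
    R visitBinary(Binary p);    // =, !=, <, ...
    R visitBoolean(Bool p);     // AND, OR: children are themselves predicates
    R visitList(InList p);      // IN
  }

  public static class Unary extends Predicate {
    public final String op; public final String column;
    public Unary(String op, String column) { this.op = op; this.column = column; }
    @Override public <R> R accept(Visitor<R> v) { return v.visitUnary(this); }
  }

  public static class Binary extends Predicate {
    public final String op; public final String column; public final Object value;
    public Binary(String op, String column, Object value) {
      this.op = op; this.column = column; this.value = value;
    }
    @Override public <R> R accept(Visitor<R> v) { return v.visitBinary(this); }
  }

  public static class Bool extends Predicate {
    public final String op; public final Predicate left; public final Predicate right;
    public Bool(String op, Predicate left, Predicate right) {
      this.op = op; this.left = left; this.right = right;
    }
    @Override public <R> R accept(Visitor<R> v) { return v.visitBoolean(this); }
  }

  public static class InList extends Predicate {
    public final String column; public final java.util.List<Object> values;
    public InList(String column, java.util.List<Object> values) {
      this.column = column; this.values = values;
    }
    @Override public <R> R accept(Visitor<R> v) { return v.visitList(this); }
  }
}
{code}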

For predicates derived from a SQL query, the argument types are also defined: 
Drill must have been able to parse them out of the SQL text and apply some type 
(at least Number, String, Boolean). And, predicates can be of mixed types: 10 = 
'10' say. So, the predicate result is binary, but the arguments are of any type 
(even if they don't make sense as in my example.)

What, then, can be extended? The underlying type of a column. (Maybe INT96 in 
Parquet, or IPAddress in some security system.) My custom plugin has to convert 
between my values (INT96) and Drill's values. So, if we see X = 123456, our 
custom code might know that X is a column in our DB that is of type INT96, so 
we might then try to convert the Drill-type (INT, say) of 123456 into the INT96 
type for our push-down filter.
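Purely as illustration, such a conversion could widen the Drill-typed literal 
into the extension's representation. The byte layout below (little-endian, 
zero-extended to 12 bytes) is an assumption for the sketch, not Parquet's 
actual INT96 encoding:

{code:java}
// Illustrative only: widen a Drill INT literal into a 12-byte value for a
// hypothetical INT96 push-down filter. Layout is assumed, not Parquet's.
public class Int96Converter {
  public static byte[] fromInt(int value) {
    byte[] out = new byte[12];
    for (int i = 0; i < 4; i++) {
      out[i] = (byte) ((value >>> (8 * i)) & 0xFF);  // little-endian low word
    }
    return out;  // upper 8 bytes stay zero
  }
}
{code}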

In stats, I might define min to be of my own specific INT96 type. (That is, the 
min value is of the type of the actual data, not of the type that the reader 
will convert it to in Drill.)

Drill's description of the min value must be generic, however. (Drill does not 
know about my type.) One way to do that is to add "meta-meta-data" that says 
that the value of this metadata field is my private INT96 type. Only I know how 
to read those values; Drill treats them as Objects.

If something wanted to do a query on my metadata, my "meta-meta-data" could 
provide a toString method to convert the INT96 min value to something that 
Drill can display.
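A sketch of that "meta-meta-data" idea: a descriptor that tells Drill how to 
display a value whose type only the extension understands. Every name below is 
hypothetical; nothing here is an existing Drill API:

{code:java}
// Hypothetical descriptor for an extension-defined metadata field. Drill
// holds the value as an Object; only the descriptor knows its format.
public interface MetadataFieldDescriptor {
  String name();                        // e.g. "min"
  String typeName();                    // e.g. "INT96", opaque to Drill
  String toDisplayString(Object value); // render the opaque value for display

  // Example extension implementation: the min value is 12 raw bytes.
  class Int96MinDescriptor implements MetadataFieldDescriptor {
    @Override public String name() { return "min"; }
    @Override public String typeName() { return "INT96"; }

    @Override
    public String toDisplayString(Object value) {
      byte[] bytes = (byte[]) value;    // only this plugin knows the layout
      StringBuilder sb = new StringBuilder("0x");
      for (byte b : bytes) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    }
  }
}
{code}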

Oh, one other topic to cover. Different systems provide different stats. 
Generally, a system provides NDV or a histogram (or nothing). This info exists, 
primarily, to compute selectivity for a filter. (How many values will X = 10 
match?) Where should this logic be? Generically in Drill? For equality:

  if (metadata == null) {
    sel = 0.15;
  } else if (metadata.get("histogram") != null) {
    sel = metadata.getHistogramProbability(arg);
  } else if (metadata.get("ndv") != null) {
    sel = 1.0 / metadata.get("ndv");
  } ...
Or, we could make this more generic: require that the metadata provide a 
computeSelectivity(arg) method that "does the right thing" for the system in 
question with the metadata available. (A special "dummy" metadata provider 
could provide Calcite's default 0.15 guess.)
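The "metadata computes its own selectivity" option could be sketched like this; 
the interface and class names are illustrative, not Drill APIs:

{code:java}
// Hypothetical sketch: each metadata provider "does the right thing" with
// whatever stats it has; a dummy provider supplies the default guess.
public interface SelectivityProvider {
  double computeSelectivity(Object arg);

  // Uses NDV when that is all the system reports: equality matches ~1/NDV rows.
  class NdvProvider implements SelectivityProvider {
    private final long ndv;
    public NdvProvider(long ndv) { this.ndv = ndv; }
    @Override public double computeSelectivity(Object arg) { return 1.0 / ndv; }
  }

  // "Dummy" provider for systems with no stats: Calcite's default 0.15 guess.
  class DefaultProvider implements SelectivityProvider {
    @Override public double computeSelectivity(Object arg) { return 0.15; }
  }
}
{code}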

In short, this is a tricky topic. We have to think about what Drill does and 
what the extension does. Maybe we can move the discussion to another forum 
(maybe direct e-mail) so we don't spam the dev list. Then we can work through 
specific use cases so I can perhaps make more specific suggestions.
{quote}

> Revisit parameterized type design for Metadata API
> --------------------------------------------------
>
>                 Key: DRILL-7480
>                 URL: https://issues.apache.org/jira/browse/DRILL-7480
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.16.0
>            Reporter: Paul Rogers
>            Priority: Major
>
> Grabbed latest master and found that the code will not build in Eclipse due 
> to a type mismatch in the statistics code. Specifically, the problem is that 
> we have several parameterized classes, but we often omit the parameters. 
> Evidently, doing so is fine for some compilers, but is an error in Eclipse.
> Then, while fixing the immediate issue, I found an opposite problem: code 
> that would satisfy Eclipse, but which failed in the Maven build.
> I spent time making another pass through the metadata code to add type 
> parameters, remove "rawtypes" ignores and so on. See DRILL-7479.
> Stepping back a bit, it seems that we are perhaps using the type parameters 
> in a way that does not serve our needs in this particular case.
> We have many classes that hold onto particular values of some type, such as 
> {{StatisticsHolder}}, which can hold a String, a Double, etc. So, we 
> parameterize.
> But, after that, we treat the items generically. We don't care that {{foo}} 
> is a {{StatisticsHolder<String>}} and {{bar}} is 
> {{StatisticsHolder<Double>}}, we just want to create, combine and work with 
> lists of statistics.
> The same is true in several other places such as column type, comparator 
> type, etc. For comparators, we don't really care what type they compare, we 
> just want, given two generic {{StatisticsHolder}}s, to get the corresponding 
> comparator.
> This is very similar to the situation with the "column accessors" in EVF: 
> each column is a {{VARCHAR}} or a {{FLOAT8}}, but most code just treats them 
> generically. So, the type-ness of the value was treated as a runtime 
> attribute, not a compile-time attribute.
> This is a subtle point. Most code in Drill does not work with types directly 
> in Java code. Instead, Drill is an interpreter: it works with generic objects 
> which, at run time, resolve to actual typed objects. It is the difference 
> between writing an application (directly uses types) and writing a language 
> (generically works with all types.)
> For example, a {{StatisticsHolder}} probably only needs to be type-aware at 
> the moment it is populated or used, but not in all the generic column-level 
> and table level code. (The same is true of properties in the column metadata 
> class, as an example.)
> IMHO, {{StatisticsHolder}} probably wants to be a non-parameterized class. It 
> should have a declaration object that, say, provides the name, type, 
> comparator, and other metadata. When the actual value is needed, a typed 
> getter can be provided:
> {code:java}
> <T> T getValue();
> {code}
> As it is, the type system is very complex but we get no value. Since it is so 
> complex, the code just punted and sprinkled raw types and ignores in many 
> places, which defeats the purpose of parameterized types anyway.
> Suggestion: let's revisit this work after the upcoming release and see if we 
> can simplify it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
