Hi Ted,

All good points. In the Jira ticket, I mentioned that using casts to infer type 
is a special case of a general problem: inferring column type information in a 
downward traversal of the parse tree, then refining that type information with 
a bottom-up traversal of the DAG at run time. Turns out this problem is similar 
to one that the dynamic language people are exploring to convert dynamic types 
to static types automatically. There is a paper reference in the ticket.

Our problem occurs, in general, with that bottom-up type inference. Because 
Drill is distributed, different branches of the tree may make different type 
inferences. (We discussed several examples.) We hope that higher nodes can 
reconcile conflicting decisions made by lower nodes, but we've noted cases 
where this is not possible. And, we've noted cases where the scanner itself 
ends up making conflicting decisions (e.g. a long run of nulls followed by 
non-null data that reveals the type).
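
A made-up sketch of that scanner case (the file path and data are 
hypothetical):

    -- Suppose file1.json holds {"a": null} for the first 100,000 records,
    -- then {"a": "a string at last"} afterwards. The scanner must pick a
    -- type for `a` (typically nullable INT) long before it sees a non-null
    -- value, then hits data that contradicts that choice.
    SELECT a FROM dfs.`/exploration/file1.json`;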

Further, schema is a property of the DATA, not the QUERY. But casts are a 
property of the query. So, relying exclusively on the query to provide type 
information must be an incomplete solution (some queries carry no casts) and a 
redundant one (the same hints must be repeated in every query). That said, if 
type information is available only in the query, we should certainly use it to 
resolve column type ambiguity. Cast is simply the most obvious source of 
inference. So, better type inference is helpful, but neither necessary nor 
sufficient.

There is also the mismatch between SQL syntax and the desired simplicity of a 
hint. We want to provide a hint only for a troublesome column or two, but to 
mention one column in a SELECT clause, we must mention all of them; we can no 
longer use the wildcard (which, of course, shouldn't be used, but sure is 
handy during data exploration: the very situation where the type problem is 
most likely to occur).
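
For instance (the file path and column names are made up):

    -- what we'd like to keep using during exploration:
    SELECT * FROM dfs.`/exploration/events.json`;

    -- but to hint just the one troublesome column b, we must abandon the
    -- wildcard and spell out every column we want back:
    SELECT a, CAST(b AS DOUBLE) AS b, c, d
    FROM dfs.`/exploration/events.json`;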

Here it might be useful to recall the subtitle of the article: "Data governance 
and the death of schema on read." The author's point is that, in a production 
system, schemas cannot be ad-hoc and loosey-goosey (my terms), but rather must 
be well understood and agreed upon so that all parties share the same view of 
data structure and meaning. In Drill's world, this means that schemas (or 
hints) are discovered during data exploration, the results are captured in some 
form of metadata, and that metadata is then used by a large number of users 
sitting in Tableau getting their day-to-day work done. In short, schema (and 
metadata in general) are needed to move from exploration into production. I 
personally am receptive to this idea because of the many years I spent building 
BI tools, in which schema information provided a much more robust and simple 
user experience.

Thanks,

- Paul

 

    On Sunday, April 8, 2018, 10:58:02 AM PDT, Ted Dunning 
<ted.dunn...@gmail.com> wrote:
 I have been thinking about this email and I still don't understand some of
the comments.

On Fri, Apr 6, 2018 at 5:13 PM, Aman Sinha <amansi...@apache.org> wrote:

> On the subject of CAST pushdown to Scans, there are potential drawbacks
> ...
>
>    - In general, the planner will see a Scan-Project where the Project has
>    CAST functions.  But the Project can have arbitrary expressions, e.g.
>    CAST(a as INT) * 5, or a combination of 2 CAST functions, or non-CAST
>    functions, etc.  It would be quite expensive to examine each expression
>    (there could be hundreds) to determine whether it is eligible to be
>    pushed to the Scan.
>

How is this different from filter and project pushdown? There could be
hundreds of those, and it could be difficult for Calcite to find appropriate
pushdowns, but I have never heard of that being a problem.

The reasons that I think that cast pushdown would be much easier include:

- for a first approximation, no type inference would be needed.

- because of the first point, only the roots of arithmetic expressions
would need to be examined. If they have casts, then pushdown should be
tried. If not, don't do it.

- cast pushdown is always a win if supported so there is no large increase
in the complexity of the cost-based optimization search space.

- the traversal of all expressions is already required and already done in
order to find the set of columns that are being extracted. As such, cast
pushdown can be done in the same motions as project pushdown.
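
For instance (the table name is illustrative):

    -- root of the projected expression is a cast: try the pushdown
    SELECT CAST(a AS INT) FROM dfs.`t.json`;

    -- root is the multiply, not a cast: skip it in a first cut, even
    -- though a cast appears lower in the expression tree
    SELECT CAST(a AS INT) * 5 FROM dfs.`t.json`;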


>    - Expressing Nullability is not possible with CAST.  If a column should
>    be tagged as  (not)nullable, CAST syntax does not allow that.
>

This may be true. But nullability crosses the cast cleanly. Thus, filter
expressions like [x is not NULL] can be used to constrain nullability, and
there is no requirement that the two constraints (the cast and the
nullability) be near each other syntactically. Furthermore, if the query does
not specify nullability, then the scanner is free to do so.
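
Concretely (the names are illustrative):

    -- the cast fixes the type; the filter, far away syntactically,
    -- constrains nullability. If the query says nothing about nulls,
    -- the scanner decides.
    SELECT CAST(a AS INT)
    FROM dfs.`t.json`
    WHERE a IS NOT NULL;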


>    - Drill currently supports CASTing to a SQL data type, but not to the
>    complex types such as arrays and maps.  We would have to add support for
>    that from a language perspective as well as the run-time.  This would be
>    non-trivial effort.
>

Well, there is a trivial subset of this effort in that casting a.b.c is
easy to express. Anything more complex is hard for current scanners to use
anyway.
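
For example (the names are illustrative):

    -- casting a nested scalar field is already easy to write
    SELECT CAST(t.a.b.c AS VARCHAR) FROM dfs.`t.json` t;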

So deferring most of the work on complex types is a fine solution. It isn't
like SQL has nice syntax for casting anything, anyway.
