Hi Ted,

All good points. In the Jira ticket, I mentioned that using casts to infer type is a special case of a general problem: inferring column type information in a downward traversal of the parse tree, then refining that type information with a bottom-up traversal of the DAG at run time. It turns out this problem is similar to one the dynamic-language community is exploring: automatically converting dynamic types to static types. There is a paper reference in the ticket.
Our problem occurs, in general, with that bottom-up type inference. Because Drill is distributed, different branches of the tree may make different type inferences. (We discussed several examples.) We hope that higher nodes can reconcile conflicting decisions made by lower nodes, but we've noted cases where this is not possible. And we've noted cases where the scanner itself ends up making conflicting decisions (e.g., a long run of nulls followed by non-null data that reveals the type).

Further, schema is a property of the DATA, not the QUERY. But casts are a property of the query. So relying exclusively on a query to provide type information must be an incomplete and redundant solution. That said, if type information is available only in the query, we should certainly use it to resolve column type ambiguity. Cast is simply the most obvious source of inference. So better type inference is helpful, but neither necessary nor sufficient.

There is also the mismatch between SQL syntax and the desired simplicity of a hint. We want to provide a hint for only a troublesome column or two, but to mention one column in a SELECT clause, we must mention all of them; we can no longer use the wildcard (which, of course, shouldn't be used, but sure is handy during data exploration: the very situation where the type problem is most likely to occur).

Here it might be useful to recall the subtitle of the article: "Data governance and the death of schema on read." The author's point is that, in a production system, schemas cannot be ad hoc and loosey-goosey (my terms), but rather must be well understood and agreed upon, so that all parties agree on data structure and meaning. In Drill's world, this means that schemas (or hints) are discovered during data exploration, the results are captured in some form of metadata, and that metadata is then used by a large number of users sitting in Tableau getting their day-to-day work done.
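The scanner ambiguity mentioned above (a long run of nulls followed by data that reveals the type) can be sketched in a few lines. This is a hypothetical illustration, not Drill's actual scanner API; the guess of INT for an all-null batch mirrors Drill's historical default of nullable INT for untyped nulls.

```python
def infer_type(values):
    """Infer a column type from one batch of values (illustrative only)."""
    types = {type(v).__name__ for v in values if v is not None}
    if not types:
        return "NULL"          # all nulls: the type is still unknown
    if types == {"int"}:
        return "INT"
    if types == {"str"}:
        return "VARCHAR"
    return "MIXED"

def scan(batches):
    """Decide a column type batch-by-batch, as a scanner must."""
    decided = None
    for batch in batches:
        t = infer_type(batch)
        if t == "NULL":
            # Forced to guess before seeing data; assume nullable INT.
            decided = decided or "INT"
        elif decided is None:
            decided = t
        elif decided != t:
            # The earlier guess conflicts with the data we now see.
            raise TypeError(f"schema change: {decided} -> {t}")
    return decided

# A long run of nulls followed by strings triggers the conflict:
try:
    scan([[None, None], [None], ["a", "b"]])
except TypeError as e:
    print(e)   # schema change: INT -> VARCHAR
```

The point is that the conflict is baked in before any reconciliation can happen: by the time the non-null data arrives, downstream operators may already have been told the column is INT.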
In short, schema (and metadata in general) are needed to move from exploration into production. I personally am receptive to this idea because of the many years I spent building BI tools, in which schema information provided a much more robust and simpler user experience.

Thanks,
- Paul

On Sunday, April 8, 2018, 10:58:02 AM PDT, Ted Dunning <ted.dunn...@gmail.com> wrote:

I have been thinking about this email and I still don't understand some of the comments.

On Fri, Apr 6, 2018 at 5:13 PM, Aman Sinha <amansi...@apache.org> wrote:

> On the subject of CAST pushdown to Scans, there are potential drawbacks
> ...
>
> - In general, the planner will see a Scan-Project where the Project has
> CAST functions. But the Project can have arbitrary expressions, e.g.
> CAST(a as INT) * 5, or a combination of 2 CAST functions, or non-CAST
> functions, etc. It would be quite expensive to examine each expression
> (there could be hundreds) to determine whether it is eligible to be
> pushed to the Scan.

How is this different from filter and project pushdown? There could be hundreds of those, and it could be difficult for Calcite to find appropriate pushdowns. But I have never heard of any problem.

The reasons that I think cast pushdown would be much easier include:

- for a first approximation, no type inference would be needed.
- because of the first point, only the roots of arithmetic expressions would need to be examined. If they have casts, then pushdown should be tried. If not, don't do it.
- cast pushdown is always a win if supported, so there is no large increase in the complexity of the cost-based optimization search space.
- the traversal of all expressions is already required and already done in order to find the set of columns that are being extracted. As such, cast pushdown can be done in the same motions as project pushdown.

> - Expressing nullability is not possible with CAST. If a column should
> be tagged as (not) nullable, CAST syntax does not allow that.
This may be true. But nullability crosses the cast cleanly. Thus, filter expressions like [x is not NULL] can be used to constrain nullability, and there is no requirement that the two constraints (the cast and the nullability) be near each other syntactically. Furthermore, if the query does not specify nullability, then the scanner is free to do so.

> - Drill currently supports CASTing to a SQL data type, but not to
> complex types such as arrays and maps. We would have to add support for
> that from a language perspective as well as the run-time. This would be
> a non-trivial effort.

Well, there is a trivial subset of this effort in that casting a.b.c is easy to express. Anything more complex is hard for current scanners to use anyway, so deferring most of the work on complex types is a fine solution. It isn't as if SQL has nice syntax for casting much of anything.
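Ted's claim that only expression roots need examination can be sketched concretely. The class names (Col, Cast, BinOp) and the root-only rule below are invented for illustration; they are not Calcite's or Drill's planner types, just a minimal model of "a cast is pushable when it sits at the root of a projected expression and wraps a plain column reference."

```python
from dataclasses import dataclass

@dataclass
class Col:            # a column reference, e.g. a or a.b.c
    name: str

@dataclass
class Cast:           # CAST(child AS to_type)
    child: object
    to_type: str

@dataclass
class BinOp:          # e.g. CAST(a AS INT) * 5
    op: str
    left: object
    right: object

def pushable_casts(projects):
    """Return (column, type) pairs for casts found at expression roots.

    Note: no recursion into arithmetic is needed -- we look only at the
    root of each projected expression, which is why this check is cheap.
    """
    found = []
    for expr in projects:
        if isinstance(expr, Cast) and isinstance(expr.child, Col):
            found.append((expr.child.name, expr.to_type))
    return found

# Projections: CAST(a AS INT), CAST(b AS INT) * 5, c
exprs = [Cast(Col("a"), "INT"),
         BinOp("*", Cast(Col("b"), "INT"), 5),
         Col("c")]
print(pushable_casts(exprs))   # [('a', 'INT')]
```

Under this model, the cast buried inside the multiplication is simply skipped rather than analyzed, which matches the "if not, don't do it" part of the argument: correctness is preserved because skipping a pushdown is always safe.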