Paul Rogers commented on DRILL-6312:

While type inference (using Cast and other hints) is a very good idea, it 
cannot be the full answer. Here is why:

* The only way to express a type is to include the column in a SELECT clause. 
If a column is not projected, no hint can be provided, and we can end up with 
possible read-time problems as discussed in the original e-mail thread ("Death 
of Schema on Read").
* The only way to express the type of a column is to explicitly include it in 
the SELECT clause. Using a wildcard ("*") query will bypass the type rules 
unless there is a view underneath that applies the rules.
* There is no way to type just the pesky, troublesome columns, leaving the 
others to be detected automatically. If we must use a view, and we have to, 
say, use a cast for column x, then we have to include all other columns in the 
SELECT clause or we end up projecting only x.

For this reason, the cast idea, though elegant, and a very good enhancement, 
cannot be the full answer, It will reduce the number of cases where type 
ambiguity occurs, but it is not a general-purpose solution.

A general-purpose solution would be to provide some means to explicitly apply 
type information. For example, in a view or query, provide explicit hint syntax:

SELECT * FROM myFunkyTable

The hints say that, if fields "f", "m.x" and "a" appear, they are of the type 
specified. If the fields don't appear, just ignore the hints.

Most systems put this information in metadata, but Drill is very hostile to 
metadata, so it must be in the query (or, equivalently, a view.)

> Enable pushing of cast expressions to the scanner for better schema discovery.
> ------------------------------------------------------------------------------
>                 Key: DRILL-6312
>                 URL: https://issues.apache.org/jira/browse/DRILL-6312
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators, Query Planning & 
> Optimization
>    Affects Versions: 1.13.0
>            Reporter: Hanumath Rao Maduri
>            Priority: Major
> Drill is a schema less engine which tries to infer the schema from disparate 
> sources at the read time. Currently the scanners infer the schema for each 
> batch depending upon the data for that column in the corresponding batch. 
> This solves many uses cases but can error out when the data is too different 
> between batches like int and array[int] etc... (There are other cases as well 
> but just to give one example).
> There is also a mechanism to create a view by type casting the columns to 
> appropriate type. This solves issues in some cases but fails in many other 
> cases. This is due to the fact that cast expression is not being pushed down 
> to the scanner but staying at the project or filter etc operators up the 
> query plan.
> This JIRA is to fix this by propagating the type information embedded in the 
> cast function to the scanners so that scanners can cast the incoming data 
> appropriately.

This message was sent by Atlassian JIRA

Reply via email to