[jira] [Comment Edited] (DRILL-6312) Enable pushing of cast expressions to the scanner for better schema discovery.

Paul Rogers (JIRA) Sat, 07 Apr 2018 22:51:19 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429634#comment-16429634
 ]


Paul Rogers edited comment on DRILL-6312 at 4/8/18 5:50 AM:
------------------------------------------------------------

While type inference (using Cast and other hints) is a very good idea, it 
cannot be the full answer. Here is why:

* The only way to express a type is to include the column in a SELECT clause. 
If a column is not projected, no hint can be provided, and we can end up with 
possible read-time problems as discussed in the original e-mail thread ("Death 
of Schema on Read").
* The only way to express the type of a column is to explicitly include it in 
the SELECT clause. Using a wildcard ("*") query will bypass the type rules 
unless there is a view underneath that applies the rules.
* There is no way to type just the pesky, troublesome columns, leaving the 
others to be detected automatically. If we must use a view, and we have to, 
say, use a cast for column x, then we have to include all other columns in the 
SELECT clause or we end up projecting only x. We can't use a wildcard for the 
other columns.
* Putting the type information in the query puts the burden on the query writer 
(and, ultimately, something like Tableau.) But, the schema is a property of the 
data, not the query, so this is not good model of reality.

For this reason, the cast idea, though elegant, and a very good enhancement, 
cannot be the full answer, It will reduce the number of cases where type 
ambiguity occurs, but it is not a general-purpose solution.

A general-purpose solution would be to provide some means to explicitly apply 
type information. For example, in a view or query, provide explicit hint syntax:

{noformat}
SELECT * FROM myFunkyTable
  WITH HINTS (f: INT, m.x: BIGINT NOT NULL,  a[]: VARCHAR NULL)
{noformat}

The hints say that, if fields "f", "m.x" and "a" appear, they are of the type 
specified. If the fields don't appear, just ignore the hints.

Most systems put this information in metadata, but Drill is very hostile to 
metadata, so it must be in the query (or, equivalently, a view.)

Lore has it that the early Drill designers proposed a ".drill" file to hold 
schema information. In this case, schema information would be an add-on file, 
much as views are. As proposed in the e-mail thread, perhaps both forms of 
information can be combined in a single file.


was (Author: paul-rogers):
While type inference (using Cast and other hints) is a very good idea, it 
cannot be the full answer. Here is why:

* The only way to express a type is to include the column in a SELECT clause. 
If a column is not projected, no hint can be provided, and we can end up with 
possible read-time problems as discussed in the original e-mail thread ("Death 
of Schema on Read").
* The only way to express the type of a column is to explicitly include it in 
the SELECT clause. Using a wildcard ("*") query will bypass the type rules 
unless there is a view underneath that applies the rules.
* There is no way to type just the pesky, troublesome columns, leaving the 
others to be detected automatically. If we must use a view, and we have to, 
say, use a cast for column x, then we have to include all other columns in the 
SELECT clause or we end up projecting only x. We can't use a wildcard for the 
other columns.
* Putting the type information in the query puts the burden on the query writer 
(and, ultimately, something like Tableau.) But, the schema is a property of the 
data, not the query, so this is not good model of reality.

For this reason, the cast idea, though elegant, and a very good enhancement, 
cannot be the full answer, It will reduce the number of cases where type 
ambiguity occurs, but it is not a general-purpose solution.

A general-purpose solution would be to provide some means to explicitly apply 
type information. For example, in a view or query, provide explicit hint syntax:

{noformat}
SELECT * FROM myFunkyTable
  WITH HINTS (f: INT, m.x: BIGINT NOT NULL,  a[]: VARCHAR NULL)
{noformat}

The hints say that, if fields "f", "m.x" and "a" appear, they are of the type 
specified. If the fields don't appear, just ignore the hints.

Most systems put this information in metadata, but Drill is very hostile to 
metadata, so it must be in the query (or, equivalently, a view.)

> Enable pushing of cast expressions to the scanner for better schema discovery.
> ------------------------------------------------------------------------------
>
>                 Key: DRILL-6312
>                 URL: https://issues.apache.org/jira/browse/DRILL-6312
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators, Query Planning &amp; 
> Optimization
>    Affects Versions: 1.13.0
>            Reporter: Hanumath Rao Maduri
>            Priority: Major
>
> Drill is a schema less engine which tries to infer the schema from disparate 
> sources at the read time. Currently the scanners infer the schema for each 
> batch depending upon the data for that column in the corresponding batch. 
> This solves many uses cases but can error out when the data is too different 
> between batches like int and array[int] etc... (There are other cases as well 
> but just to give one example).
> There is also a mechanism to create a view by type casting the columns to 
> appropriate type. This solves issues in some cases but fails in many other 
> cases. This is due to the fact that cast expression is not being pushed down 
> to the scanner but staying at the project or filter etc operators up the 
> query plan.
> This JIRA is to fix this by propagating the type information embedded in the 
> cast function to the scanners so that scanners can cast the incoming data 
> appropriately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Comment Edited] (DRILL-6312) Enable pushing of cast expressions to the scanner for better schema discovery.

Reply via email to