[jira] [Commented] (DRILL-6312) Enable pushing of cast expressions to the scanner for better schema discovery.

Paul Rogers (JIRA) Sat, 07 Apr 2018 22:38:13 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429632#comment-16429632
 ]


Paul Rogers commented on DRILL-6312:
------------------------------------

The idea of using the cast statement came from [~tdunning], and is a very good 
one.

The idea can be generalized using ideas from [this 
paper|https://blog.acolyer.org/2015/08/03/towards-practical-gradual-typing/]. 
Cast is just a special case of a more general idea: top-down, then bottom-up 
typing. Drill already implements bottom-up typing: Drill starts with columns, 
then selects the overloaded versions of functions based on argument types, and 
eventually arrives at the type of each column in the result set.

For example, if we have an expression {{a + b}}, the reader will figure out the 
types of {{a}} and {{b}}.  Perhaps {{a}} is an {{INT}} and {{b}} is a 
{{Float8}}. Through type inference, Drill will find a version of the {{add}} 
function that takes two {{Float8}} arguments. Next, Drill will infer that it 
can convert an {{INT}} to a {{Float8}}.
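
As a rough illustration of that bottom-up step, here is a small Python sketch. 
It is not Drill code: the widening order, the overload table and the 
{{resolve_add}} helper are all made up for the example.

```python
# Hypothetical widening order, narrowest first (not Drill's real precedence table).
PRECEDENCE = ["INT", "BIGINT", "FLOAT8"]

# Hypothetical overloads of add: argument types -> return type.
ADD_OVERLOADS = {
    ("INT", "INT"): "INT",
    ("BIGINT", "BIGINT"): "BIGINT",
    ("FLOAT8", "FLOAT8"): "FLOAT8",
}


def resolve_add(left, right):
    """Bottom-up: given the column types, pick an overload and the implicit casts."""
    # Use the wider of the two argument types as the common type.
    target = max(left, right, key=PRECEDENCE.index)
    result = ADD_OVERLOADS[(target, target)]
    # Any argument that is not already the common type needs an implicit cast.
    casts = [
        (name, source, target)
        for name, source in (("a", left), ("b", right))
        if source != target
    ]
    return result, casts


result_type, implicit_casts = resolve_add("INT", "FLOAT8")
print(result_type, implicit_casts)
```

With {{a}} as {{INT}} and {{b}} as {{Float8}}, the sketch picks the two-argument 
{{Float8}} version of {{add}} and records one implicit cast for {{a}}.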

The idea here is to run the system in reverse, from the result set back out to 
the scan columns. For each expression (function) in the SELECT clause, infer 
the types of the inputs. If we have the expression above, {{a + b}}, then we 
can scan all the available versions of the {{add}} function to determine the 
set of possible argument types. Since {{add}} has many versions, one for each 
numeric type, we'll need a way to say that the arguments must be numeric, 
though we don't care about the specific type. So, label the inputs with a new 
abstract type, {{Numeric}}.
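
A minimal sketch of that reverse lookup, again in Python with invented names 
(the {{OVERLOADS}} table and the {{label}} helper are assumptions, not Drill's 
function registry):

```python
# Hypothetical set of concrete types covered by the abstract Numeric label.
NUMERIC = frozenset({"INT", "BIGINT", "FLOAT4", "FLOAT8"})

# Hypothetical function registry: name -> list of argument-type signatures.
OVERLOADS = {
    "add": [
        ("INT", "INT"),
        ("BIGINT", "BIGINT"),
        ("FLOAT4", "FLOAT4"),
        ("FLOAT8", "FLOAT8"),
    ],
    "concat": [("VARCHAR", "VARCHAR")],
}


def possible_arg_types(func, arg_index):
    """Top-down: collect every type the given argument may have across overloads."""
    return frozenset(signature[arg_index] for signature in OVERLOADS[func])


def label(types):
    """Collapse a set of candidate types into a single hint for the scan."""
    if len(types) == 1:
        return next(iter(types))  # only one candidate: use the concrete type
    if types <= NUMERIC:
        return "Numeric"          # several candidates, all of them numeric
    return "Any"                  # nothing useful can be said


print(label(possible_arg_types("add", 0)))
print(label(possible_arg_types("concat", 1)))
```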

We've now labeled the arguments {{a}} and {{b}} as {{Numeric}}. We pass that 
information into the Scan operator, say the JSON reader. Now, when the JSON 
reader sees the first value of {{a}} as null, and finds that {{b}} is missing, 
it has the context to choose a suitable type, say {{Float8}} or {{BigInt}} 
(the two numeric types that the JSON reader uses).
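
To make the reader side concrete, here is a hedged Python sketch of how a 
reader might consult such a hint. The function names, the {{UNKNOWN}} fallback 
and the choice of {{FLOAT8}} as the default for {{Numeric}} are assumptions for 
illustration only; the actual policy is an open design question.

```python
def choose_column_type(value, hint=None, numeric_default="FLOAT8"):
    """Pick a column type from an observed value, falling back to a pushed-down hint."""
    # An actual value always wins over the hint.
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "FLOAT8"
    if isinstance(value, str):
        return "VARCHAR"
    # value is None: the field was null or missing, so only the hint can help.
    if hint == "Numeric":
        return numeric_default  # assumed policy: one fixed numeric type
    if hint is not None:
        return hint             # a concrete hint, for example from a CAST
    return "UNKNOWN"            # no data and no hint


def types_for_record(record, hints):
    """Choose a type for every hinted column, whether or not the record has it."""
    return {
        column: choose_column_type(record.get(column), hint)
        for column, hint in hints.items()
    }


# First record: a is null, b is missing entirely.
print(types_for_record({"a": None}, {"a": "Numeric", "b": "Numeric"}))
```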

As we can see, Cast is just a special case: one in which the type is narrowed 
down to one very specific type. That is, {{CAST(a AS INT)}} says not just that 
{{a}} is numeric, but that it is {{INT}}.
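
One way to picture this is to treat every hint as a set of allowed types, so 
that a cast contributes a one-element set. The Python below is only a sketch 
under that assumption; {{NUMERIC}}, {{from_cast}} and {{combine}} are invented 
names.

```python
# Hypothetical set of concrete types behind the abstract Numeric label.
NUMERIC = frozenset({"INT", "BIGINT", "FLOAT4", "FLOAT8"})


def from_cast(target):
    """A CAST narrows the column to exactly one type."""
    return frozenset({target})


def combine(*constraints):
    """Intersect all constraints collected for one column."""
    result = constraints[0]
    for constraint in constraints[1:]:
        result = result & constraint
    return result


# Column a is used both in a + b (Numeric) and in CAST(a AS INT).
print(sorted(combine(NUMERIC, from_cast("INT"))))
# An empty result would mean the query's uses of the column contradict each other.
print(sorted(combine(NUMERIC, from_cast("VARCHAR"))))
```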

While this is all very useful, it still leaves ambiguity. In the case above, 
if all we know is that {{a}} is numeric, the first reader, the one that sees a 
{{null}} value, can choose {{BigInt}}. But if another reader (or a later 
record) actually has the value as {{Float8}}, we've still got problems.

The result is a "bounce" algorithm: do a top-down tree traversal of the parse 
tree to infer possible expression types. Then, at runtime, continue to use the 
bottom-up traversal to infer actual types.
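
The two passes can be sketched over a toy expression tree as follows. This is 
Python for illustration only: the tuple-based tree, the {{ARG_HINTS}} table and 
the widening order are assumptions, not Drill's planner structures.

```python
# Hypothetical widening order and per-function argument hints.
WIDENING = ["INT", "BIGINT", "FLOAT8"]
ARG_HINTS = {"add": "Numeric"}

# Tree nodes: ("col", name), ("cast", target_type, child), ("call", func, *args)


def push_down(expr, expected=None, hints=None):
    """Plan time, top-down: carry the expected type from each node to its columns."""
    hints = {} if hints is None else hints
    kind = expr[0]
    if kind == "col":
        if expected is not None:
            hints[expr[1]] = expected
    elif kind == "cast":
        push_down(expr[2], expr[1], hints)  # a cast pins one exact type
    elif kind == "call":
        for arg in expr[2:]:
            push_down(arg, ARG_HINTS[expr[1]], hints)
    return hints


def infer_up(expr, actual):
    """Run time, bottom-up: derive the result type from the types actually read."""
    kind = expr[0]
    if kind == "col":
        return actual[expr[1]]
    if kind == "cast":
        return expr[1]
    # Only add is modelled here: its result is the widest argument type.
    arg_types = [infer_up(arg, actual) for arg in expr[2:]]
    return max(arg_types, key=WIDENING.index)


# SELECT a + CAST(b AS INT)
expr = ("call", "add", ("col", "a"), ("cast", "INT", ("col", "b")))
hints = push_down(expr)
print(hints)
print(infer_up(expr, {"a": "FLOAT8", "b": "INT"}))
```

The first pass produces only hints (here {{Numeric}} for {{a}} and {{INT}} for 
{{b}}); the second pass still decides the actual result type from what the 
readers produced.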

> Enable pushing of cast expressions to the scanner for better schema discovery.
> ------------------------------------------------------------------------------
>
>                 Key: DRILL-6312
>                 URL: https://issues.apache.org/jira/browse/DRILL-6312
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators, Query Planning & 
> Optimization
>    Affects Versions: 1.13.0
>            Reporter: Hanumath Rao Maduri
>            Priority: Major
>
> Drill is a schema-less engine which tries to infer the schema from disparate 
> sources at read time. Currently the scanners infer the schema for each 
> batch depending upon the data for that column in the corresponding batch. 
> This solves many use cases but can error out when the data is too different 
> between batches, for example int and array[int]. (There are other cases as 
> well; this is just one example.)
> There is also a mechanism to create a view by type casting the columns to 
> the appropriate type. This solves issues in some cases but fails in many 
> others, because the cast expression is not pushed down to the scanner but 
> stays at the project, filter, or other operators further up the query plan.
> This JIRA is to fix this by propagating the type information embedded in the 
> cast function to the scanners so that scanners can cast the incoming data 
> appropriately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
