[jira] [Commented] (DRILL-6312) Enable pushing of cast expressions to the scanner for better schema discovery.

Paul Rogers (JIRA) Sat, 07 Apr 2018 22:57:49 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429635#comment-16429635
 ]


Paul Rogers commented on DRILL-6312:
------------------------------------

While we are focusing on the type of pesky fields, data processing systems 
often allow other forms of column definitions.

For example, it is often helpful to combine or split columns. Suppose I have a 
field like the following from a web log:

{noformat}
GET http://mySite.com/path/to/asset
{noformat}

I may want to split this into four fields: HTTP operation ("GET"), service type 
("http"), host ("mySite.com") and asset ("/path/to/asset").
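As a rough illustration of that split, here is a minimal Python sketch. It is not Drill code: the helper name `split_request` and the output field names are assumptions made up for this example, and it relies only on the standard library's `urllib.parse.urlsplit`.

```python
from urllib.parse import urlsplit


def split_request(line):
    """Split a 'GET http://host/path' log field into four fields.

    Hypothetical helper for illustration; the field names are assumptions.
    """
    # The HTTP operation is everything before the first space.
    operation, url = line.split(" ", 1)
    parts = urlsplit(url)
    return {
        "operation": operation,       # "GET"
        "serviceType": parts.scheme,  # "http"
        "host": parts.netloc,         # "mySite.com" (netloc keeps the original case)
        "asset": parts.path,          # "/path/to/asset"
    }


fields = split_request("GET http://mySite.com/path/to/asset")
```

In a real reader this logic would more likely live in a UDF or in the scanner itself; the sketch only shows the shape of the transform.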

Or, I may have two fields that give the date and time:

{noformat}
2018-04-07, 10:13:43.345
{noformat}

And I may want to combine them into a single date-time type.
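A minimal Python sketch of the combine step, assuming the two fields arrive as separate strings in exactly the formats shown above. The helper name `combine_date_time` is hypothetical, and this is not how Drill itself would do it.

```python
from datetime import datetime


def combine_date_time(date_field, time_field):
    """Combine separate date and time strings into one datetime.

    Hypothetical helper; assumes 'YYYY-MM-DD' and 'HH:MM:SS.fff' inputs.
    """
    # %f accepts one to six fractional digits, so ".345" parses as 345 ms.
    return datetime.strptime(
        f"{date_field} {time_field}", "%Y-%m-%d %H:%M:%S.%f"
    )


stamp = combine_date_time("2018-04-07", "10:13:43.345")
```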

A handy technique is to define a computed column that does the work. If the 
computed column can call a UDF, then pretty much any transform is possible. 
Here is a very simple case for a line item:

{noformat}
price * quantity AS extendedPrice
{noformat}
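To make the idea concrete, here is a small Python sketch of computed columns applied to a single row. The function `add_computed_columns` and the mapping of column name to function are assumptions for illustration only; they stand in for whatever expression or UDF mechanism the engine would provide.

```python
def add_computed_columns(row, computed):
    """Return a copy of `row` with each computed column appended.

    `computed` maps a new column name to a function of the row
    (a stand-in for an expression or UDF; hypothetical design).
    """
    result = dict(row)
    for name, fn in computed.items():
        # Later computed columns can see earlier ones.
        result[name] = fn(result)
    return result


# Equivalent of: price * quantity AS extendedPrice
computed = {"extendedPrice": lambda r: r["price"] * r["quantity"]}
line_item = add_computed_columns({"price": 5, "quantity": 3}, computed)
```

Because each entry is an arbitrary function of the row, the same mechanism could cover the split and combine cases as well as simple arithmetic.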


> Enable pushing of cast expressions to the scanner for better schema discovery.
> ------------------------------------------------------------------------------
>
>                 Key: DRILL-6312
>                 URL: https://issues.apache.org/jira/browse/DRILL-6312
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators, Query Planning & 
> Optimization
>    Affects Versions: 1.13.0
>            Reporter: Hanumath Rao Maduri
>            Priority: Major
>
> Drill is a schema-less engine which tries to infer the schema from disparate 
> sources at read time. Currently the scanners infer the schema for each 
> batch depending upon the data for that column in the corresponding batch. 
> This solves many use cases but can error out when the data is too different 
> between batches, such as int and array[int] (there are other cases as well; 
> this is just one example).
> There is also a mechanism to create a view by casting the columns to the 
> appropriate type. This solves issues in some cases but fails in many other 
> cases, because the cast expression is not pushed down to the scanner but 
> stays at the project, filter, or other operators further up the query plan.
> This JIRA is to fix this by propagating the type information embedded in the 
> cast function to the scanners so that scanners can cast the incoming data 
> appropriately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
