[jira] [Commented] (DRILL-6383) View column types, modes are plan-time guesses, not actual types

Aman Sinha (JIRA) Wed, 02 May 2018 23:15:31 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16461973#comment-16461973
 ]


Aman Sinha commented on DRILL-6383:
-----------------------------------

Drill views don't execute the query (not even a limit 0) which is by design, 
afaik.  So the primary use case for views is the one with CAST expressions 
which as your view5 shows has the correct type inference.  The nullability is 
not expressible in a CAST, so Drill uses the default of 'less strict' 
nullability i.e always assume nullable.  For schema-less systems, I would think 
that this is a reasonable choice. 

One could argue that even doing a limit 0 query is no guarantee that the schema 
would match the type and mode of what is returned at run time since it depends 
on the order in which files are read.  

Note that for BI tools we recommend creating views with CAST and the tool (at 
least Tableau) wraps the entire complex query with a LIMIT 0 which hints Drill 
to use the type inferencing directly at planning time without going through 
even the build schema phase of execution.  

> View column types, modes are plan-time guesses, not actual types
> ----------------------------------------------------------------
>
>                 Key: DRILL-6383
>                 URL: https://issues.apache.org/jira/browse/DRILL-6383
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.13.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> Create a view views and look at the list of columns within the view. You'll 
> see that they are often wrong in name, type and mode.
> Consider a very simple CSV file with headers:
> {noformat}
> custId,name,balance,status
> 123,Fred,456.78
> 125,Betty,98.76,VIP
> 128,Barney,1.23,PAST DUE,30
> {noformat}
> Define the simplest possible view:
> {noformat}
> CREATE VIEW myView2 AS SELECT * FROM `csvh/cust.csvh`;
> {noformat}
> Then look at the view file:
> {noformat}
> {
>   "name" : "myView2",
>   "sql" : "SELECT *\nFROM `csvh/cust.csvh`",
>   "fields" : [ {
>     "name" : "**",
>     "type" : "DYNAMIC_STAR",
>     "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> It is clear that the view simply captured the plan-time list of the new 
> double-star for the wildcard. Since this is not a true type, it should not 
> have an `isNullable` attribute.
> OK, we have to spell out the columns:
> {noformat}
> CREATE VIEW myView3 AS SELECT custId  FROM `csvh/cust.csvh`;
> {noformat}
> Let's look at the view file:
> {noformat}
> {
>   "name" : "myView3",
>   "sql" : "SELECT `custId`\nFROM `csvh/cust.csvh`",
>   "fields" : [ {
>     "name" : "custId",
>     "type" : "ANY",
>     "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> The name is correct. The type is `ANY`, which is wrong. Since this is a CSV 
> file, the column type is `VARCHAR`. Further, because this is a CSV file which 
> headers, the mode is REQUIRED, but is listed as nullable. To verify:
> {noformat}
> SELECT sqlTypeOf(custId), modeOf(custId) FROM myView3 LIMIT 1;
> +--------------------+-----------+
> |       EXPR$0       |  EXPR$1   |
> +--------------------+-----------+
> | CHARACTER VARYING  | NOT NULL  |
> +--------------------+-----------+
> {noformat}
> Now, let's try a CSV file without headers:
> {noformat}
> 123,Fred,456.78
> 125,Betty,98.76,VIP
> {noformat}
> {noformat}
> CREATE VIEW myView4 AS SELECT columns FROM `csv/cust.csv`;
> SELECT * FROM myView4;
> +--------------------------------+
> |            columns             |
> +--------------------------------+
> | ["123","Fred","456.78"]        |
> | ["125","Betty","98.76","VIP"]  |
> +--------------------------------+
> {noformat}
> Let's look at the view file:
> {noformat}
> {
>   "name" : "myView4",
>   "sql" : "SELECT `columns`\nFROM `csv/cust.csv`",
>   "fields" : [ {
>     "name" : "columns",
>     "type" : "ANY",
>     "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> This is almost non-sensical. `columns` is reported as type `ANY` and 
> nullable. But, `columns` is Repeated `VARCHAR` and repeated types cannot be 
> nullable.
> The conclusion is that the type information is virtually worthless and the 
> `isNullable` information is worse than worthless: it is plain wrong.
> The type information is valid only if the planner can inver types:
> {noformat}
> CREATE VIEW myView5 AS
>   SELECT CAST(custId AS INTEGER) AS custId FROM `csvh/cust.csvh`;
> {noformat}
> The view file:
> {noformat}
> {
>   "name" : "myView5",
>   "sql" : "SELECT CAST(`custId` AS INTEGER) AS `custId`\nFROM 
> `csvh/cust.csvh`",
>   "fields" : [ {
>     "name" : "custId",
>     "type" : "INTEGER",
>     "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> Note that the `type` is inferred from the cast, but `isNullable` is wrong 
> because the underlying column is non-nullable:
> {noformat}
> SELECT modeOf(custId) FROM myView5 LIMIT 1;
> +-----------+
> |  EXPR$0   |
> +-----------+
> | NOT NULL  |
> +-----------+
> {noformat}
> Expected that Drill would run the underlying query as a `LIMIT 0` query to 
> extract the actual column types, and use that in the view.
> Or, expected that Drill would simply omit the column list from the view if 
> the data is meaningless.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (DRILL-6383) View column types, modes are plan-time guesses, not actual types

Reply via email to