[jira] [Commented] (DRILL-6383) View column types, modes are plan-time guesses, not actual types

Aman Sinha (JIRA) Thu, 03 May 2018 23:26:51 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463417#comment-16463417
 ]


Aman Sinha commented on DRILL-6383:
-----------------------------------

This link [1] has some details of  views being used by Tableau.  My 
understanding is that in the data exploration phase, the user previews a sample 
of the data and defines a view (this can also be done through Drill Explorer).  
So in that sense, there is indeed a query execution phase before defining the 
view but it is initiated by the user.  Since Drill does not have a CREATE TABLE 
DDL,  the CREATE VIEW is the closest abstraction which is used by Tableau.   
One option to improve the accuracy of the views is to extend the syntax of view 
creation to include a sampling clause which would basically mimic what the user 
was doing in the data exploration phase. 

I don't have experience with the client side, so I am not sure what would be 
the impact of removing some pieces of information from the view definition.   
At the very least column names should be there.  For the data type, If CAST is 
specified, then the data type should be there.  If we want to make nullability 
more accurate, we could mark it UNKNOWN by default but since the user knows 
best  - e.g the rowkey column in HBASE is non-nullable, they would mark it 
non-nullable. 

Also cc-ing [~aengelbrecht] who has worked with the Drill+BI tools to shed more 
light. 

[1] [http://drill.apache.org/docs/tableau-examples]

> View column types, modes are plan-time guesses, not actual types
> ----------------------------------------------------------------
>
>                 Key: DRILL-6383
>                 URL: https://issues.apache.org/jira/browse/DRILL-6383
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.13.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> Create a view views and look at the list of columns within the view. You'll 
> see that they are often wrong in name, type and mode.
> Consider a very simple CSV file with headers:
> {noformat}
> custId,name,balance,status
> 123,Fred,456.78
> 125,Betty,98.76,VIP
> 128,Barney,1.23,PAST DUE,30
> {noformat}
> Define the simplest possible view:
> {noformat}
> CREATE VIEW myView2 AS SELECT * FROM `csvh/cust.csvh`;
> {noformat}
> Then look at the view file:
> {noformat}
> {
>   "name" : "myView2",
>   "sql" : "SELECT *\nFROM `csvh/cust.csvh`",
>   "fields" : [ {
>     "name" : "**",
>     "type" : "DYNAMIC_STAR",
>     "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> It is clear that the view simply captured the plan-time list of the new 
> double-star for the wildcard. Since this is not a true type, it should not 
> have an `isNullable` attribute.
> OK, we have to spell out the columns:
> {noformat}
> CREATE VIEW myView3 AS SELECT custId  FROM `csvh/cust.csvh`;
> {noformat}
> Let's look at the view file:
> {noformat}
> {
>   "name" : "myView3",
>   "sql" : "SELECT `custId`\nFROM `csvh/cust.csvh`",
>   "fields" : [ {
>     "name" : "custId",
>     "type" : "ANY",
>     "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> The name is correct. The type is `ANY`, which is wrong. Since this is a CSV 
> file, the column type is `VARCHAR`. Further, because this is a CSV file which 
> headers, the mode is REQUIRED, but is listed as nullable. To verify:
> {noformat}
> SELECT sqlTypeOf(custId), modeOf(custId) FROM myView3 LIMIT 1;
> +--------------------+-----------+
> |       EXPR$0       |  EXPR$1   |
> +--------------------+-----------+
> | CHARACTER VARYING  | NOT NULL  |
> +--------------------+-----------+
> {noformat}
> Now, let's try a CSV file without headers:
> {noformat}
> 123,Fred,456.78
> 125,Betty,98.76,VIP
> {noformat}
> {noformat}
> CREATE VIEW myView4 AS SELECT columns FROM `csv/cust.csv`;
> SELECT * FROM myView4;
> +--------------------------------+
> |            columns             |
> +--------------------------------+
> | ["123","Fred","456.78"]        |
> | ["125","Betty","98.76","VIP"]  |
> +--------------------------------+
> {noformat}
> Let's look at the view file:
> {noformat}
> {
>   "name" : "myView4",
>   "sql" : "SELECT `columns`\nFROM `csv/cust.csv`",
>   "fields" : [ {
>     "name" : "columns",
>     "type" : "ANY",
>     "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> This is almost non-sensical. `columns` is reported as type `ANY` and 
> nullable. But, `columns` is Repeated `VARCHAR` and repeated types cannot be 
> nullable.
> The conclusion is that the type information is virtually worthless and the 
> `isNullable` information is worse than worthless: it is plain wrong.
> The type information is valid only if the planner can inver types:
> {noformat}
> CREATE VIEW myView5 AS
>   SELECT CAST(custId AS INTEGER) AS custId FROM `csvh/cust.csvh`;
> {noformat}
> The view file:
> {noformat}
> {
>   "name" : "myView5",
>   "sql" : "SELECT CAST(`custId` AS INTEGER) AS `custId`\nFROM 
> `csvh/cust.csvh`",
>   "fields" : [ {
>     "name" : "custId",
>     "type" : "INTEGER",
>     "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> Note that the `type` is inferred from the cast, but `isNullable` is wrong 
> because the underlying column is non-nullable:
> {noformat}
> SELECT modeOf(custId) FROM myView5 LIMIT 1;
> +-----------+
> |  EXPR$0   |
> +-----------+
> | NOT NULL  |
> +-----------+
> {noformat}
> Expected that Drill would run the underlying query as a `LIMIT 0` query to 
> extract the actual column types, and use that in the view.
> Or, expected that Drill would simply omit the column list from the view if 
> the data is meaningless.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (DRILL-6383) View column types, modes are plan-time guesses, not actual types

Reply via email to