[
https://issues.apache.org/jira/browse/DRILL-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463417#comment-16463417
]
Aman Sinha commented on DRILL-6383:
-----------------------------------
This link [1] has some details of views being used by Tableau. My
understanding is that in the data exploration phase, the user previews a sample
of the data and defines a view (this can also be done through Drill Explorer).
So in that sense, there is indeed a query execution phase before defining the
view but it is initiated by the user. Since Drill does not have a CREATE TABLE
DDL, CREATE VIEW is the closest abstraction, and it is the one Tableau uses.
One option to improve the accuracy of the views is to extend the syntax of view
creation to include a sampling clause which would basically mimic what the user
was doing in the data exploration phase.
I don't have experience with the client side, so I am not sure what would be
the impact of removing some pieces of information from the view definition.
At the very least, column names should be there. For the data type, if CAST is
specified, then the data type should be there. If we want to make nullability
more accurate, we could mark it UNKNOWN by default, but since the user knows
best (e.g., the rowkey column in HBase is non-nullable), they could mark it
non-nullable.
Also cc-ing [~aengelbrecht] who has worked with the Drill+BI tools to shed more
light.
[1] [http://drill.apache.org/docs/tableau-examples]
> View column types, modes are plan-time guesses, not actual types
> ----------------------------------------------------------------
>
> Key: DRILL-6383
> URL: https://issues.apache.org/jira/browse/DRILL-6383
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.13.0
> Reporter: Paul Rogers
> Priority: Minor
>
> Create a view and look at the list of columns within the view. You'll
> see that they are often wrong in name, type and mode.
> Consider a very simple CSV file with headers:
> {noformat}
> custId,name,balance,status
> 123,Fred,456.78
> 125,Betty,98.76,VIP
> 128,Barney,1.23,PAST DUE,30
> {noformat}
> Define the simplest possible view:
> {noformat}
> CREATE VIEW myView2 AS SELECT * FROM `csvh/cust.csvh`;
> {noformat}
> Then look at the view file:
> {noformat}
> {
> "name" : "myView2",
> "sql" : "SELECT *\nFROM `csvh/cust.csvh`",
> "fields" : [ {
> "name" : "**",
> "type" : "DYNAMIC_STAR",
> "isNullable" : true
> } ],
> "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> It is clear that the view simply captured the plan-time placeholder for the
> wildcard (the new double-star, `**`). Since this is not a true type, it should
> not have an `isNullable` attribute.
> OK, we have to spell out the columns:
> {noformat}
> CREATE VIEW myView3 AS SELECT custId FROM `csvh/cust.csvh`;
> {noformat}
> Let's look at the view file:
> {noformat}
> {
> "name" : "myView3",
> "sql" : "SELECT `custId`\nFROM `csvh/cust.csvh`",
> "fields" : [ {
> "name" : "custId",
> "type" : "ANY",
> "isNullable" : true
> } ],
> "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> The name is correct. The type is `ANY`, which is wrong. Since this is a CSV
> file, the column type is `VARCHAR`. Further, because this is a CSV file with
> headers, the mode is REQUIRED, but is listed as nullable. To verify:
> {noformat}
> SELECT sqlTypeOf(custId), modeOf(custId) FROM myView3 LIMIT 1;
> +--------------------+-----------+
> | EXPR$0             | EXPR$1    |
> +--------------------+-----------+
> | CHARACTER VARYING  | NOT NULL  |
> +--------------------+-----------+
> {noformat}
> Now, let's try a CSV file without headers:
> {noformat}
> 123,Fred,456.78
> 125,Betty,98.76,VIP
> {noformat}
> {noformat}
> CREATE VIEW myView4 AS SELECT columns FROM `csv/cust.csv`;
> SELECT * FROM myView4;
> +--------------------------------+
> | columns                        |
> +--------------------------------+
> | ["123","Fred","456.78"]        |
> | ["125","Betty","98.76","VIP"]  |
> +--------------------------------+
> {noformat}
> Let's look at the view file:
> {noformat}
> {
> "name" : "myView4",
> "sql" : "SELECT `columns`\nFROM `csv/cust.csv`",
> "fields" : [ {
> "name" : "columns",
> "type" : "ANY",
> "isNullable" : true
> } ],
> "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> This is almost nonsensical. `columns` is reported as type `ANY` and
> nullable. But `columns` is a repeated `VARCHAR`, and repeated types cannot be
> nullable.
> The conclusion is that the type information is virtually worthless and the
> `isNullable` information is worse than worthless: it is plain wrong.
> The type information is valid only if the planner can infer types:
> {noformat}
> CREATE VIEW myView5 AS
> SELECT CAST(custId AS INTEGER) AS custId FROM `csvh/cust.csvh`;
> {noformat}
> The view file:
> {noformat}
> {
> "name" : "myView5",
> "sql" : "SELECT CAST(`custId` AS INTEGER) AS `custId`\nFROM `csvh/cust.csvh`",
> "fields" : [ {
> "name" : "custId",
> "type" : "INTEGER",
> "isNullable" : true
> } ],
> "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> Note that the `type` is inferred from the cast, but `isNullable` is wrong
> because the underlying column is non-nullable:
> {noformat}
> SELECT modeOf(custId) FROM myView5 LIMIT 1;
> +-----------+
> | EXPR$0    |
> +-----------+
> | NOT NULL  |
> +-----------+
> {noformat}
> Expected that Drill would run the underlying query as a `LIMIT 0` query to
> extract the actual column types, and use that in the view.
> Or, expected that Drill would simply omit the column list from the view if
> the data is meaningless.
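> As a rough sketch of the `LIMIT 0` idea above (an illustration only; nothing
> here implies that Drill issues such a query when a view is created), the
> probe for myView3 could be as simple as:
> {noformat}
> SELECT custId FROM `csvh/cust.csvh` LIMIT 0;
> {noformat}
> The query returns no rows, only a result schema. Under this proposal, that
> schema, rather than the plan-time guess, is what would be recorded in the
> view's `fields` list.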