[
https://issues.apache.org/jira/browse/DRILL-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Venki Korukanti updated DRILL-2342:
-----------------------------------
Attachment: DRILL-2342-1.patch
Attaching a patch that stores the nullability property of columns in the view
definition.
We don't know the types of the parquet files until execution. When creating a
view, we go as far as rel conversion to get the output type of the relation in
the view definition. Because the types are not known during planning, we return
ANY, which defaults to NULLABLE. So even if the underlying file contains
required types, we still treat the field as NULLABLE. This can be changed to
show the correct nullability, but that requires reading the parquet file during
planning to get the schema, which I believe is tracked in a separate JIRA.
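To illustrate the behaviour described above, here is a minimal sketch in Python
(not Drill's actual Java code). The names ViewField, planned_field,
to_view_definition and from_view_definition, and the isNullable key, are
hypothetical. Falling back to nullable when the property is missing is also an
assumption, not something the patch is known to do.

```python
from dataclasses import dataclass

ANY = "ANY"  # placeholder type used when the real type is unknown at planning time


@dataclass(frozen=True)
class ViewField:
    name: str
    sql_type: str
    is_nullable: bool


def planned_field(name: str, input_type: str, input_nullable: bool, cast_to: str) -> ViewField:
    """Derive a view column from CAST(input AS cast_to) at planning time."""
    # An input of unknown type (ANY) is treated as nullable, whatever the file says.
    nullable = True if input_type == ANY else input_nullable
    return ViewField(name, cast_to, nullable)


def to_view_definition(fields):
    """Serialize fields, keeping nullability alongside name and type."""
    return [{"name": f.name, "type": f.sql_type, "isNullable": f.is_nullable} for f in fields]


def from_view_definition(entries):
    """Rebuild fields; a missing isNullable entry falls back to nullable (assumption)."""
    return [ViewField(e["name"], e["type"], e.get("isNullable", True)) for e in entries]


# The three view columns from the issue: parquet inputs are ANY during planning.
fields = [
    planned_field("EXPR$0", ANY, False, "INTEGER"),
    planned_field("EXPR$1", ANY, False, "VARCHAR"),
    planned_field("EXPR$2", ANY, False, "DATE"),
]
restored = from_view_definition(to_view_definition(fields))
for f in restored:
    print(f.name, f.sql_type, "YES" if f.is_nullable else "NO")
```

In this model every column of the view comes back as nullable, because the
nullability recorded at view creation survives the round trip through the
stored definition.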
> Nullability property of the view created from parquet file is not correct
> -------------------------------------------------------------------------
>
> Key: DRILL-2342
> URL: https://issues.apache.org/jira/browse/DRILL-2342
> Project: Apache Drill
> Issue Type: Bug
> Components: Metadata
> Affects Versions: 0.8.0
> Reporter: Victoria Markman
> Assignee: Venki Korukanti
> Priority: Critical
> Fix For: 0.9.0
>
> Attachments: DRILL-2342-1.patch, t1.parquet
>
>
> Here is my t1 table definition:
> {code}
> message root {
> optional int32 a1;
> optional binary b1 (UTF8);
> optional int32 c1 (DATE);
> }
> {code}
> I created a view on top of it:
> {code}
> 0: jdbc:drill:schema=dfs> create view v1 as select cast(a1 as int), cast(b1 as varchar(10)), cast(c1 as date) from t1;
> +------------+------------+
> | ok | summary |
> +------------+------------+
> | true | View 'v1' created successfully in 'dfs.aggregation' schema |
> +------------+------------+
> 1 row selected (0.096 seconds)
> {code}
> IS_NULLABLE says 'NO', which is incorrect.
> {code}
> 0: jdbc:drill:schema=dfs> describe v1;
> +-------------+------------+-------------+
> | COLUMN_NAME | DATA_TYPE | IS_NULLABLE |
> +-------------+------------+-------------+
> | EXPR$0 | INTEGER | NO |
> | EXPR$1 | VARCHAR | NO |
> | EXPR$2 | DATE | NO |
> +-------------+------------+-------------+
> 3 rows selected (0.067 seconds)
> {code}
> This is potentially dangerous: if Calcite decided tomorrow to take advantage
> of this property and added an optimization that drops an "is null" predicate
> when the column is not nullable, the query "select * from v1 where x is null"
> would return an incorrect result.
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select * from v1 where z is null;
> +------------+------------+
> | text | json |
> +------------+------------+
> | 00-00 Screen
> 00-01 Project(x=[$0], y=[$1], z=[$2])
> 00-02 SelectionVectorRemover
> 00-03 Filter(condition=[IS NULL($2)])
> 00-04 Project(x=[CAST($2):ANY NOT NULL], y=[CAST($1):ANY NOT NULL], z=[CAST($0):ANY NOT NULL])
> 00-05 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/aggregation/t1]], selectionRoot=/aggregation/t1, numFiles=1, columns=[`a1`, `b1`, `c1`]]])
> {code}
> It seems to me that columns in views should always be nullable.
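
The risk raised in the issue can be sketched with a toy model in Python. This
is not Calcite or Drill code: simplify_is_null is a hypothetical stand-in for
an optimizer rule that folds "IS NULL" to false on a column declared NOT NULL.

```python
def simplify_is_null(column, nullable_by_column):
    """Return a row predicate for `column IS NULL`.

    If the metadata declares the column NOT NULL, the predicate is folded
    to a constant False, as a hypothetical optimizer rule might do.
    """
    if not nullable_by_column[column]:
        return lambda row: False
    return lambda row: row[column] is None


# Data as it really is in the optional parquet column: one null value.
rows = [{"z": 1}, {"z": None}]

# Wrong metadata (IS_NULLABLE = NO): the null row is silently lost.
with_wrong_metadata = [r for r in rows if simplify_is_null("z", {"z": False})(r)]

# Correct metadata (IS_NULLABLE = YES): the filter is evaluated per row.
with_correct_metadata = [r for r in rows if simplify_is_null("z", {"z": True})(r)]

print(with_wrong_metadata)
print(with_correct_metadata)
```

With the wrong metadata the filter returns no rows even though a null value
exists, which is the incorrect result the issue warns about.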