[
https://issues.apache.org/jira/browse/HIVE-20633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Madhan Neethiraj updated HIVE-20633:
------------------------------------
Issue Type: Bug (was: New Feature)
> Incorrect column lineage: each output column has input from *all columns* of
> the input table
> --------------------------------------------------------------------------------------------
>
> Key: HIVE-20633
> URL: https://issues.apache.org/jira/browse/HIVE-20633
> Project: Hive
> Issue Type: Bug
> Components: HiveServer2
> Affects Versions: 1.2.2
> Reporter: Madhan Neethiraj
> Priority: Critical
>
> Column lineage details made available to post hooks are incorrect for
> certain queries, such as the following INSERT:
> {noformat}
> CREATE TABLE source_tbl(col_001 INT, col_002 INT, col_003 INT);
> CREATE TABLE target_tbl(col_001 INT, col_002 INT, col_003 INT);
> INSERT INTO target_tbl
> SELECT v1.col_001, v1.col_002, v1.col_003
> FROM (SELECT col_001, col_002, col_003, ROW_NUMBER() OVER() AS r_num
>       FROM source_tbl) v1;
> {noformat}
> Below are the lineage details passed to post hooks (such as the Atlas hook)
> via HookContext.getLinfo(). There are 3 entries, one for each target table
> column. Note that the dependency for each column lists all columns of the
> source table, including its virtual columns.
> {noformat}
> DependencyKey=default.target_tbl:FieldSchema(name:col_001, type:int, comment:null)
> Dependency=[SCRIPT]
> [default.source_tbl(src):FieldSchema(name:col_001, type:int, comment:null),
>  default.source_tbl(src):FieldSchema(name:col_002, type:int, comment:null),
>  default.source_tbl(src):FieldSchema(name:col_003, type:int, comment:null),
>  default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE, type:bigint, comment:),
>  default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME, type:string, comment:),
>  default.source_tbl(src):FieldSchema(name:ROW__ID, type:struct<transactionId:bigint,bucketId:int,rowId:bigint>, comment:)
> ];
>
> DependencyKey=default.target_tbl:FieldSchema(name:col_002, type:int, comment:null)
> Dependency=[SCRIPT]
> [default.source_tbl(src):FieldSchema(name:col_001, type:int, comment:null),
>  default.source_tbl(src):FieldSchema(name:col_002, type:int, comment:null),
>  default.source_tbl(src):FieldSchema(name:col_003, type:int, comment:null),
>  default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE, type:bigint, comment:),
>  default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME, type:string, comment:),
>  default.source_tbl(src):FieldSchema(name:ROW__ID, type:struct<transactionId:bigint,bucketId:int,rowId:bigint>, comment:)
> ];
>
> DependencyKey=default.target_tbl:FieldSchema(name:col_003, type:int, comment:null)
> Dependency=[SCRIPT]
> [default.source_tbl(src):FieldSchema(name:col_001, type:int, comment:null),
>  default.source_tbl(src):FieldSchema(name:col_002, type:int, comment:null),
>  default.source_tbl(src):FieldSchema(name:col_003, type:int, comment:null),
>  default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE, type:bigint, comment:),
>  default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME, type:string, comment:),
>  default.source_tbl(src):FieldSchema(name:ROW__ID, type:struct<transactionId:bigint,bucketId:int,rowId:bigint>, comment:)
> ];
> {noformat}
> When the INSERT statement does not include "ROW_NUMBER() OVER() AS r_num",
> the lineage details look correct.
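> For comparison, the variant without the window function might look like the
> following. This is a sketch reconstructed from the INSERT above by dropping
> the r_num column; it is an assumption, not a query taken from the report,
> and the expected one-to-one lineage is inferred from the issue title.
> {noformat}
> -- Assumed control query: same INSERT, without ROW_NUMBER() OVER() AS r_num.
> -- Expected (inferred): each target column depends only on the matching
> -- source column, e.g. target_tbl.col_001 <- source_tbl.col_001.
> INSERT INTO target_tbl
> SELECT v1.col_001, v1.col_002, v1.col_003
> FROM (SELECT col_001, col_002, col_003 FROM source_tbl) v1;
> {noformat}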
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)