[ https://issues.apache.org/jira/browse/HIVE-20633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Madhan Neethiraj updated HIVE-20633:
------------------------------------
    Issue Type: Bug  (was: New Feature)

> Incorrect column lineage: each output column has input from *all columns* of the input table
> --------------------------------------------------------------------------------------------
>
>                 Key: HIVE-20633
>                 URL: https://issues.apache.org/jira/browse/HIVE-20633
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2
>    Affects Versions: 1.2.2
>            Reporter: Madhan Neethiraj
>            Priority: Critical
>
> The column lineage details made available to post hooks are incorrect for
> certain queries, such as the following INSERT:
> {noformat}
> CREATE TABLE source_tbl(col_001 INT, col_002 INT, col_003 INT);
> CREATE TABLE target_tbl(col_001 INT, col_002 INT, col_003 INT);
> INSERT INTO target_tbl
>   SELECT v1.col_001, v1.col_002, v1.col_003
>   FROM (SELECT col_001, col_002, col_003, ROW_NUMBER() OVER() AS r_num
>         FROM source_tbl) v1;
> {noformat}
> Below are the details of the lineage given to post hooks (like the Atlas hook)
> via HookContext.getLinfo(). It contains 3 entries, one for each target table
> column. Note that the dependency for each target column lists all columns of
> the source table.
> {noformat}
> DependencyKey=default.target_tbl:FieldSchema(name:col_001, type:int, comment:null)
> Dependency=[SCRIPT]
>   [default.source_tbl(src):FieldSchema(name:col_001, type:int, comment:null),
>    default.source_tbl(src):FieldSchema(name:col_002, type:int, comment:null),
>    default.source_tbl(src):FieldSchema(name:col_003, type:int, comment:null),
>    default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE, type:bigint, comment:),
>    default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME, type:string, comment:),
>    default.source_tbl(src):FieldSchema(name:ROW__ID, type:struct<transactionId:bigint,bucketId:int,rowId:bigint>, comment:)
>   ];
>
> DependencyKey=default.target_tbl:FieldSchema(name:col_002, type:int, comment:null)
> Dependency=[SCRIPT]
>   [default.source_tbl(src):FieldSchema(name:col_001, type:int, comment:null),
>    default.source_tbl(src):FieldSchema(name:col_002, type:int, comment:null),
>    default.source_tbl(src):FieldSchema(name:col_003, type:int, comment:null),
>    default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE, type:bigint, comment:),
>    default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME, type:string, comment:),
>    default.source_tbl(src):FieldSchema(name:ROW__ID, type:struct<transactionId:bigint,bucketId:int,rowId:bigint>, comment:)
>   ];
>
> DependencyKey=default.target_tbl:FieldSchema(name:col_003, type:int, comment:null)
> Dependency=[SCRIPT]
>   [default.source_tbl(src):FieldSchema(name:col_001, type:int, comment:null),
>    default.source_tbl(src):FieldSchema(name:col_002, type:int, comment:null),
>    default.source_tbl(src):FieldSchema(name:col_003, type:int, comment:null),
>    default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE, type:bigint, comment:),
>    default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME, type:string, comment:),
>    default.source_tbl(src):FieldSchema(name:ROW__ID, type:struct<transactionId:bigint,bucketId:int,rowId:bigint>, comment:)
>   ];
> {noformat}
> When the INSERT statement doesn't include "ROW_NUMBER() OVER() AS r_num", the
> lineage details look correct.
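
To make the symptom concrete, here is a minimal Python sketch that compares the reported lineage with the column-to-column mapping one would expect for this INSERT. It is only an illustration: the dictionaries are a hand-simplified model of the dump above (virtual columns such as ROW__ID are left out), the expected mapping is an assumption based on the outer SELECT not referencing r_num, and the names `reported`, `expected` and `over_broad` are hypothetical and not part of Hive.

```python
# Simplified, hand-written model of the lineage dump above.
# Keys are target_tbl columns, values are the source_tbl columns they depend on.
# Virtual columns (BLOCK__OFFSET__INSIDE__FILE, INPUT__FILE__NAME, ROW__ID) are omitted.
COLUMNS = ("col_001", "col_002", "col_003")

# What the post hook reportedly receives: every target column depends on all source columns.
reported = {target: set(COLUMNS) for target in COLUMNS}

# Assumed expectation: each target column depends only on the same-named source column,
# because the outer SELECT never references r_num.
expected = {target: {target} for target in COLUMNS}


def over_broad(lineage, baseline):
    """Return, per target column, the sorted dependencies not present in the baseline."""
    extras = {}
    for target, deps in lineage.items():
        unexpected = deps - baseline.get(target, set())
        if unexpected:
            extras[target] = sorted(unexpected)
    return extras


if __name__ == "__main__":
    for target, extra in over_broad(reported, expected).items():
        print(f"{target}: unexpected dependencies {extra}")
```

A check of this kind could be run against lineage captured from a hook, once with and once without the ROW_NUMBER() window function, to confirm that the extra dependencies appear only in the first case.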