Madhan Neethiraj created HIVE-20633:
---------------------------------------

             Summary: Incorrect column lineage: each output column has input from *all columns* of the input table
                 Key: HIVE-20633
                 URL: https://issues.apache.org/jira/browse/HIVE-20633
             Project: Hive
          Issue Type: New Feature
          Components: HiveServer2
    Affects Versions: 1.2.2
            Reporter: Madhan Neethiraj


Column lineage details made available to post hooks are incorrect for certain 
queries, such as the following INSERT:

{noformat}
CREATE TABLE source_tbl(col_001 INT, col_002 INT, col_003 INT);

CREATE TABLE target_tbl(col_001 INT, col_002 INT, col_003 INT);

INSERT INTO target_tbl
SELECT v1.col_001, v1.col_002, v1.col_003
FROM (SELECT col_001, col_002, col_003, ROW_NUMBER() OVER() AS r_num
      FROM source_tbl) v1;

{noformat}

Below are the lineage details passed to post hooks (such as the Atlas hook) via 
HookContext.getLinfo(). They contain 3 entries, one for each target table 
column. Note that the dependency for each target column lists every column of 
the source table, including the virtual columns BLOCK__OFFSET__INSIDE__FILE, 
INPUT__FILE__NAME, and ROW__ID.

{noformat}
DependencyKey=default.target_tbl:FieldSchema(name:col_001, type:int, comment:null)
Dependency=[SCRIPT]
           [default.source_tbl(src):FieldSchema(name:col_001, type:int, comment:null),
            default.source_tbl(src):FieldSchema(name:col_002, type:int, comment:null),
            default.source_tbl(src):FieldSchema(name:col_003, type:int, comment:null),
            default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE, type:bigint, comment:),
            default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME, type:string, comment:),
            default.source_tbl(src):FieldSchema(name:ROW__ID, type:struct<transactionId:bigint,bucketId:int,rowId:bigint>, comment:)
           ];

DependencyKey=default.target_tbl:FieldSchema(name:col_002, type:int, comment:null)
Dependency=[SCRIPT]
           [default.source_tbl(src):FieldSchema(name:col_001, type:int, comment:null),
            default.source_tbl(src):FieldSchema(name:col_002, type:int, comment:null),
            default.source_tbl(src):FieldSchema(name:col_003, type:int, comment:null),
            default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE, type:bigint, comment:),
            default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME, type:string, comment:),
            default.source_tbl(src):FieldSchema(name:ROW__ID, type:struct<transactionId:bigint,bucketId:int,rowId:bigint>, comment:)
           ];

DependencyKey=default.target_tbl:FieldSchema(name:col_003, type:int, comment:null)
Dependency=[SCRIPT]
           [default.source_tbl(src):FieldSchema(name:col_001, type:int, comment:null),
            default.source_tbl(src):FieldSchema(name:col_002, type:int, comment:null),
            default.source_tbl(src):FieldSchema(name:col_003, type:int, comment:null),
            default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE, type:bigint, comment:),
            default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME, type:string, comment:),
            default.source_tbl(src):FieldSchema(name:ROW__ID, type:struct<transactionId:bigint,bucketId:int,rowId:bigint>, comment:)
           ];
{noformat}


When the INSERT statement does not include "ROW_NUMBER() OVER() AS r_num", the 
lineage details look correct.
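
For comparison, the variant without the window function might look like the 
statement below. This is a sketch that assumes the only change is dropping the 
r_num column from the subquery; the exact statement used for the comparison is 
not recorded here. With it, each target column would be expected to depend only 
on the same-named source column.

{noformat}
-- Assumed comparison query: same tables, no ROW_NUMBER() OVER() in the subquery
INSERT INTO target_tbl
SELECT v1.col_001, v1.col_002, v1.col_003
FROM (SELECT col_001, col_002, col_003
      FROM source_tbl) v1;
{noformat}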




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
