[ https://issues.apache.org/jira/browse/ATLAS-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16648477#comment-16648477 ]
ASF subversion and git services commented on ATLAS-2891:
--------------------------------------------------------

Commit de172af372d84a34076f05022ddac43e82182444 in atlas's branch refs/heads/branch-1.0 from [~mad...@apache.org]
[ https://git-wip-us.apache.org/repos/asf?p=atlas.git;h=de172af ]

ATLAS-2891: updated hook notification processing with option to ignore potentially incorrect hive_column_lineage - #3

(cherry picked from commit 82e04037220a82ab985928221bc72ade80fc3be2)

> Incorrect column lineage: each output column has input from *all columns* of
> the input table
> --------------------------------------------------------------------------------------------
>
>                 Key: ATLAS-2891
>                 URL: https://issues.apache.org/jira/browse/ATLAS-2891
>             Project: Atlas
>          Issue Type: Bug
>          Components: atlas-intg
>    Affects Versions: 0.8.2
>            Reporter: Madhan Neethiraj
>            Assignee: Madhan Neethiraj
>            Priority: Critical
>             Fix For: 0.8.3
>
>         Attachments: ATLAS-2891-branch-0.8.patch, ATLAS-2891.png
>
>
> Column lineage generated by the Atlas Hive hook is incorrect for certain
> queries, such as the following INSERT:
> {noformat}
> CREATE TABLE source_tbl(col_001 INT, col_002 INT, col_003 INT);
> CREATE TABLE target_tbl(col_001 INT, col_002 INT, col_003 INT);
> INSERT INTO target_tbl SELECT v1.col_001, v1.col_002, v1.col_003 FROM (SELECT
> col_001, col_002, col_003, ROW_NUMBER() OVER() AS r_num FROM source_tbl) v1;
> {noformat}
> In this case, lineage for each column in target_tbl shows input from all
> columns in source_tbl. The lineage information provided to post hooks (like
> the Atlas hook) contains 3 entries, one for each column in target_tbl. Note
> that the dependency for each column lists all columns of source_tbl.
> {noformat}
> DependencyKey=default.target_tbl:FieldSchema(name:col_001, type:int,
> comment:null)
> Dependency=[SCRIPT]
> [default.source_tbl(src):FieldSchema(name:col_001, type:int,
> comment:null),
> default.source_tbl(src):FieldSchema(name:col_002, type:int,
> comment:null),
> default.source_tbl(src):FieldSchema(name:col_003, type:int,
> comment:null),
> default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE,
> type:bigint, comment:),
> default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME,
> type:string, comment:),
> default.source_tbl(src):FieldSchema(name:ROW__ID,
> type:struct<transactionId:bigint,bucketId:int,rowId:bigint>, comment:)
> ];
>
> DependencyKey=default.target_tbl:FieldSchema(name:col_002, type:int,
> comment:null)
> Dependency=[SCRIPT]
> [default.source_tbl(src):FieldSchema(name:col_001, type:int,
> comment:null),
> default.source_tbl(src):FieldSchema(name:col_002, type:int,
> comment:null),
> default.source_tbl(src):FieldSchema(name:col_003, type:int,
> comment:null),
> default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE,
> type:bigint, comment:),
> default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME,
> type:string, comment:),
> default.source_tbl(src):FieldSchema(name:ROW__ID,
> type:struct<transactionId:bigint,bucketId:int,rowId:bigint>, comment:)
> ];
>
> DependencyKey=default.target_tbl:FieldSchema(name:col_003, type:int,
> comment:null)
> Dependency=[SCRIPT]
> [default.source_tbl(src):FieldSchema(name:col_001, type:int,
> comment:null),
> default.source_tbl(src):FieldSchema(name:col_002, type:int,
> comment:null),
> default.source_tbl(src):FieldSchema(name:col_003, type:int,
> comment:null),
> default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE,
> type:bigint, comment:),
> default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME,
> type:string, comment:),
> default.source_tbl(src):FieldSchema(name:ROW__ID,
> type:struct<transactionId:bigint,bucketId:int,rowId:bigint>,
> comment:)
> ];
> {noformat}
> When the INSERT statement doesn't include "ROW_NUMBER() OVER() AS r_num", the
> lineage details look correct.
> This issue is seen in Hive version 1, but not in Hive2 or Hive3.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
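
The symptom described in the issue (every output column listing the same full set of input columns) can be checked mechanically. The following is only a minimal sketch of such a check, not the logic of the actual Atlas commit: the function names, the dictionary layout and the "db.table.column" naming are assumptions made for illustration. Hive virtual columns are dropped before comparing, since they appear in every dependency list in the dump above.

```python
from typing import Dict, FrozenSet, Set

# Hive virtual columns seen in the dependency dump in the issue description.
VIRTUAL_COLUMNS = {"BLOCK__OFFSET__INSIDE__FILE", "INPUT__FILE__NAME", "ROW__ID"}


def real_inputs(inputs: Set[str]) -> FrozenSet[str]:
    """Drop virtual columns; names are assumed to look like 'db.table.column'."""
    return frozenset(c for c in inputs if c.rsplit(".", 1)[-1] not in VIRTUAL_COLUMNS)


def looks_like_all_to_all(lineage: Dict[str, Set[str]], min_outputs: int = 2) -> bool:
    """Hypothetical heuristic: True when every output column reports the
    identical input set and that set has more than one column."""
    if len(lineage) < min_outputs:
        return False
    input_sets = [real_inputs(inputs) for inputs in lineage.values()]
    first = input_sets[0]
    return len(first) > 1 and all(s == first for s in input_sets)


SOURCE_COLUMNS = {f"default.source_tbl.col_00{i}" for i in (1, 2, 3)}

# Shape of the lineage reported for the INSERT with ROW_NUMBER() OVER().
reported = {
    f"default.target_tbl.col_00{i}": SOURCE_COLUMNS | {"default.source_tbl.ROW__ID"}
    for i in (1, 2, 3)
}

# Shape of the lineage one would expect: each column maps to its own source.
expected = {
    f"default.target_tbl.col_00{i}": {f"default.source_tbl.col_00{i}"}
    for i in (1, 2, 3)
}

print(looks_like_all_to_all(reported))
print(looks_like_all_to_all(expected))
```

A check like this can only mark lineage as potentially incorrect: a query in which every output column really is computed from all input columns would be flagged as well, which is presumably why the commit describes the behavior as an option.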