[
https://issues.apache.org/jira/browse/ATLAS-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Umesh Padashetty updated ATLAS-4746:
------------------------------------
Affects Version/s: 2.1.0
(was: 2.3.0)
> hive_process and hive_process_execution (lineage) being generated for simple
> DML UPDATE queries run via hive
> ------------------------------------------------------------------------------------------------------------
>
> Key: ATLAS-4746
> URL: https://issues.apache.org/jira/browse/ATLAS-4746
> Project: Atlas
> Issue Type: Bug
> Components: atlas-core
> Affects Versions: 2.1.0
> Reporter: Umesh Padashetty
> Priority: Critical
> Attachments: Screenshot 2023-05-02 at 6.28.18 PM.png, Screenshot
> 2023-05-02 at 6.28.34 PM.png, Screenshot 2023-05-02 at 6.29.05 PM.png,
> Screenshot 2023-05-02 at 6.29.10 PM.png, Screenshot 2023-05-02 at 6.47.19
> PM.png, Screenshot 2023-05-02 at 6.50.16 PM.png
>
>
> Queries ran:
> {code:java}
> create table test_hive_lineage_4 (name string, id int) stored as orc;
> insert into test_hive_lineage_4 values ('qwer', '2');
> update test_hive_lineage_4 set name = 'vwxy' where id = 2; {code}
> As you can see, these are simple DML queries, and not DDL
> We should NOT be tracking lineage for any of the DML ({*}SELECT, INSERT,
> DELETE, and UPDATE){*} queries NOR should we be tracking the audits.
> Jiras via which DML operations audits were skipped:
> * https://issues.apache.org/jira/browse/ATLAS-3188
> * https://issues.apache.org/jira/browse/ATLAS-3198
> But all the issues were related to audits and not the lineage. In all these
> cases, lineage was not generated for the DML UPDATE query
> But observing that we are now capturing lineage for simple DML Update query
> Relationship after running
> {code:java}
> create table test_hive_lineage_4 (name string, id int) stored as orc; {code}
> !Screenshot 2023-05-02 at 6.28.18 PM.png!
> !Screenshot 2023-05-02 at 6.28.34 PM.png!
> As seen, there is no lineage generated. Good so far
> Then I ran
> {code:java}
> insert into test_hive_lineage_4 values ('qwer', '2'); {code}
> No lineage was generated. Good so far
> !Screenshot 2023-05-02 at 6.47.19 PM.png!
> Then I ran
> {code:java}
> update test_hive_lineage_4 set name = 'vwxy' where id = 2; {code}
> This immediately generated a hive_process and a hive_process_execution
> Interestingly, hive_process with the following name was generated. As you can
> see, it has DELETE in the process name, when in reality this was an UPDATE
> DML. Another cause of concern?
> {code:java}
> QUERY:default.test_hive_lineage_4@cm:1683032252000->:DELETE:default.test_hive_lineage_4@cm:1683032252000
> {code}
> !Screenshot 2023-05-02 at 6.29.05 PM.png!
> !Screenshot 2023-05-02 at 6.29.10 PM.png!
> I then ran the same update query 100+ times, it created 100+ UNIQUE
> (timestamp delimited) hive_process_executions
> !Screenshot 2023-05-02 at 6.50.16 PM.png!
> This is a disaster since every UPDATE query now generates a
> process_execution.
> Customers can run 1000s of update queries, which are mostly of no use for
> atlas, but this issue is now leading to the generation of 1000s of
> process_executions
--
This message was sent by Atlassian Jira
(v8.20.10#820010)