[jira] [Commented] (ATLAS-4746) hive_process and hive_process_execution (lineage) being generated for simple DML UPDATE queries run via hive

2023-08-03 Thread ASF subversion and git services (Jira)


[ https://issues.apache.org/jira/browse/ATLAS-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750821#comment-17750821 ]

ASF subversion and git services commented on ATLAS-4746:


Commit cd1c3d02694fe60a749f0597580dd43808e3961f in atlas's branch 
refs/heads/branch-2.0 from Radhika Kundam
[ https://gitbox.apache.org/repos/asf?p=atlas.git;h=cd1c3d026 ]

ATLAS-4746: hive_process and hive_process_execution (lineage) being generated 
for simple DML UPDATE queries run via hive

Signed-off-by: radhikakundam 
(cherry picked from commit 1443f25a46784dfa63892daa74744b19b5c2a71e)


> hive_process and hive_process_execution (lineage) being generated for simple 
> DML UPDATE queries run via hive
> 
>
> Key: ATLAS-4746
> URL: https://issues.apache.org/jira/browse/ATLAS-4746
> Project: Atlas
> Issue Type: Bug
> Components: atlas-core
> Affects Versions: 2.3.0
> Reporter: Umesh Padashetty
> Assignee: Radhika Kundam
> Priority: Critical
> Attachments: Screenshot 2023-05-02 at 6.28.18 PM.png, Screenshot 
> 2023-05-02 at 6.28.34 PM.png, Screenshot 2023-05-02 at 6.29.05 PM.png, 
> Screenshot 2023-05-02 at 6.29.10 PM.png, Screenshot 2023-05-02 at 6.47.19 
> PM.png, Screenshot 2023-05-02 at 6.50.16 PM.png
>
>
> Queries run:
> {code:java}
> create table test_hive_lineage_4 (name string, id int) stored as orc;
> insert into test_hive_lineage_4 values ('qwer', '2');
> update test_hive_lineage_4 set name = 'vwxy' where id = 2; {code}
> Apart from the initial CREATE TABLE, these are simple DML queries, not DDL.
> We should NOT be tracking lineage for any of the DML (*SELECT, INSERT, 
> DELETE, and UPDATE*) queries, nor should we be tracking their audits.
> Jiras via which audits of DML operations were skipped:
>  * https://issues.apache.org/jira/browse/ATLAS-3188
>  * https://issues.apache.org/jira/browse/ATLAS-3198
> However, those issues concerned audits, not lineage. In all of those cases, 
> no lineage was generated for the DML UPDATE query.
> We are now observing that lineage is captured for a simple DML UPDATE query.
> Relationships after running:
> {code:java}
> create table test_hive_lineage_4 (name string, id int) stored as orc; {code}
> !Screenshot 2023-05-02 at 6.28.18 PM.png!
> !Screenshot 2023-05-02 at 6.28.34 PM.png!
> As seen, there is no lineage generated. Good so far 
> Then I ran
> {code:java}
> insert into test_hive_lineage_4 values ('qwer', '2'); {code}
> No lineage was generated. Good so far 
> !Screenshot 2023-05-02 at 6.47.19 PM.png!
> Then I ran 
> {code:java}
> update test_hive_lineage_4 set name = 'vwxy' where id = 2;  {code}
> This immediately generated a hive_process and a hive_process_execution.
> Interestingly, the generated hive_process has the following name. It 
> contains DELETE, although the query was actually an UPDATE DML. Another 
> cause for concern?
> {code:java}
> QUERY:default.test_hive_lineage_4@cm:1683032252000->:DELETE:default.test_hive_lineage_4@cm:1683032252000
>  {code}
> !Screenshot 2023-05-02 at 6.29.05 PM.png!
>   !Screenshot 2023-05-02 at 6.29.10 PM.png!
> I then ran the same update query 100+ times, and it created 100+ UNIQUE 
> (timestamp-delimited) hive_process_executions.
> !Screenshot 2023-05-02 at 6.50.16 PM.png!
> This is a disaster, since every UPDATE query now generates a 
> process_execution. 
> Customers can run 1000s of update queries, which are mostly of no use to 
> Atlas, yet this issue now leads to the generation of 1000s of 
> process_executions.
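
The process name quoted in the report can be taken apart to show why it looks
suspicious: the same table appears on both sides of the arrow, and the write
type is DELETE. The following is a minimal sketch only. The layout
"<operation>:<input>:<time>->:<write type>:<output>:<time>" is inferred from
that single sample name, and the helpers parse_process_name and is_self_write
are hypothetical names used for illustration. They are not part of Atlas or of
the commit referenced above, and the pattern handles only the one-input,
one-output shape seen here.

```python
import re

# Layout inferred from the one sample name in the report (an assumption):
#   <operation>:<input table>:<create time>->:<write type>:<output table>:<create time>
PROCESS_NAME = re.compile(
    r"^(?P<operation>[A-Z_]+):(?P<in_table>[^:]+):(?P<in_time>\d+)"
    r"->:(?P<write_type>[A-Z_]+):(?P<out_table>[^:]+):(?P<out_time>\d+)$"
)


def parse_process_name(name):
    """Split a single-input, single-output process name into its fields."""
    match = PROCESS_NAME.match(name.strip())
    if match is None:
        raise ValueError("unrecognised process name: %r" % name)
    return match.groupdict()


def is_self_write(parsed):
    """True when the process reads from and writes to the same table."""
    return (
        parsed["in_table"] == parsed["out_table"]
        and parsed["in_time"] == parsed["out_time"]
    )


if __name__ == "__main__":
    sample = (
        "QUERY:default.test_hive_lineage_4@cm:1683032252000"
        "->:DELETE:default.test_hive_lineage_4@cm:1683032252000"
    )
    fields = parse_process_name(sample)
    print(fields["operation"], fields["write_type"], is_self_write(fields))
```

A process whose only input and only output are the same table carries no
table-to-table lineage, which is one way to recognise the UPDATE case described
in the report. Whether the actual fix uses such a check is not stated in this
thread.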



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ATLAS-4746) hive_process and hive_process_execution (lineage) being generated for simple DML UPDATE queries run via hive

2023-08-03 Thread ASF subversion and git services (Jira)


[ https://issues.apache.org/jira/browse/ATLAS-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750820#comment-17750820 ]

ASF subversion and git services commented on ATLAS-4746:


Commit 1443f25a46784dfa63892daa74744b19b5c2a71e in atlas's branch 
refs/heads/master from Radhika Kundam
[ https://gitbox.apache.org/repos/asf?p=atlas.git;h=1443f25a4 ]

ATLAS-4746: hive_process and hive_process_execution (lineage) being generated 
for simple DML UPDATE queries run via hive

Signed-off-by: radhikakundam 




