Vladislav Glinskiy created ATLAS-3655:
-----------------------------------------

             Summary: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations
                 Key: ATLAS-3655
                 URL: https://issues.apache.org/jira/browse/ATLAS-3655
             Project: Atlas
          Issue Type: Task
            Reporter: Vladislav Glinskiy
             Fix For: 2.1.0, 3.0.0
         Attachments: Screenshot from 2020-03-03 16-09-39.png

Create a 'spark_application' type to prevent 'spark_process' from being updated for multiple operations.

Currently, the Spark Atlas Connector uses 'spark_process' as the top-level type for a Spark session, so the same entity is updated for every operation within that session.

The following statements:
{code:java}
spark.sql("create table table_1(col1 int,col2 string)");
spark.sql("create table table_2 as select * from table_1");
{code}
result in the following correct lineage:

table_1 ------> spark_process1 -------> table_2

but executing similar statements in the same Spark session:
{code:java}
spark.sql("create table table_3(col1 int,col2 string)");
spark.sql("create table table_4 as select * from table_3");
{code}
results in the same 'spark_process' being updated, and the lineage now connects all four tables (see the screenshot in the attachments).

The proposal is to create a 'spark_application' entity and associate all 'spark_process' entities created within that session with it.
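
The difference between the current behaviour and the proposal can be illustrated with a small in-memory sketch. It does not use the Atlas or Spark Atlas Connector APIs: the classes `SparkProcess` and `SparkApplication`, the method `recordOperation`, and the session id `session-1` are hypothetical stand-ins chosen for illustration, and the way the connector actually builds entities may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class LineageSketch {

    /** Hypothetical stand-in for a 'spark_process' entity: a lineage node with inputs and outputs. */
    static final class SparkProcess {
        final String name;
        final Set<String> inputs = new LinkedHashSet<>();
        final Set<String> outputs = new LinkedHashSet<>();

        SparkProcess(String name) {
            this.name = name;
        }
    }

    /** Hypothetical stand-in for the proposed 'spark_application' entity (one per Spark session). */
    static final class SparkApplication {
        final String sessionId;
        final List<SparkProcess> processes = new ArrayList<>();

        SparkApplication(String sessionId) {
            this.sessionId = sessionId;
        }

        /** Creates a separate process for each operation instead of updating a shared one. */
        SparkProcess recordOperation(String input, String output) {
            SparkProcess process = new SparkProcess("spark_process" + (processes.size() + 1));
            process.inputs.add(input);
            process.outputs.add(output);
            processes.add(process);
            return process;
        }
    }

    public static void main(String[] args) {
        // Current behaviour: a single process per session accumulates every operation,
        // so table_1 and table_3 both appear upstream of table_2 and table_4.
        SparkProcess shared = new SparkProcess("spark_process1");
        shared.inputs.add("table_1");
        shared.outputs.add("table_2");
        shared.inputs.add("table_3");
        shared.outputs.add("table_4");
        System.out.println(shared.inputs + " -> " + shared.name + " -> " + shared.outputs);

        // Proposed behaviour: the application groups one process per operation,
        // so each process keeps only its own inputs and outputs.
        SparkApplication app = new SparkApplication("session-1");
        app.recordOperation("table_1", "table_2");
        app.recordOperation("table_3", "table_4");
        for (SparkProcess process : app.processes) {
            System.out.println(process.inputs + " -> " + process.name + " -> " + process.outputs);
        }
    }
}
```

In this sketch the shared process ends up linking all four tables, while the per-operation processes keep table_1/table_2 and table_3/table_4 in separate lineage chains that are still grouped under one application.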