Vladislav Glinskiy created ATLAS-3655:
-----------------------------------------

             Summary: Create 'spark_application' type to avoid 'spark_process' 
from being updated for multiple operations
                 Key: ATLAS-3655
                 URL: https://issues.apache.org/jira/browse/ATLAS-3655
             Project: Atlas
          Issue Type: Task
            Reporter: Vladislav Glinskiy
             Fix For: 2.1.0, 3.0.0
         Attachments: Screenshot from 2020-03-03 16-09-39.png

Create a 'spark_application' type to prevent 'spark_process' from being updated
for multiple operations. Currently, the Spark Atlas Connector uses 'spark_process'
as the top-level type for a Spark session, so the same entity is updated for
every operation within that session.

The following statements:
{code:java}
spark.sql("create table table_1(col1 int,col2 string)");
spark.sql("create table table_2 as select * from table_1");
{code}
produce the following correct lineage:

table_1 ------> spark_process1 -------> table_2

However, executing similar statements in the same Spark session:
{code:java}
spark.sql("create table table_3(col1 int,col2 string)"); 
spark.sql("create table table_4 as select * from table_3");
{code}
results in the same 'spark_process' entity being updated, so the lineage now
connects all four tables (see the screenshot in the attachments).
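
To make the effect concrete, here is a minimal in-memory sketch of why a single
shared process entity merges the lineage. The class and method names
(SharedProcessDemo, ProcessEntity, record, connects) are hypothetical and are
not part of the Atlas API or the connector; the sketch only assumes that each
update adds inputs and outputs to the same entity.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model for illustration only; NOT the Atlas API.
public class SharedProcessDemo {

    // One process entity holding the lineage inputs and outputs recorded on it.
    static final class ProcessEntity {
        final String name;
        final Set<String> inputs = new LinkedHashSet<>();
        final Set<String> outputs = new LinkedHashSet<>();

        ProcessEntity(String name) {
            this.name = name;
        }

        // Updating the same entity accumulates inputs and outputs.
        void record(String input, String output) {
            inputs.add(input);
            outputs.add(output);
        }

        // True if the graph has a path: input -> this process -> output.
        boolean connects(String input, String output) {
            return inputs.contains(input) && outputs.contains(output);
        }
    }

    public static void main(String[] args) {
        // Assumption: one entity per Spark session, as described above.
        ProcessEntity sessionProcess = new ProcessEntity("spark_process1");
        sessionProcess.record("table_1", "table_2"); // first CTAS
        sessionProcess.record("table_3", "table_4"); // second CTAS, same session

        // table_1 never fed table_4, yet the shared entity links them.
        System.out.println("table_1 -> table_4 linked: "
                + sessionProcess.connects("table_1", "table_4"));
    }
}
```

In this model every input of the shared entity appears upstream of every
output, which matches the merged lineage described above.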

The proposal is to create a 'spark_application' entity for the session and
associate all 'spark_process' entities created within that session with it.
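
A rough sketch of the proposed shape, assuming one 'spark_process' per
operation grouped under one 'spark_application' per session. All names here
(SparkApplicationDemo, SparkApplication, SparkProcess, newProcess, connects,
and the "session-1" identifier) are hypothetical illustrations, not the Atlas
type definitions or the connector's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only; NOT the Atlas API.
public class SparkApplicationDemo {

    // One process per operation, with exactly the tables it touched.
    static final class SparkProcess {
        final String name;
        final String input;
        final String output;

        SparkProcess(String name, String input, String output) {
            this.name = name;
            this.input = input;
            this.output = output;
        }
    }

    // One application per Spark session; it only groups its processes.
    static final class SparkApplication {
        final String sessionId;
        final List<SparkProcess> processes = new ArrayList<>();

        SparkApplication(String sessionId) {
            this.sessionId = sessionId;
        }

        // Each operation creates a new process instead of updating an old one.
        SparkProcess newProcess(String input, String output) {
            String name = "spark_process" + (processes.size() + 1);
            SparkProcess process = new SparkProcess(name, input, output);
            processes.add(process);
            return process;
        }

        // True only if a single process links input to output.
        boolean connects(String input, String output) {
            for (SparkProcess p : processes) {
                if (p.input.equals(input) && p.output.equals(output)) {
                    return true;
                }
            }
            return false;
        }
    }

    public static void main(String[] args) {
        SparkApplication app = new SparkApplication("session-1");
        app.newProcess("table_1", "table_2");
        app.newProcess("table_3", "table_4");

        System.out.println("processes: " + app.processes.size());
        System.out.println("table_1 -> table_2 linked: " + app.connects("table_1", "table_2"));
        System.out.println("table_1 -> table_4 linked: " + app.connects("table_1", "table_4"));
    }
}
```

Because the application entity only groups the processes and carries no inputs
or outputs of its own in this sketch, the two lineages stay separate while
remaining traceable to the same session.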



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
