iodone commented on PR #3185:
URL: https://github.com/apache/incubator-kyuubi/pull/3185#issuecomment-1216529596

   Two approaches are currently implemented:
   1. Spark SQL Engine: run `set kyuubi.operation.lineage.enabled = true` to 
enable lineage logging. The lineage information is added to 
`SparkOperationEvent` and written to the operation event JSON logger with it. 
   ```json
   {
     "Event": "org.apache.kyuubi.engine.spark.events.SparkOperationEvent",
     "statementId": "ea7b2d7a-1301-4f82-aa79-797ccab43d4a",
     "statement": "select a as col0, b as col1 from test_table0;",
     "shouldRunAsync": false,
     "state": "FINISHED_STATE",
     "eventTime": 1660650146416,
     "createTime": 1660650145261,
     "startTime": 1660650145311,
     "completeTime": 1660650146415,
     "exception": null,
     "sessionId": "e8f06b0f-5c5c-40ef-90de-c77fb4432d93",
     "sessionUser": "work",
     "executionId": 3,
     "lineage": {
       "inputTables": ["default.test_table0"],
       "outputTables": [],
       "columnLineage": [
         ["col0", ["default.test_table0.a"]],
         ["col1", ["default.test_table0.b"]]
       ]
     },
     "eventType": "spark_operation"
   }
   ```
   
   2. Spark listener extension plugin: it is not tightly coupled to the Kyuubi 
engine and can also be used with plain Spark. It writes its own independent 
lineage events, also through a JSON logger.
   ```json
   {
     "executionId": 2,
     "eventTime": 1660650415849,
     "lineage": {
       "inputTables": ["default.test_table0"],
       "outputTables": [],
       "columnLineage": [
         ["col0", ["default.test_table0.a"]],
         ["col1", ["default.test_table0.b"]]
       ]
     },
     "exception": null,
     "eventType": "operation_lineage"
   }
   ```
   
   With approach 1, the lineage information is combined with the SQL event 
information, which makes collection and statistics easier.
   With approach 2, the SQL operation event logger and the lineage event logger 
are independent of each other, and their records can only be associated by 
`executionId`. After collection, we need an extra correlation step.
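
   To illustrate the correlation step that approach 2 would need, here is a 
minimal sketch, not code from this PR. It assumes each logger writes one JSON 
object per line and that `executionId` is unique within the logs being joined 
(for example, logs from a single Spark application). The function name 
`correlate_by_execution_id` and the sample records are hypothetical.

```python
import json


def correlate_by_execution_id(operation_lines, lineage_lines):
    """Attach each lineage record to the operation event with the same executionId.

    Assumption: every input line is one complete JSON object.
    """
    # Index lineage payloads by executionId for constant-time lookup.
    lineage_by_id = {}
    for line in lineage_lines:
        event = json.loads(line)
        lineage_by_id[event["executionId"]] = event.get("lineage")

    merged = []
    for line in operation_lines:
        event = json.loads(line)
        combined = dict(event)  # shallow copy, the parsed event is left untouched
        # "lineage" is None when no lineage event shares this executionId.
        combined["lineage"] = lineage_by_id.get(event.get("executionId"))
        merged.append(combined)
    return merged


# Hypothetical sample records, shaped like the events shown above.
operation_log = [
    '{"statementId": "stmt-1", "executionId": 2, "state": "FINISHED_STATE"}',
    '{"statementId": "stmt-2", "executionId": 5, "state": "FINISHED_STATE"}',
]
lineage_log = [
    '{"executionId": 2, "eventType": "operation_lineage", '
    '"lineage": {"inputTables": ["default.test_table0"], "outputTables": [], '
    '"columnLineage": [["col0", ["default.test_table0.a"]]]}}',
]

merged_events = correlate_by_execution_id(operation_log, lineage_log)
for merged_event in merged_events:
    print(merged_event["statementId"], merged_event["lineage"])
```

   With approach 1 this join is unnecessary, because the `lineage` field 
already sits inside the operation event.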
   
   Do we need to keep both approaches? @yaooqinn @ulysses-you 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

