Re: [PR] [FLINK-31275] extend lineage listener with QueryOperationEvent [flink]

via GitHub Fri, 31 Jan 2025 00:03:20 -0800


pawel-big-lebowski commented on PR #26089:
URL: https://github.com/apache/flink/pull/26089#issuecomment-2626542070


   Thank you @davidradl for your attention and comments provided.
   
   > I see that the document pointed to, has some side effects - are these 
still the case, if so do we need to document them?
   
   The linked google doc describes in details motivation behind this PR and 
proves column level lineage metadata can be obtained with this change. It also 
contains my concern as this change creates another instance of 
`JobStatusChangedListener`. Within `TableEnvironmentImpl` I was not able to 
reach the listeners from executors (https://github.com/apache/flink/pull/24754/ 
-> this PR introduced the listeners). I included a comment within the code. 
   
   > I would like to see some tests showing the column lineage is produced as 
expected for different topologies.
   
   This PR provides an ability to enrich existing lineage metadata collected. 
The enrichment can be done only through jobId. That allows correlating column 
level lineage extracted from Calcite RelNode's with lineage from 
DefaultJobCreatedEvent, as both events share same jobId. Within this change, 
job status listeners are notified on all the `execEnv.executeAsync` calls, when 
job client is created. 
   
   I don't think it's possible to test all the possible TableEnvironmentImpl 
method calls or all the SQLs possible. However, this change notifies the 
listener always when having jobId + queryOperation. 
   
   > I see that spark implements CLL at the optimization level. Are we doing 
CCL on the optimized logical plan? If not what is the thinking around this?
   
   Although it may be useful to extract column level lineage from CompiledPlan, 
as it is fully optimized and one can directly execute the plan, I was not able 
to find the proper abstractions within CompiledPlan to achieve this. I found 
Calcite's RelNode enables CLL extraction and the same approach has been chosen 
for another project - flink-sql-lineage - 
https://github.com/HamaWhiteGG/flink-sql-lineage. 
   
   I think there should be in future another `CompiledPlanEvent` that notifies 
the job listener as well. In this case, it will be the listener to decide 
whether to extract lineage metadata from RelNodes, CompiledPlan, etc. However, 
at the moment I didn't find the way to connect `CompiledPlan` with the JobId. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [FLINK-31275] extend lineage listener with QueryOperationEvent [flink]

Reply via email to