pawel-big-lebowski commented on PR #26089: URL: https://github.com/apache/flink/pull/26089#issuecomment-2626542070
Thank you @davidradl for your attention and comments provided. > I see that the document pointed to, has some side effects - are these still the case, if so do we need to document them? The linked google doc describes in details motivation behind this PR and proves column level lineage metadata can be obtained with this change. It also contains my concern as this change creates another instance of `JobStatusChangedListener`. Within `TableEnvironmentImpl` I was not able to reach the listeners from executors (https://github.com/apache/flink/pull/24754/ -> this PR introduced the listeners). I included a comment within the code. > I would like to see some tests showing the column lineage is produced as expected for different topologies. This PR provides an ability to enrich existing lineage metadata collected. The enrichment can be done only through jobId. That allows correlating column level lineage extracted from Calcite RelNode's with lineage from DefaultJobCreatedEvent, as both events share same jobId. Within this change, job status listeners are notified on all the `execEnv.executeAsync` calls, when job client is created. I don't think it's possible to test all the possible TableEnvironmentImpl method calls or all the SQLs possible. However, this change notifies the listener always when having jobId + queryOperation. > I see that spark implements CLL at the optimization level. Are we doing CCL on the optimized logical plan? If not what is the thinking around this? Although it may be useful to extract column level lineage from CompiledPlan, as it is fully optimized and one can directly execute the plan, I was not able to find the proper abstractions within CompiledPlan to achieve this. I found Calcite's RelNode enables CLL extraction and the same approach has been chosen for another project - flink-sql-lineage - https://github.com/HamaWhiteGG/flink-sql-lineage. I think there should be in future another `CompiledPlanEvent` that notifies the job listener as well. In this case, it will be the listener to decide whether to extract lineage metadata from RelNodes, CompiledPlan, etc. However, at the moment I didn't find the way to connect `CompiledPlan` with the JobId. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
