jscheffl commented on issue #37810:
URL: https://github.com/apache/airflow/issues/37810#issuecomment-1979621553

   > I like the idea. How would this work if the task writes to more than one 
dataset though?
   
   I believe an option, as an extension, would be to let the user pick which XCom output of the task is used to fill the extra. That might be a later increment, or, if there is concrete demand, it could also be added right here.
   
   > Another thing I’ve been thinking is to give XCom a dataset URI so we can 
track lineage of its values (also tying back to the read/write to XCom via 
Object Store idea). This raises a question: what should we do if we want to use 
XCom for the “actual” data, if it is already used for extra?
   
   I understand the idea of an XCom with a dataset URI. But would this URI refer to a specific DAG run, or to the abstract "last" run? One would be a "moving target" and the other would mean a dataset URI per run, i.e. many, many URIs to track... or do I misunderstand? Can you give an example?
   
   > Eventually what I think we should do is to provide some sort of “output 
management” mechanism that generalises XCom—if XCom is a kind of dataset, its 
metadata is conceptually just automatically populated dataset metadata. So the 
return value should still be the actual data we want to write (with where and 
how the data is stored being customisable), and downstream tasks depend on, and 
metadata should be provided by another way. I’m not entirely sure what the end 
result should look like, or how to smoothly transition toward it.
   
   Reading your question, I understand this would add a new, complex area of XCom management and data flow. At the moment XCom is quite simple: it is used as a key/value pair to pass data. It does not conform to a schema (e.g. JSON Schema validation or a Pydantic model) and can be of any type.
   I don't see XCom as being a dataset per se; it is just a data fragment passed as output for some other input, within a DAG between tasks. The core idea is: if it can be used between tasks, why not use the same facility between DAGs when a dataset trigger fires?
   Still, schema checks and other features for XCom could be added independently of the dataset trigger mechanism, also for regular use between tasks in a DAG. I see that as independent of this concept.
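   As a sketch of what such an independent schema check could look like, a DAG author can already validate an XCom payload in plain Python inside any task, before the value is passed on. The payload shape and field names below are purely hypothetical, for illustration only:

   ```python
   # Hypothetical sketch: validate an XCom payload against an expected
   # shape in plain Python, independent of any dataset trigger mechanism.
   def validate_xcom_payload(payload):
       """Return the payload if it matches the expected shape, else raise."""
       if not isinstance(payload, dict):
           raise ValueError(f"expected dict, got {type(payload).__name__}")
       # Hypothetical required fields and their expected types.
       required = {"rows_written": int, "partition": str}
       for key, expected_type in required.items():
           if key not in payload:
               raise ValueError(f"missing required key: {key!r}")
           if not isinstance(payload[key], expected_type):
               raise ValueError(
                   f"key {key!r} must be {expected_type.__name__}, "
                   f"got {type(payload[key]).__name__}"
               )
       return payload
   ```

   Such a check stays orthogonal to whether the value later ends up as a dataset-event extra.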
   Also, the DAG author has full freedom (if there is a need to manage the data structure or similar) to add a PythonOperator task that proxies or restructures the XCom information before the dataset is triggered. I believe there is no critical need for added complexity that another Python task in the workflow cannot also solve.
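   A minimal sketch of such a proxy step, as a plain Python callable that a PythonOperator could wrap; the upstream XCom shape and the field names of the resulting extra are assumptions for illustration:

   ```python
   # Hypothetical sketch: re-shape a raw upstream XCom value into the flat
   # key/value structure we would want to expose as a dataset-event extra.
   def build_dataset_extra(raw_xcom):
       """Pick and rename the fields of interest from an upstream XCom dict."""
       stats = raw_xcom.get("stats", {})  # assumed nested structure
       return {
           "row_count": stats.get("rows", 0),
           "source_task": raw_xcom.get("task_id", "unknown"),
       }
   ```

   In a DAG, this function would sit between the producing task and the task carrying the dataset outlet, so the restructuring stays ordinary workflow code rather than new core machinery.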


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
