jscheffl commented on issue #37810: URL: https://github.com/apache/airflow/issues/37810#issuecomment-1979621553
> I like the idea. How would this work if the task writes to more than one dataset though?

I believe it could be an option, as an extension, to also let the author pick which XCom (as an alternative output from the task) is used to fill the extra. That might be another increment, or, if there is concrete demand, it could also be made right here.

> Another thing I’ve been thinking is to give XCom a dataset URI so we can track lineage of its values (also tying back to the read/write to XCom via Object Store idea). This raises a question, what should we do if we want to use XCom for both the “actual” data, if it is already used for extra?

I understand the idea of XCom with a dataset URI. But would this URI refer to a specific DAG run, or to the abstract "last" run? One would be a "moving target" and the other would be "a dataset URI per run", i.e. many, many URIs to track... or do I misunderstand? Can you give an example?

> Eventually what I think we should do is to provide some sort of “output management” mechanism that generalises XCom—if XCom is a kind of dataset, its metadata is conceptually just automatically populated dataset metadata. So the return value should still be the actual data we want to write (with where and how the data is stored being customisable), and downstream tasks depend on, and metadata should be provided by another way. I’m not entirely sure how the end result should look, or how to smoothly transition toward it.

When you ask this question, I understand that this would add a new, complex area of XCom management and data flow. At the moment XCom is quite simple: a key/value pair used to pass data. It does not conform to a schema (e.g. JSON validation or a pydantic model) and can be of any type. I don't see XCom as being a dataset per se; it is just a data fragment passed as output for some other input, within a DAG, between tasks. The core idea is: if it can be used between tasks, why not use the same facility between DAGs when they are triggered by data?
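To make the "pick which XCom fills the extra" idea concrete, here is a minimal sketch in plain Python, deliberately independent of Airflow. The `DatasetEvent` class and the `emit_dataset_event` function (with its `extra_from_xcom` parameter) are hypothetical names modeling the proposal, not existing Airflow APIs:

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class DatasetEvent:
    """Simplified stand-in for a dataset event carried to downstream DAGs."""
    uri: str
    extra: dict = field(default_factory=dict)


def emit_dataset_event(
    uri: str,
    xcoms: dict,
    extra_from_xcom: Optional[str] = None,
) -> DatasetEvent:
    """Fire a dataset event, optionally copying one XCom value into `extra`.

    `extra_from_xcom` is the hypothetical knob discussed above: when a task
    pushes several XComs, it selects WHICH key (not necessarily the default
    `return_value`) populates the event's extra.
    """
    extra: dict = {}
    if extra_from_xcom is not None:
        extra = {"xcom": xcoms[extra_from_xcom]}
    return DatasetEvent(uri=uri, extra=extra)


# A task that pushed two XComs: the default return value and a custom key.
xcoms = {"return_value": [1, 2, 3], "row_count": 3}

# The DAG author picks the XCom that should travel with the event.
event = emit_dataset_event("s3://bucket/table", xcoms, extra_from_xcom="row_count")
print(event.extra)  # {'xcom': 3}
```

Without `extra_from_xcom` the event would carry an empty extra, preserving today's behaviour as the default.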
Still, schema checks and other features for XCom could be added independently of the dataset trigger mechanism, also for regular use between tasks within a DAG. I see that as independent of this concept. The DAG author also has full freedom (if there is a need to manage the data structure) to add a Python task that proxies the XCom information or restructures it before the dataset is triggered. I believe there is no critical need for complexity here that another Python task in the workflow cannot also address.
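The "proxy task" workaround mentioned above can be sketched in plain Python. The `TablePayload` schema and the `restructure_xcom` function are hypothetical illustrations of what the body of such a task could do (in practice one might use a pydantic model instead of a dataclass for validation):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TablePayload:
    """Hypothetical stable structure a downstream, dataset-triggered DAG expects."""
    table: str
    row_count: int


def restructure_xcom(raw: dict) -> TablePayload:
    """Proxy-task body: validate and reshape the upstream XCom before the
    dataset is triggered, so consumers see a predictable structure instead
    of an arbitrary, schema-less key/value payload."""
    if not isinstance(raw.get("row_count"), int):
        raise TypeError("row_count must be an int")
    # Drop any keys the downstream consumer does not care about.
    return TablePayload(table=raw["table"], row_count=raw["row_count"])


payload = restructure_xcom({"table": "sales", "row_count": 42, "debug": "drop me"})
print(payload)  # TablePayload(table='sales', row_count=42)
```

Inserting such a task between the producer and the dataset outlet keeps the trigger mechanism itself simple while still giving authors schema guarantees where they need them.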
