error418 commented on issue #35297:
URL: https://github.com/apache/airflow/issues/35297#issuecomment-1843592068
Thank you for the clarifications, @blag.
I indeed forgot to mention the use case that led me to this issue:
```mermaid
flowchart TD
ds[Dataset]
s3[(S3)]
subgraph dag1[DAG A]
task1[Task 1]
end
subgraph dag2[DAG B]
task2[Task 2]
end
task1 -- outlets --> ds
task1 -- writes --> s3
s3 --> task2
ds -.- s3
ds -- triggers --> dag2
```
After reading the docs about data-aware scheduling, I thought it would be the
perfect fit to orchestrate tasks around an S3 data lake workflow.
The S3 keys are organized like so, for example:
```
/path/to/dataset/2020-01-01.avro
/path/to/dataset/2020-01-02.avro
/path/to/dataset/2020-01-03.avro
...
```
`Task 1` processes information, writes a result file to the S3 bucket using
the path schema above, and emits a Dataset via its outlets.
The Airflow Dataset would have the URI `s3://path/to/dataset`, with an extra
key `s3_key` pointing to the newly written file.
`DAG B` listens for Datasets with the URI `s3://path/to/dataset`, is triggered,
and runs `Task 2`, which reads the S3 key from `extra` and retrieves the file
using an S3 Hook.
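To make the proposal concrete, the payload `Task 2` would receive can be modeled as plain data. This is only a sketch of the behavior I am asking for, not an existing Airflow API: per-event `extra` on a triggering Dataset is exactly the gap described here, and the `s3_key` field name is my own.

```python
# Hypothetical dataset event as Task 2 would see it: the uri identifies
# the Dataset, and the proposed per-event "extra" carries the new key.
triggering_event = {
    "uri": "s3://path/to/dataset",
    "extra": {"s3_key": "/path/to/dataset/2020-01-03.avro"},
}

def resolve_new_file(event: dict) -> str:
    """Read the S3 key of the newly written file from the event's extra."""
    return event["extra"]["s3_key"]
```

`Task 2` would then hand that single key to an S3 Hook to fetch exactly the file that changed, instead of having to list and diff the whole `s3://path/to/dataset` prefix.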
In my opinion this would be a nice feature for data-aware scheduling: it would
let users pass more specific information about what has changed in a dataset,
so that consuming tasks can react accordingly.
This issue might then be more of a feature request than a bug. What is your
opinion on this, @blag?