error418 commented on issue #35297:
URL: https://github.com/apache/airflow/issues/35297#issuecomment-1843592068

   Thank you for the clarifications, @blag
   
   I indeed forgot to mention my use case which led me to this issue:
   
   ```mermaid
   flowchart TD
   
     ds[Dataset]
     s3[(S3)]
   
     subgraph dag1[DAG A]
       task1[Task 1]
     end
   
     subgraph dag2[DAG B]
       task2[Task 2]
     end
   
     task1 -- outlets --> ds
     task1 -- writes --> s3
     s3 --> task2
   
     ds -.- s3
     ds -- triggers --> dag2
   ```
   
   After reading the docs about data-aware scheduling, I thought it would be a 
perfect fit for orchestrating tasks around an S3 data lake workflow.
   
   The S3 keys are organized like this, for example:
   
   ```
   /path/to/dataset/2020-01-01.avro
   /path/to/dataset/2020-01-02.avro
   /path/to/dataset/2020-01-03.avro
   ...
   ```
   `Task 1` processes information, writes a result file to the S3 bucket using 
the path schema above, and emits a Dataset on its outlets.
   The Airflow Dataset would have the URI `s3://path/to/dataset`, with an extra 
key `s3_key` pointing to the newly written file.
   
   `DAG B` listens for Datasets with the URI `s3://path/to/dataset`, is 
triggered, and runs `Task 2`, which reads the S3 key from `extra` and retrieves 
the file using an S3Hook.
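   
   To make the intended flow concrete, here is a plain-Python sketch of the 
producer/consumer handoff I have in mind. It deliberately avoids Airflow 
imports, since per-event extras are exactly the part that does not exist yet; 
the function names `produce_event` and `consume_event` are hypothetical 
stand-ins for Task 1 and Task 2:
   
   ```python
   # Hypothetical sketch (NOT a real Airflow API): shows how a producing task
   # could attach per-event metadata that a consuming task then reads.
   from datetime import date
   
   DATASET_URI = "s3://path/to/dataset"
   
   def produce_event(day: date) -> dict:
       """Task 1: write the day's file and emit a dataset event carrying its key."""
       s3_key = f"/path/to/dataset/{day.isoformat()}.avro"
       # ... write the Avro result file to S3 under s3_key here ...
       # The event pairs the dataset URI with an `extra` dict naming the file.
       return {"uri": DATASET_URI, "extra": {"s3_key": s3_key}}
   
   def consume_event(event: dict) -> str:
       """Task 2: read the key from the triggering event's extra and fetch it."""
       s3_key = event["extra"]["s3_key"]
       # ... retrieve s3_key from S3 (e.g. via an S3Hook) here ...
       return s3_key
   
   event = produce_event(date(2020, 1, 1))
   print(consume_event(event))  # /path/to/dataset/2020-01-01.avro
   ```
   
   The important property is that `extra` is attached to each *event*, not to 
the Dataset definition itself, so every trigger of `DAG B` knows exactly which 
new file to process.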
   
   In my opinion this would be a nice feature for data-aware scheduling: it 
would let users pass more specific information about what has changed in a 
dataset, so that consuming tasks can react accordingly.
   
   This issue might then be more of a feature request than a bug report. What 
is your opinion on this, @blag?
