w0ut0 opened a new issue, #39017:
URL: https://github.com/apache/airflow/issues/39017

   ### Description
   
   I would like the ability to trigger a DAG on ANY Dataset matching a wildcard/regex.
   
   ### Use case/motivation
   
   We have 2 (similar) use cases.
   
   ## Trigger ingestion process on file arrival.
   Files land in blob storage, which triggers events like:
   ```json
   {
     "filename": "/raw/my/folder1/2024/04/15/file1.csv",
     "event": "fileCreated"
   }
   ```
   and
   ```json
   {
     "filename": "/raw/my/folder1/2024/04/15/file1.json",
     "event": "fileCreated"
   }
   ```
   and
   ```json
   {
     "filename": "/raw/my/folder1/2024/02/10/file1.json",
     "event": "fileCreated"
   }
   ```
   
   These events are converted to Dataset update events, either by calling the Airflow API or by a DAG that polls a queue containing the events (a sketch of such an API call follows the list below). In the above case, three Datasets would be updated:
   - `ingestionblob://raw/my/folder1/2024/04/15/file1.csv`
   - `ingestionblob://raw/my/folder1/2024/04/15/file1.json`
   - `ingestionblob://raw/my/folder1/2024/02/10/file1.json`
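   
   The conversion step itself could be as simple as the following sketch. This assumes the `POST /datasets/events` endpoint of the stable REST API (available in recent Airflow versions, 2.9+); the host, credentials and incoming message are placeholders:
   
   ```python
   # Minimal sketch: turn a blob-storage notification into an Airflow Dataset
   # event via the REST API. Assumes the "create dataset event" endpoint at
   # /api/v1/datasets/events (Airflow 2.9+); host, auth and the incoming
   # message below are placeholders for illustration only.
   import requests

   AIRFLOW_API = "https://airflow.example.com/api/v1"  # placeholder host


   def publish_dataset_event(blob_event: dict) -> None:
       """Map a fileCreated notification onto a Dataset update event."""
       # Map the blob path onto the dataset URI scheme used above.
       uri = f"ingestionblob:/{blob_event['filename']}"
       resp = requests.post(
           f"{AIRFLOW_API}/datasets/events",
           json={"dataset_uri": uri, "extra": {"event": blob_event["event"]}},
           auth=("user", "password"),  # placeholder credentials
           timeout=30,
       )
       resp.raise_for_status()


   publish_dataset_event(
       {"filename": "/raw/my/folder1/2024/04/15/file1.csv", "event": "fileCreated"}
   )
   ```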
   
   Imagine that we have a DAG `ingestCSVFromRawRealTime`. I would like to be able to trigger that DAG on Datasets matching a wildcard such as `ingestionblob://raw/my/folder1/{today.year}/{today.month}/{today.day}/*.csv`.
   
   Similarly for some DAGs like `ingestJsonFromRawRealTime` or 
`ingestBackfilledCsvFromRaw`.
   
   Ideally, we could define capture groups in the regex of the Dataset trigger and use those capture groups as DAG parameters.
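   
   Purely as an illustration of what I have in mind (none of this exists today; `DatasetPattern`, its regex schedule, and the injection of capture groups into params are hypothetical):
   
   ```python
   # Hypothetical sketch of the requested feature: DatasetPattern does NOT
   # exist in Airflow; the class, its regex argument, and the capture-group
   # to params mapping are assumptions made for illustration only.
   from datetime import datetime

   from airflow.datasets import DatasetPattern  # hypothetical class
   from airflow.decorators import dag, task


   @dag(
       dag_id="ingestCSVFromRawRealTime",
       start_date=datetime(2024, 1, 1),
       schedule=DatasetPattern(
           r"ingestionblob://raw/my/folder1/"
           r"(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/(?P<name>[^/]+)\.csv"
       ),
   )
   def ingest_csv_from_raw_realtime():
       @task
       def ingest(**context):
           # Capture groups would arrive as DAG params (assumed behaviour).
           p = context["params"]
           print(p["year"], p["month"], p["day"], p["name"])

       ingest()


   ingest_csv_from_raw_realtime()
   ```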
   
   ## Monitor database updates
   Our (Databricks) DWH, like almost all other DBMSs, has a 3-level namespace: `database.schema.table`. We will create a mechanism to emit Dataset events like
   - `database1.bronze.table7`
   - `database4.silver.view3`
   - `database2.gold.streamingTable1`
   - ...
   
   Sometimes we want to be notified on updates to a whole database, schema, or table. There are some use cases where wildcard triggers would be beneficial (see the sketch after this list):
   - Airflow needs to trigger a `sync` in our BI tool (e.g. PowerBI, Tableau) if an object in the `gold` schema of `database2` is updated.
   - When data arrives in the `bronze` schema, some curation scripts need to be kicked off to generate a 'cleaned' silver table.
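   
   For example, the BI-sync case could be expressed with the same hypothetical `DatasetPattern` as above; `trigger_powerbi_sync` and the refresh call are placeholders, not existing Airflow APIs:
   
   ```python
   # Hypothetical sketch for the first bullet above; DatasetPattern and the
   # BI refresh call are placeholders, not existing Airflow APIs.
   from datetime import datetime

   from airflow.datasets import DatasetPattern  # hypothetical class
   from airflow.decorators import dag, task


   @dag(
       dag_id="syncPowerBIOnGoldUpdate",
       start_date=datetime(2024, 1, 1),
       schedule=DatasetPattern(r"database2\.gold\..*"),  # any object in database2.gold
   )
   def sync_bi_on_gold_update():
       @task
       def trigger_powerbi_sync():
           ...  # call the BI tool's refresh API here (placeholder)

       trigger_powerbi_sync()


   sync_bi_on_gold_update()
   ```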
   
   
   
   ### Related issues
   
   - [#34534](https://github.com/apache/airflow/issues/34534). However, with the new logical operators for Datasets, we can already trigger on more than one dataset (see the sketch below).
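   
   For contrast, a minimal sketch of what the conditional dataset expressions already allow today (assuming Airflow 2.9+): every dataset URI still has to be enumerated explicitly, which is exactly what the wildcard would avoid:
   
   ```python
   # Minimal sketch of today's conditional dataset scheduling (Airflow 2.9+):
   # datasets are combined with | / &, but every URI must be listed explicitly.
   from datetime import datetime

   from airflow.datasets import Dataset
   from airflow.decorators import dag
   from airflow.operators.empty import EmptyOperator

   csv_file = Dataset("ingestionblob://raw/my/folder1/2024/04/15/file1.csv")
   json_file = Dataset("ingestionblob://raw/my/folder1/2024/04/15/file1.json")


   @dag(
       dag_id="example_conditional_datasets",
       start_date=datetime(2024, 1, 1),
       schedule=(csv_file | json_file),  # runs when either dataset is updated
   )
   def example_conditional_datasets():
       EmptyOperator(task_id="placeholder")


   example_conditional_datasets()
   ```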
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

