Hi Jeff,

This is a valid use case, but we need to think about how to approach it. It may be a trivial change (or it may not, I don't know), but what worries me is that we have several other requests (for example https://github.com/apache/airflow/issues/30974 and https://github.com/apache/airflow/issues/31013). Each one is unique, and we need to find a way to make it easy and simple to set up Datasets without causing side effects or parameter conflicts/madness.
I think we first need to discuss a mechanism that lets Datasets support the new use cases holistically, rather than patching each individual use case. I don't know if this is AIP-worthy. If you'd like to lead this, I'm happy to review.

On Thu, Jul 6, 2023 at 5:36 AM Jeff Payne <jpa...@bombora.com> wrote:
> Hello Airflow dev community!
>
> TL;DR: I'd like Airflow Datasets to support the following use case: I use
> a BigQuery table, MyRawRecords, which is partitioned by date. DAG
> write_my_raw_records inserts records into that table on a daily schedule
> and is sometimes rerun, either to correct previously inserted records or
> to insert late-arriving data. A downstream DAG,
> aggregate_my_raw_records_for_analysis, owned by a separate team, runs
> daily and generates aggregate values against MyRawRecords. The downstream
> DAG should be rerun any time the MyRawRecords table is written to
> (new/updated data for a single partition). The existing Dataset mechanism
> doesn't support providing the target partition to the consuming DAG.
>
> At Bombora, the above is a common use case. We produce many internally
> facing data products and would like to leverage a data/event-driven
> approach to triggering downstream DAGs without the explicit coupling of
> TriggerDagRunOperator/ExternalTaskSensor or the implicit coupling of
> schedule alignment with sensor timeouts.
>
> Currently, to leverage Datasets, this use case requires the consuming DAG
> to figure out which target partition of the table has been updated. Of
> course, the aggregates for the entire table could be recalculated, but
> this is a huge waste of resources.
>
> It seems like this may be one of many logical next steps for
> Dataset-related features. There has been interest in supporting this use
> case in the comments on the original Dataset AIP:
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-48+Data+Dependency+Management+and+Data+Driven+Scheduling
> Specifically, this and follow-up comments:
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-48+Data+Dependency+Management+and+Data+Driven+Scheduling?focusedCommentId=217385741#comment-217385741
>
> I have not looked at the Dataset-related code and am not sure what the
> complexity of something like this would be. I assume it wouldn't be
> trivial, given that Datasets are currently static objects.
>
> Would love to hear some feedback. Thanks!
>
> Jeff Payne
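[Editor's note: to make the gap described above concrete, here is a minimal sketch in plain Python. This is not real Airflow code; the event shape and the `resolve_partition` helper are hypothetical, and only illustrate why a consuming DAG currently has to guess which partition changed.]

```python
from datetime import date

# Today, a Dataset event effectively tells the consumer only *which*
# dataset changed, not which partition of it. Modeled here as a bare URI:
dataset_event = {"uri": "bigquery://my_project/my_dataset/MyRawRecords"}

def resolve_partition(event: dict, logical_date: date) -> date:
    # Hypothetical helper: with no partition info in the event, the
    # consumer can only fall back to its own logical date. That guess is
    # wrong whenever the producer re-ran to fix an *older* partition.
    return logical_date

# Normal daily run: the guess happens to be right.
assert resolve_partition(dataset_event, date(2023, 7, 6)) == date(2023, 7, 6)

# Producer re-runs 2023-06-01 to backfill late-arriving data; the consumer
# is triggered "today" and still aggregates the wrong partition.
print(resolve_partition(dataset_event, date(2023, 7, 6)))  # not 2023-06-01
```

If Dataset events could carry some form of per-event metadata (e.g. the written partition), the consumer could aggregate exactly the changed partition instead of guessing or recomputing the whole table.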