Re: Support for Datasets with execution-time values for use cases requiring more fine-grained dataset specs

Jeff Payne Sun, 09 Jul 2023 09:19:53 -0700

Thanks for responding.  That all sounds reasonable.  I don't know that I'm the 
right person to lead that, but I'm willing to give it a shot.  Let me know if 
you have someone else in mind.  In the meantime, I'll discuss the effort with
a couple of work colleagues.
________________________________
From: Jarek Potiuk <ja...@potiuk.com>
Sent: Saturday, July 8, 2023 11:39 PM
To: dev@airflow.apache.org <dev@airflow.apache.org>
Subject: Re: Support for Datasets with execution-time values for use cases 
requiring more fine-grained dataset specs


Yeah. Agree with Elad. I think writing a follow-up AIP on a few cases for
Datasets (that were discussed and specifically left out for later) is
probably the best thing we could do, and getting someone to lead it and
write such an AIP would be great.

I personally think that datasets are not as useful or popular as they could
be (though of course we have no good data to base it on; it's more of a
gut feeling, judging from the number of issues people raise about them,
which in general is very few).

I think the following are the important areas that could help with more
usage:

* partial datasets (as in the original thread)
* triggering datasets based on external dataset changes (banking on the
evident popularity of Deferrable and Triggers, it feels that we could make a
nice, efficient, and customizable solution for that one)
* time + dataset combined dependencies (those two issues mentioned by Elad).

It would be great to have the Dataset case more complete with these three.

J.



On Sun, Jul 9, 2023 at 8:12 AM Elad Kalif <elad...@apache.org> wrote:

> Hi Jeff,
>
> This is a valid use case, but we need to think about how to approach this.
> It may be a trivial change (or it may not, IDK), but what worries me is that
> we have several other requests (for example
> https://github.com/apache/airflow/issues/30974 and
> https://github.com/apache/airflow/issues/31013 ). Each one is unique, and we
> need to find a way to make sure it's easy and simple to set up Datasets
> without causing side effects or parameter conflicts/madness.
>
> I think we need to first discuss the mechanism that allows Datasets to
> support the new use cases in a holistic manner rather than patching each
> individual use case.
>
> I don't know if this is AIP worthy. If you'd like to lead this I'm happy to
> review.
>
> On Thu, Jul 6, 2023 at 5:36 AM Jeff Payne <jpa...@bombora.com> wrote:
>
> > Hello Airflow dev community!
> >
> > TL;DR:
> > I'd like AF Datasets to support the following use case: I use a BigQuery
> > table MyRawRecords, which is partitioned by date. DAG write_my_raw_records
> > inserts records into said table on a daily schedule and is sometimes rerun
> > to either correct previously inserted records or to insert late-arriving
> > data. There is a downstream DAG aggregate_my_raw_records_for_analysis,
> > owned by a separate team, that runs daily and generates aggregate values
> > against MyRawRecords. This downstream DAG should be rerun any time the
> > MyRawRecords table is written to (new/updated data for a single
> > partition). The existing Dataset mechanism doesn't support providing the
> > target partition to the consuming DAG.
> >
> > At Bombora, the above is a common use case.  We produce many internally
> > facing data products and would like to leverage a data/event-driven
> > approach to triggering downstream DAGs without the explicit coupling of
> > TriggerDagRunOperator/ExternalTaskSensor or the implicit coupling of
> > schedule alignment with sensor timeouts.
> >
> > Currently, to leverage Datasets, this use case requires the consuming DAG
> > to figure out which target partition of the table has been updated.  Of
> > course, the aggregates for the entire table could be recalculated, but this
> > is a huge waste of resources.
> >
> > It seems like this may be one of many logical next steps with respect to
> > Dataset related features.  There has been interest in supporting this use
> > case voiced in the comments on the original Dataset AIP:
> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-48+Data+Dependency+Management+and+Data+Driven+Scheduling
> > Specifically, this and follow-up comments:
> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-48+Data+Dependency+Management+and+Data+Driven+Scheduling?focusedCommentId=217385741#comment-217385741
> >
> > I have not looked at the Dataset related code and am not sure what the
> > complexity of something like this would be.  I assume it wouldn't be
> > trivial, given that the Datasets are currently static objects.
> >
> > Would love to hear some feedback.  Thanks!
> >
> > Jeff Payne
> >
>
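
For illustration only, the use case in the original message can be sketched
in plain Python. This is a toy model under stated assumptions, not Airflow
code and not a proposed design: DatasetEvent, DatasetBus, the "partition"
key, and the dataset URI are all hypothetical names invented for this sketch.
It shows a producer attaching the written partition to each dataset event so
that the consumer recomputes only that partition.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Callable, Dict, List


@dataclass
class DatasetEvent:
    """Hypothetical event: a dataset URI plus execution-time details."""
    uri: str
    extra: Dict[str, str] = field(default_factory=dict)


class DatasetBus:
    """Toy stand-in for a scheduler routing dataset events to consumers."""

    def __init__(self) -> None:
        self._consumers: Dict[str, List[Callable[[DatasetEvent], None]]] = {}

    def subscribe(self, uri: str, consumer: Callable[[DatasetEvent], None]) -> None:
        self._consumers.setdefault(uri, []).append(consumer)

    def publish(self, event: DatasetEvent) -> None:
        for consumer in self._consumers.get(event.uri, []):
            consumer(event)


# Hypothetical URI for the date-partitioned MyRawRecords table.
RAW_TABLE = "bigquery://my_project/my_dataset/MyRawRecords"

raw_records: Dict[date, List[int]] = {}  # partition -> raw values
aggregates: Dict[date, int] = {}         # partition -> aggregate value
recomputed: List[date] = []              # partitions the consumer touched


def write_my_raw_records(bus: DatasetBus, partition: date, values: List[int]) -> None:
    """Producer: (re)write one partition and announce which one changed."""
    raw_records[partition] = list(values)
    bus.publish(DatasetEvent(RAW_TABLE, {"partition": partition.isoformat()}))


def aggregate_my_raw_records_for_analysis(event: DatasetEvent) -> None:
    """Consumer: recompute only the partition named in the event."""
    partition = date.fromisoformat(event.extra["partition"])
    aggregates[partition] = sum(raw_records[partition])
    recomputed.append(partition)


bus = DatasetBus()
bus.subscribe(RAW_TABLE, aggregate_my_raw_records_for_analysis)

write_my_raw_records(bus, date(2023, 7, 5), [1, 2, 3])
write_my_raw_records(bus, date(2023, 7, 6), [10])
# Rerun for an earlier day with late-arriving data.
write_my_raw_records(bus, date(2023, 7, 5), [1, 2, 3, 4])
```

In this sketch the rerun for 2023-07-05 triggers a recomputation of that
partition alone; the 2023-07-06 aggregate is left untouched, which is the
behavior the original message asks for instead of recalculating the whole
table.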
