Hello Jeff,

I agree this is a very common use case for incremental processing. For
example, you might schedule a DAG every hour that uses the data interval
to filter its input and overwrite the corresponding partition in the
output. This also generalizes to more complex cases where you implement
roll-up windows, for example reading the previous 24 days each day to
compute a new daily average. To support this, Dataset updates would need
to keep the context of which data interval was updated and carry it
through to the downstream Dataset update as well. I agree with Jarek: we
should write a spec for this as the next iteration of this feature.
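For the hourly case, the producing DAG could look something like this (a
minimal sketch using the TaskFlow API of Airflow 2.x; the DAG and task
names are illustrative, and the print stands in for the real overwrite):

from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@hourly", start_date=datetime(2023, 7, 1), catchup=False)
def write_hourly_partition():
    @task
    def overwrite_partition(data_interval_start=None, data_interval_end=None):
        # Airflow injects the data interval bounds from the task context.
        # Filter the input to this run's interval and overwrite only the
        # matching output partition, so reruns stay idempotent.
        partition = data_interval_start.strftime("%Y-%m-%d-%H")
        print(f"Rewriting output partition {partition} for interval "
              f"[{data_interval_start}, {data_interval_end})")

    overwrite_partition()

write_hourly_partition()

Best,
Julien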
On Sun, Jul 9, 2023 at 9:19 AM Jeff Payne <jpa...@bombora.com> wrote:

> Thanks for responding. That all sounds reasonable. I don't know that I'm
> the right person to lead that, but I'm willing to give it a shot. Let me
> know if you have someone else in mind. In the meantime, I'll discuss the
> effort with a couple of work colleagues.
> ________________________________
> From: Jarek Potiuk <ja...@potiuk.com>
> Sent: Saturday, July 8, 2023 11:39 PM
> To: dev@airflow.apache.org <dev@airflow.apache.org>
> Subject: Re: Support for Datasets with execution-time values for use cases
> requiring more fine-grained dataset specs
>
> Yeah, agree with Elad. I think writing a follow-up AIP on a few cases for
> Datasets (those that were discussed and specifically left out for later)
> is probably the best thing we could do, and getting someone to lead it
> and write such an AIP would be great.
>
> I personally think that Datasets are not as useful or popular as they
> could be (though of course we have no good data to base that on; it's
> more of a gut feeling, judging from the number of issues people raise
> about them, which is in general very few).
>
> These are the areas that I personally think could help drive more usage:
>
> * partial datasets (as in the original thread)
> * triggering datasets based on external dataset changes (banking on the
> evident popularity of deferrable operators and triggers, it feels like we
> could build a nice, efficient, customizable solution for that one)
> * time + dataset combined dependencies (the two issues Elad mentioned)
>
> It would be great to make the Dataset story more complete with these
> three.
>
> J.
>
> On Sun, Jul 9, 2023 at 8:12 AM Elad Kalif <elad...@apache.org> wrote:
>
> > Hi Jeff,
> >
> > This is a valid use case, but we need to think about how to approach
> > it. It may be a trivial change (or it may not, I don't know), but what
> > worries me is that we have several other requests (for example
> > https://github.com/apache/airflow/issues/30974 and
> > https://github.com/apache/airflow/issues/31013), each of them unique.
> > We need to find a way to make it easy and simple to set up Datasets
> > without causing side effects or parameter conflicts/madness.
> >
> > I think we first need to discuss a mechanism that lets Datasets support
> > the new use cases in a holistic manner, rather than patching each
> > individual use case.
> >
> > I don't know if this is AIP-worthy. If you'd like to lead it, I'm happy
> > to review.
> >
> > On Thu, Jul 6, 2023 at 5:36 AM Jeff Payne <jpa...@bombora.com> wrote:
> >
> > > Hello Airflow dev community!
> > >
> > > TL;DR: I'd like Airflow Datasets to support the following use case.
> > > I have a BigQuery table, MyRawRecords, which is partitioned by date.
> > > DAG write_my_raw_records inserts records into that table on a daily
> > > schedule and is sometimes rerun, either to correct previously
> > > inserted records or to insert late-arriving data. A downstream DAG,
> > > aggregate_my_raw_records_for_analysis, owned by a separate team, runs
> > > daily and computes aggregate values over MyRawRecords. The downstream
> > > DAG should be rerun any time the MyRawRecords table is written to
> > > (new/updated data for a single partition). The existing Dataset
> > > mechanism doesn't support providing the target partition to the
> > > consuming DAG.
> > >
> > > At Bombora, the above is a common use case.
> > > We produce many
> > > internally facing data products and would like to leverage a
> > > data/event-driven approach to triggering downstream DAGs, without the
> > > explicit coupling of TriggerDagRunOperator/ExternalTaskSensor or the
> > > implicit coupling of schedule alignment with sensor timeouts.
> > >
> > > Currently, to leverage Datasets, this use case requires the consuming
> > > DAG to figure out which target partition of the table has been
> > > updated. Of course, the aggregates for the entire table could be
> > > recalculated, but that is a huge waste of resources.
> > >
> > > This seems like one of many logical next steps for Dataset-related
> > > features. Interest in supporting this use case has been voiced in the
> > > comments on the original Dataset AIP,
> > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-48+Data+Dependency+Management+and+Data+Driven+Scheduling
> > > , specifically this comment and its follow-ups:
> > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-48+Data+Dependency+Management+and+Data+Driven+Scheduling?focusedCommentId=217385741#comment-217385741
> > >
> > > I have not looked at the Dataset-related code and am not sure how
> > > complex something like this would be. I assume it wouldn't be
> > > trivial, given that Datasets are currently static objects.
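> > > For concreteness, this is roughly what the producer/consumer pair
> > > looks like with the current Dataset mechanism (a minimal sketch,
> > > assuming Airflow 2.4+; DAG names, the Dataset URI, and the task
> > > bodies are illustrative):
> > >
> > > from datetime import datetime
> > > from airflow.datasets import Dataset
> > > from airflow.decorators import dag, task
> > >
> > > my_raw_records = Dataset("bigquery://my-project/my_dataset/MyRawRecords")
> > >
> > > @dag(schedule="@daily", start_date=datetime(2023, 7, 1), catchup=False)
> > > def write_my_raw_records():
> > >     @task(outlets=[my_raw_records])
> > >     def insert_partition(data_interval_start=None):
> > >         # Writes a single date partition; the partition key computed
> > >         # here never reaches the consuming DAG below.
> > >         print(f"Inserting partition {data_interval_start:%Y-%m-%d}")
> > >     insert_partition()
> > >
> > > @dag(schedule=[my_raw_records], start_date=datetime(2023, 7, 1), catchup=False)
> > > def aggregate_my_raw_records_for_analysis():
> > >     @task
> > >     def aggregate():
> > >         # Fires on any update to the Dataset, but with no way to tell
> > >         # which partition changed, we re-aggregate the whole table.
> > >         print("Recomputing aggregates over all of MyRawRecords")
> > >     aggregate()
> > >
> > > write_my_raw_records()
> > > aggregate_my_raw_records_for_analysis()
> > >
> > > Would love to hear some feedback. Thanks!
> > >
> > > Jeff Payne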