Re: [DISCUSS] AIP-76 Asset Partitions

Daniel Standish Mon, 29 Jul 2024 11:32:31 -0700

>
> We should clarify in the AIP doc that the proposed partitioning feature is
> not designed specifically to handle incremental loads in the traditional
> sense. Instead, it is intended to manage and process data in defined
> segments or partitions.



Agree.


> However, partitions can be used in conjunction with incremental loading
> strategies. For example, a time-based partitioning scheme can ensure that
> only data from relevant time periods is processed,
> *and within those partitions, incremental updates can be tracked and
> processed.*


I'm not sure what you mean by this, particularly the bit I emphasized.  Can
you try to clarify?
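
To make the distinction concrete, here is a minimal sketch of why a watermark on
`updated_at` does not define a partition, while a stable key does.  The rows, the
`event_date` column, and both helper functions are hypothetical illustrations and
are not taken from the AIP.

```python
from datetime import date, datetime

# Hypothetical rows: `event_date` never changes, `updated_at` moves on every edit.
rows = [
    {"id": 1, "event_date": date(2024, 7, 1), "updated_at": datetime(2024, 7, 1, 9)},
    {"id": 2, "event_date": date(2024, 7, 2), "updated_at": datetime(2024, 7, 2, 9)},
]


def incremental_batch(rows, watermark):
    """Ids of rows changed since the watermark (a watermark-based incremental load)."""
    return [r["id"] for r in rows if r["updated_at"] > watermark]


def partition(rows, day):
    """Ids of rows in the fixed daily partition keyed on `event_date`."""
    return [r["id"] for r in rows if r["event_date"] == day]


watermark = datetime(2024, 7, 2, 12)
before = incremental_batch(rows, watermark)

# Row 1 is edited later: its `updated_at` moves, its `event_date` does not.
rows[0]["updated_at"] = datetime(2024, 7, 3, 8)
after = incremental_batch(rows, watermark)

print("incremental before edit:", before)
print("incremental after edit:", after)
print("2024-07-01 partition:", partition(rows, date(2024, 7, 1)))
```

Row 1 shows up in a new incremental batch once it is edited, yet it stays in the
same `event_date` partition throughout.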


On Mon, Jul 29, 2024 at 11:01 AM Kaxil Naik <kaxiln...@gmail.com> wrote:

> Yeah, TP and I discussed that we aren't solving the incremental load
> problem; folks can use it to achieve it similar to how you achieved it by
> storing the Watermark in Variables and we can natively support it with a
> revised AIP-30 in one of the minor releases for Airflow 3.
>
> We should clarify in the AIP doc that the proposed partitioning feature is
> not designed specifically to handle incremental loads in the traditional
> sense. Instead, it is intended to manage and process data in defined
> segments or partitions.
>
> However, partitions can be used in conjunction with incremental loading
> strategies. For example, a time-based partitioning scheme can ensure that
> only data from relevant time periods is processed, and within those
> partitions, incremental updates can be tracked and processed.
>
> Regards,
> Kaxil
>
>
>
> On Mon, 29 Jul 2024 at 18:00, Daniel Standish
> <daniel.stand...@astronomer.io.invalid> wrote:
>
> > Hi,
> >
> > *1. incremental loads*
> >
> > There is mention of incremental processing / incremental loads in the
> > doc.
> >
> > E.g.
> >
> > > This is particularly useful for large datasets that need to be processed
> > > incrementally or updated periodically.
> >
> >
> > And
> >
> > > Facilitating Incremental Processing: Many modern data processing
> > > strategies rely on incremental updates
> >
> >
> > But there are no examples re how this solves for that use case.
> >
> > I think it's actually not good to think of or talk about incremental loads
> > as "partitioned".
> >
> > Let me explain.
> >
> > An incremental load might track an `updated_at` column.  The data it
> > processes is the data with an updated `updated_at` column.  But you would
> > not be correct in calling this a partition of data.  Because when the data
> > is updated again, it would now be in another partition.  That's not what
> > partitioning is.
> >
> > If this is supposed to solve for incremental loads, I think an example is
> > needed.  If it's not, let's call it out explicitly and say this is not
> > solving for incremental loads.
> >
> > *2. support for tasks*
> >
> > I see this is specific to tasks defined with the asset syntax.  What's the
> > story with "normal" dags and tasks e.g. with task flow or classic
> > operators.  Is this AIP adding support only for assets?  Is there some plan
> > for that?
> >
>
