Re: [DISCUSS] Proposal for AIP-76 (Asset Partitions) Implementation

Hussein Awala Sat, 06 Sep 2025 17:01:25 -0700

Thanks Daniel for the review, and sorry if my proposal came across as
incomplete. That’s fair feedback — my initial focus was mainly on how
partitions are stored in the DB and how we can leverage them in the
scheduling logic. Since the mapped partitions part was already very clear
in the AIP, I considered its implementation to be relatively
straightforward once we have the DB tables ready (e.g., 24 hourly
partitions can be generated automatically from a daily run, 7 daily
partitions or 7×24 hourly partitions can be generated automatically from a
weekly run, etc.).

The AIP mentioned "Time-based Partitions", "Time-based Partitions" and
"Combined Partitions", which I tried to cover in my proposal. So even if we
follow your recommendation — *“for the initial release of asset
partitioning, I think we have to keep it as simple as we can and try to
avoid trying to do too much in the first introduction of the feature”* —
our DB tables and design should still support these scenarios without
breaking changes. I don’t see this as a YAGNI concern, since we plan to
support them soon. On the other hand, if we ignore them completely in the
design (for example, storing the partition in a Variant column as just a
list of one partition key), I believe this would lead to a bad design.

The AIP focused primarily on mapping (1 asset change triggering X DAG
runs), but it also highlighted in the *future work* section: partial
refresh, complex schedule–partition mappings, and complex partition
dependency rules. And for the completeness logic I proposed in my document:

*"The downstream language_model is scheduled against the upstream
hourly_data, but does not want to be materialized as often (perhaps due to
the materialization being expensive). This allows the downstream to still
“follow” the upstream’s schedule, instead of having an independent one and
worry about language_model being accidentally run too soon before the last
hourly_data finishes—a common problem with traditional DAGs that
necessitates a sensor in the beginning of the downstream.I have not decided
how best to implement the “skipping” part of this. The first 23 upstream
events still need to be handled in some way. This can be done by still
creating 23 runs but not actually running the task, or we can choose to not
create the runs at all. I feel we should still do something (doing nothing
may appear like Airflow is having hiccups to the user), but what exactly is
undecided yet."*

This was meant as an open implementation question. In my document I was
trying to suggest a more generic solution that could cover this scenario
and many others in an efficient, easy-to-implement way.

The main goal of my proposal is to highlight that the relationship between
asset events and scheduled runs will no longer be strictly 1:1 (as it is
today). Instead:

   -

   we may create multiple DAG runs from a single asset event (if the event
   includes multiple partitions),
   -

   we may create one DAG run from multiple asset events (via the
   completeness logic, where a DAG waits for a window to complete),
   -

   and in the simplest case, a single DAG run can still correspond to a
   single partition — which was the main focus in the initial AIP.

I’ll work on updating the document soon to explicitly include the missing
part (partition mapping) and add direct references/quotes from the initial
AIP, so the relationship between my suggestions and the existing AIP is
clearer.

One last point: cron-based DAGs are increasingly less common, as users
prefer event-driven scheduling — whether for internal dependencies,
external triggers via REST API, or asset watchers. In these three cases,
inferring partitions from the logical date is not always possible or
helpful. That’s why I tried to introduce a way to define changed partitions
at runtime, where sometimes partitions are propagated from an asset event
or from asset partitions used to schedule a DAG run.

Thanks again for the detailed review — I’ll make sure to address the
missing links to AIP-76 in the next revision.

Hussein

On Wed, Sep 3, 2025 at 1:11 AM Daniel Standish
<daniel.stand...@astronomer.io.invalid> wrote:

> Reviewed the proposal.
>
> I'll share some thoughts below.
>
> But one general thing I want to emphasize as we work towards implementation
> of the AIP is, this is not going to be easy.  Airflow has a lot of
> complexity at this point and a lot of interacting interfaces.  Dags, tasks,
> assets, watchers, assets that are defined on their own, assets that are
> defined as part of a dag, assets updated from triggers, asset aliases....
> And when we add partitioning into the mix, it sort of has the potential to
> multiply the complexity.  So, for the initial release of asset
> partitioning, I think we have to keep it as simple as we can and try to
> avoid trying to do too much in the first introduction of the feature.  I
> think we need to focus on the most basic scenario, namely, assets that are
> partitioned by time windows, and focus on how to implement that and
> reconciling all the implications for all the other interfaces that are ...
> implicated in that change.  I think that everything else, probably makes
> sense to defer until we get out that initial implementation of the core
> feature.  There will be enough to sort out with just that.  And even with
> just focusing on the most simple thing, I expect we'll have to come to the
> list a few times over the next few months to resolve questions about how to
> reconcile these things and what the behavior should be.
>
> Now, moving on to your proposal document specifically, one thing that
> stands out to me is you do not really engage in much dialogue with the
> existing AIP, AIP-76
> <
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-76%2BAsset%2BPartitions
> >,
> (authored by TP and accepted last year).  Of course, all AIPs change during
> implementation.  But I'm just not sure how to interpret your proposal.  Is
> this meant to be added to the existing AIP?  Or are there components of the
> AIP you wish to replace with your proposal?  I think it would be helpful
> for you to engage more directly with the AIP, and be direct about what your
> goals are, what you think needs to be changed, how your proposal fits in
> etc.  Maybe better to frame it as specific proposed amendments to the AIP,
> rather than leaving it to us to figure out the implications.
>
> For example, you introduce a completeness concept to handle partition
> mapping (as it is sometimes called).  But the existing AIP already
> discusses its approach to partition mapping.  Here's an excerpt from the
> AIP:
>
> If you want a downstream to aggregate multiple partitions from the
> > upstream, you can do
> > @asset(schedule=hourly_data, partition=PartitionByInterval("@daily"))
> > def aggregated_daily_data():
> > ...
> > Every partition of this asset depends on 24 partitions of hourly_data of
> > the day.
>
> So, the existing AIP says that by default, the daily asset should be mapped
> to the 24 hourly partitions that align with the partition implied by the
> daily partition scheme.
>
> Interestingly, dagster has a partition mapping interface, and if you don't
> provide it, it doesn't assume there should be any mapping.  I kindof like
> that approach (explicit over implicit).  And I like the language of
> partition mapping better than the "completeness" language.
>
> You also propose that asset event producers can emit partition info along
> with the asset event.  Which seems reasonable enough.  But, here too, TP
> already provided in the AIP a mechanism for an asset to record what
> partition it's dealing with (in the case of "dynamic" partitions).  And
> otherwise, shouldn't the asset already know what partition it's supposed to
> be dealing with?
>
> Thanks
>
> On Tue, Sep 2, 2025 at 8:39 AM Constance Martineau <
> consta...@astronomer.io>
> wrote:
>
> > Hi Hussein,
> >
> > Thanks for creating this. @Daniel Standish <
> daniel.stand...@astronomer.io>
> > , @Tzu-ping Chung <t...@astronomer.io> and I (well, mostly them :) ) will
> > take a look. We have started defining an implementation plan, but it's
> > still early so perfect timing.
> >
> > Constance
> >
> > On Sun, Aug 31, 2025 at 6:23 PM Hussein Awala <huss...@awala.fr> wrote:
> >
> >> Hi all,
> >>
> >> I’m not sure if the Astronomer team has already started work on
> >> implementing *AIP-76*, but I’ve prepared a proposal for how we could
> >> approach the implementation.
> >>
> >> The proposal covers:
> >>
> >>    -
> >>
> >>    Extending the asset/event model to support partitions
> >>    -
> >>
> >>    A normalized schema for asset event partitions
> >>    -
> >>
> >>    Watermark- and completeness-based scheduling (daily/weekly/monthly
> and
> >>    optional rolling windows)
> >>    -
> >>
> >>    Handling of re-processed partitions
> >>
> >> You can find the proposal document here:
> >>
> >>
> https://docs.google.com/document/d/17RMpjronpNerqHBN-KwNn0jjscrSYJzSDhAakentfd0/edit?usp=sharing
> >>
> >> I’d appreciate your feedback and review. My suggestion is that we start
> >> implementation after the *Airflow 3.1 release.*
> >>
> >>
> >> Looking forward to your thoughts,
> >>
> >> Hussein
> >>
> >
>

Re: [DISCUSS] Proposal for AIP-76 (Asset Partitions) Implementation

Reply via email to