Re: [DISCUSS] Proposal for AIP-76 (Asset Partitions) Implementation

Daniel Standish Thu, 04 Sep 2025 14:31:00 -0700

I would add, I have been looking into this AIP and doing some planning and
will likely be sharing some of our own proposed refinements in the coming
weeks as well.


On Thu, Sep 4, 2025 at 2:25 PM Daniel Standish <[email protected]> wrote:

> I would just urge you to be focused and clear about the following things:
>
> 1. where you think the existing AIP is problematic, call it out
> specifically and if possible propose your alternative as an amendment
> 2. where you are more just concerned about supporting something in the
> future that is not part of initial scope, I would suggest you focus on the
> behaviors you want to support, and not necessarily prescribe the underlying
> database structure, because there are many different ways that we can
> store the data, but ultimately the most important thing to agree on is the
> user-facing behavior.  The underlying data models can be changed.
>
> I think sometimes we can get a bit lost in the details of particular
> interfaces but what is primary is the behaviors that we need to support.
> And the behaviors we need to support will determine how we must organize
> the data in the database.
>
>
>
> On Thu, Sep 4, 2025 at 12:58 PM Hussein Awala <[email protected]> wrote:
>
>> Thanks Daniel for the review, and sorry if my proposal came across as
>> incomplete. That’s fair feedback — my initial focus was mainly on how
>> partitions are stored in the DB and how we can leverage them in the
>> scheduling logic. Since the mapped partitions part was already very clear
>> in the AIP, I considered its implementation to be relatively
>> straightforward once we have the DB tables ready (e.g., 24 hourly
>> partitions can be generated automatically from a daily run, 7 daily
>> partitions or 7×24 hourly partitions can be generated automatically from a
>> weekly run, etc.).
>>
>> The AIP mentioned "Time-based Partitions", "Time-based Partitions" and
>> "Combined Partitions", which I tried to cover in my proposal. So even if
>> we follow your recommendation (*“for the initial release of asset
>> partitioning, I think we have to keep it as simple as we can and try to
>> avoid trying to do too much in the first introduction of the feature”*),
>> our DB tables and design should still support these scenarios without
>> breaking changes. I don’t see this as a YAGNI concern, since we plan to
>> support them soon. On the other hand, if we ignore them completely in the
>> design (for example, storing the partition in a Variant column as just a
>> list of one partition key), I believe this would lead to a bad design.
>>
>> The AIP focused primarily on mapping (1 asset change triggering X DAG
>> runs), but it also highlighted in the *future work* section: partial
>> refresh, complex schedule–partition mappings, and complex partition
>> dependency rules. And for the completeness logic I proposed in my
>> document:
>>
>> *"The downstream language_model is scheduled against the upstream
>> hourly_data, but does not want to be materialized as often (perhaps due to
>> the materialization being expensive). This allows the downstream to still
>> “follow” the upstream’s schedule, instead of having an independent one and
>> worry about language_model being accidentally run too soon before the last
>> hourly_data finishes—a common problem with traditional DAGs that
>> necessitates a sensor in the beginning of the downstream. I have not
>> decided how best to implement the “skipping” part of this. The first 23
>> upstream events still need to be handled in some way. This can be done by
>> still creating 23 runs but not actually running the task, or we can choose
>> to not create the runs at all. I feel we should still do something (doing
>> nothing may appear like Airflow is having hiccups to the user), but what
>> exactly is undecided yet."*
>>
>> This was meant as an open implementation question. In my document I was
>> trying to suggest a more generic solution that could cover this scenario
>> and many others in an efficient, easy-to-implement way.
>>
>> The main goal of my proposal is to highlight that the relationship between
>> asset events and scheduled runs will no longer be strictly 1:1 (as it is
>> today). Instead:
>>
>>    - we may create multiple DAG runs from a single asset event (if the
>>      event includes multiple partitions),
>>    - we may create one DAG run from multiple asset events (via the
>>      completeness logic, where a DAG waits for a window to complete),
>>    - and in the simplest case, a single DAG run can still correspond to a
>>      single partition, which was the main focus in the initial AIP.
>>
>> I’ll work on updating the document soon to explicitly include the missing
>> part (partition mapping) and add direct references/quotes from the initial
>> AIP, so the relationship between my suggestions and the existing AIP is
>> clearer.
>>
>> One last point: cron-based DAGs are increasingly less common, as users
>> prefer event-driven scheduling — whether for internal dependencies,
>> external triggers via REST API, or asset watchers. In these three cases,
>> inferring partitions from the logical date is not always possible or
>> helpful. That’s why I tried to introduce a way to define changed
>> partitions at runtime, where sometimes partitions are propagated from an
>> asset event or from asset partitions used to schedule a DAG run.
>>
>> Thanks again for the detailed review — I’ll make sure to address the
>> missing links to AIP-76 in the next revision.
>>
>> Hussein
>>
>> On Wed, Sep 3, 2025 at 1:11 AM Daniel Standish
>> <[email protected]> wrote:
>>
>> > Reviewed the proposal.
>> >
>> > I'll share some thoughts below.
>> >
>> > But one general thing I want to emphasize as we work towards
>> > implementation of the AIP is, this is not going to be easy.  Airflow has
>> > a lot of complexity at this point and a lot of interacting interfaces.
>> > Dags, tasks, assets, watchers, assets that are defined on their own,
>> > assets that are defined as part of a dag, assets updated from triggers,
>> > asset aliases....  And when we add partitioning into the mix, it sort of
>> > has the potential to multiply the complexity.  So, for the initial
>> > release of asset partitioning, I think we have to keep it as simple as we
>> > can and try to avoid trying to do too much in the first introduction of
>> > the feature.  I think we need to focus on the most basic scenario,
>> > namely, assets that are partitioned by time windows, and focus on how to
>> > implement that and reconciling all the implications for all the other
>> > interfaces that are ... implicated in that change.  I think that
>> > everything else, probably makes sense to defer until we get out that
>> > initial implementation of the core feature.  There will be enough to sort
>> > out with just that.  And even with just focusing on the most simple
>> > thing, I expect we'll have to come to the list a few times over the next
>> > few months to resolve questions about how to reconcile these things and
>> > what the behavior should be.
>> >
>> > Now, moving on to your proposal document specifically, one thing that
>> > stands out to me is you do not really engage in much dialogue with the
>> > existing AIP, AIP-76
>> > <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-76%2BAsset%2BPartitions>,
>> > (authored by TP and accepted last year).  Of course, all AIPs change
>> > during implementation.  But I'm just not sure how to interpret your
>> > proposal.  Is this meant to be added to the existing AIP?  Or are there
>> > components of the AIP you wish to replace with your proposal?  I think it
>> > would be helpful for you to engage more directly with the AIP, and be
>> > direct about what your goals are, what you think needs to be changed, how
>> > your proposal fits in etc.  Maybe better to frame it as specific proposed
>> > amendments to the AIP, rather than leaving it to us to figure out the
>> > implications.
>> >
>> > For example, you introduce a completeness concept to handle partition
>> > mapping (as it is sometimes called).  But the existing AIP already
>> > discusses its approach to partition mapping.  Here's an excerpt from the
>> > AIP:
>> >
>> > > If you want a downstream to aggregate multiple partitions from the
>> > > upstream, you can do
>> > >
>> > > @asset(schedule=hourly_data, partition=PartitionByInterval("@daily"))
>> > > def aggregated_daily_data():
>> > >     ...
>> > >
>> > > Every partition of this asset depends on 24 partitions of hourly_data
>> > > of the day.
>> >
>> > So, the existing AIP says that by default, the daily asset should be
>> > mapped to the 24 hourly partitions that align with the partition implied
>> > by the daily partition scheme.
>> >
>> > Interestingly, Dagster has a partition mapping interface, and if you
>> > don't provide it, it doesn't assume there should be any mapping.  I kind
>> > of like that approach (explicit over implicit).  And I like the language
>> > of partition mapping better than the "completeness" language.
>> >
>> > You also propose that asset event producers can emit partition info
>> > along with the asset event.  Which seems reasonable enough.  But, here
>> > too, TP already provided in the AIP a mechanism for an asset to record
>> > what partition it's dealing with (in the case of "dynamic" partitions).
>> > And otherwise, shouldn't the asset already know what partition it's
>> > supposed to be dealing with?
>> >
>> > Thanks
>> >
>> > On Tue, Sep 2, 2025 at 8:39 AM Constance Martineau
>> > <[email protected]> wrote:
>> >
>> > > Hi Hussein,
>> > >
>> > > Thanks for creating this. @Daniel Standish <[email protected]>,
>> > > @Tzu-ping Chung <[email protected]> and I (well, mostly them :) )
>> > > will take a look. We have started defining an implementation plan, but
>> > > it's still early so perfect timing.
>> > >
>> > > Constance
>> > >
>> > > On Sun, Aug 31, 2025 at 6:23 PM Hussein Awala <[email protected]>
>> > > wrote:
>> > >
>> > >> Hi all,
>> > >>
>> > >> I’m not sure if the Astronomer team has already started work on
>> > >> implementing *AIP-76*, but I’ve prepared a proposal for how we could
>> > >> approach the implementation.
>> > >>
>> > >> The proposal covers:
>> > >>
>> > >>    - Extending the asset/event model to support partitions
>> > >>    - A normalized schema for asset event partitions
>> > >>    - Watermark- and completeness-based scheduling (daily/weekly/monthly
>> > >>      and optional rolling windows)
>> > >>    - Handling of re-processed partitions
>> > >>
>> > >> You can find the proposal document here:
>> > >>
>> > >> https://docs.google.com/document/d/17RMpjronpNerqHBN-KwNn0jjscrSYJzSDhAakentfd0/edit?usp=sharing
>> > >>
>> > >> I’d appreciate your feedback and review. My suggestion is that we
>> > >> start implementation after the *Airflow 3.1 release.*
>> > >>
>> > >>
>> > >> Looking forward to your thoughts,
>> > >>
>> > >> Hussein
>> > >>
>> > >
>> >
>>
>
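
A minimal sketch of the default mapping discussed in the thread, where one daily partition depends on the 24 hourly partitions of that day, together with the "wait until the window is complete" check. It is plain Python, not Airflow code: the key format and the helper names `hourly_keys_for_day`, `missing_hours`, and `daily_partition_ready` are assumptions made for illustration, and none of them comes from AIP-76 or from the proposal document.

```python
from datetime import date, datetime, timedelta


def hourly_keys_for_day(day: date) -> list[str]:
    """Return the 24 hourly partition keys that fall inside one daily partition.

    The "YYYY-MM-DDTHH" key format is an assumption for this sketch.
    """
    start = datetime(day.year, day.month, day.day)
    return [(start + timedelta(hours=h)).strftime("%Y-%m-%dT%H") for h in range(24)]


def missing_hours(day: date, received: set[str]) -> list[str]:
    """Hourly keys of `day` that no upstream event has reported yet."""
    return [key for key in hourly_keys_for_day(day) if key not in received]


def daily_partition_ready(day: date, received: set[str]) -> bool:
    """True once every hourly partition of `day` has been received."""
    return not missing_hours(day, received)


if __name__ == "__main__":
    day = date(2025, 9, 4)
    # Simulate the first 23 upstream events: the daily run must not start yet.
    received = set(hourly_keys_for_day(day)[:23])
    print(daily_partition_ready(day, received), missing_hours(day, received))
    # The 24th event completes the window.
    received.add("2025-09-04T23")
    print(daily_partition_ready(day, received))
```

Whether the first 23 events should create skipped runs or no runs at all is the open question raised in the thread; the sketch only shows the readiness check and takes no position on it.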
