Lee-W commented on PR #41427: URL: https://github.com/apache/airflow/pull/41427#issuecomment-2287711473
> The timestamp-based filtering is hard to reason about and, sorta imprecise. Maybe we should take this opportunity to explore using parent-child relationships that don't rely on timestamps.
>
> By that I mean, when a dataset event results in a queue record, stamp the association in a table.
>
> So like add surrogate key autoincrementing integer `id` column to DDRQ, then... when a dataset event results in ddrq creation, we create the ddrq record and create a record in a mapping table (ddrq_id, dataset_event_id).
>
> Then we'd be able to know the association precisely.
>
> The challenge though is that we have to deal with many concurrent writers. There is a race condition when creating the DDRQ record, and there's another one when the DDRQ record is "consumed". So we would have to deal with that.
>
> Given that challenge and complexity, and since timestamp comparison was "good enough", I did not go that route initially. But curious what y'all think.
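To make the proposal concrete, here is a minimal sketch of the suggested schema, using sqlite3 in memory. All table and column names beyond `ddrq_id` and `dataset_event_id` are hypothetical illustrations, not Airflow's actual models:

```python
import sqlite3

# Hypothetical schema sketching the proposal: DDRQ gets a surrogate integer
# id, and a mapping table records which dataset event produced which DDRQ
# row, so no timestamp comparison is needed to recover the association.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset_event (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    dataset_id INTEGER NOT NULL
);
CREATE TABLE dataset_dag_run_queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,       -- proposed surrogate key
    dataset_id INTEGER NOT NULL,
    target_dag_id TEXT NOT NULL
);
CREATE TABLE ddrq_dataset_event_map (           -- proposed mapping table
    ddrq_id INTEGER NOT NULL REFERENCES dataset_dag_run_queue(id),
    dataset_event_id INTEGER NOT NULL REFERENCES dataset_event(id),
    PRIMARY KEY (ddrq_id, dataset_event_id)
);
""")

# When a dataset event results in a queue record, stamp the association.
event_id = conn.execute(
    "INSERT INTO dataset_event (dataset_id) VALUES (1)"
).lastrowid
ddrq_id = conn.execute(
    "INSERT INTO dataset_dag_run_queue (dataset_id, target_dag_id) "
    "VALUES (1, 'downstream_dag')"
).lastrowid
conn.execute(
    "INSERT INTO ddrq_dataset_event_map (ddrq_id, dataset_event_id) "
    "VALUES (?, ?)",
    (ddrq_id, event_id),
)

# The events that triggered a given DDRQ row can now be looked up exactly.
rows = conn.execute(
    "SELECT dataset_event_id FROM ddrq_dataset_event_map WHERE ddrq_id = ?",
    (ddrq_id,),
).fetchall()
print(rows)  # → [(1,)]
```

The concurrent-writer race mentioned above is not addressed here; in a real implementation the DDRQ insert and the mapping insert would need to happen in one transaction, with some strategy for writers racing on the same (dataset, dag) pair.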
Is it worth exploring? I have considered it, but I would suggest we keep it as it is for now. The previous-DAG-run method has already resolved most scenarios; this rare case, where the previous DAG run is removed, was only discovered long after the feature was introduced. This PR should move us one step forward. In the current design, directly linking DDRQ and dataset events might introduce more complexity even before the race condition you mentioned. Also, we are now redesigning datasets as assets; if we're to explore a new approach, we should probably do that during the assets change. What do you think?
