Lee-W commented on PR #41427: URL: https://github.com/apache/airflow/pull/41427#issuecomment-2287711473
> The timestamp-based filtering is hard to reason about and, sorta imprecise. Maybe we should take this opportunity to explore using parent-child relationships that don't rely on timestamps.
>
> By that I mean, when a dataset event results in a queue record, stamp the association in a table.
>
> So like add surrogate key autoincrementing integer `id` column to DDRQ, then... when a dataset event results in ddrq creation, we create the ddrq record and create a record in a mapping table (ddrq_id, dataset_event_id).
>
> Then we'd be able to know the association precisely.
>
> The challenge though is that we have to deal with many concurrent writers. There is a race condition when creating the DDRQ record, and there's another one when the DDRQ record is "consumed". So we would have to deal with that.
>
> Given that challenge and complexity, and since timestamp comparison was "good enough", I did not go that route initially. But curious what y'all think.
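To make the proposal concrete, here is a minimal sketch of the suggested schema, using sqlite3 in memory. All table and column names beyond `ddrq_id` and `dataset_event_id` are hypothetical illustrations, not Airflow's actual models:

```python
import sqlite3

# Hypothetical schema sketching the proposal: DDRQ gets a surrogate integer
# id, and a mapping table records which dataset event produced which DDRQ
# row, so no timestamp comparison is needed to recover the association.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset_event (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    dataset_id INTEGER NOT NULL
);
CREATE TABLE dataset_dag_run_queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,       -- proposed surrogate key
    dataset_id INTEGER NOT NULL,
    target_dag_id TEXT NOT NULL
);
CREATE TABLE ddrq_dataset_event_map (           -- proposed mapping table
    ddrq_id INTEGER NOT NULL REFERENCES dataset_dag_run_queue(id),
    dataset_event_id INTEGER NOT NULL REFERENCES dataset_event(id),
    PRIMARY KEY (ddrq_id, dataset_event_id)
);
""")

# When a dataset event results in a queue record, stamp the association.
event_id = conn.execute(
    "INSERT INTO dataset_event (dataset_id) VALUES (1)"
).lastrowid
ddrq_id = conn.execute(
    "INSERT INTO dataset_dag_run_queue (dataset_id, target_dag_id) "
    "VALUES (1, 'downstream_dag')"
).lastrowid
conn.execute(
    "INSERT INTO ddrq_dataset_event_map (ddrq_id, dataset_event_id) "
    "VALUES (?, ?)",
    (ddrq_id, event_id),
)

# The events that triggered a given DDRQ row can now be looked up exactly.
rows = conn.execute(
    "SELECT dataset_event_id FROM ddrq_dataset_event_map WHERE ddrq_id = ?",
    (ddrq_id,),
).fetchall()
print(rows)  # → [(1,)]
```

The concurrent-writer race mentioned above is not addressed here; in a real implementation the DDRQ insert and the mapping insert would need to happen in one transaction, with some strategy for writers racing on the same (dataset, dag) pair.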
Is it worth exploring? I have considered it, but I would suggest we keep it as it is for now. The previous-DAG-run method has already resolved most scenarios; this rare case, where the previous DAG run is removed, was only discovered long after the feature was introduced. This PR should move us one step forward. In the current design, directly linking DDRQ and dataset events might introduce more complexity even before the race condition you mentioned. Also, we are now redesigning datasets as assets; if we're to explore a new approach, we should probably do that during the assets change. What do you think?
