dstandish commented on PR #41427: URL: https://github.com/apache/airflow/pull/41427#issuecomment-2286644077
The timestamp-based filtering is hard to reason about and sorta imprecise. Maybe we should take this opportunity to explore using parent-child relationships that don't rely on timestamps.

By that I mean, when a dataset event results in a queue record, stamp the association in a table. So like, add a surrogate key (an autoincrementing integer `id` column) to DDRQ. Then, when a dataset event results in DDRQ creation, we create the DDRQ record and also create a record in a mapping table `(ddrq_id, dataset_event_id)`. Then we'd be able to know the association precisely.

The challenge, though, is that we have to deal with many concurrent writers. There is a race condition when creating the DDRQ record, and there's another one when the DDRQ record is "consumed". So we would have to deal with that.

Given that challenge and complexity, and since timestamp comparison was "good enough", I did not go that route initially. But curious what y'all think. Is it worth exploring?
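To make the mapping-table idea concrete, here is a rough sketch using the standard-library `sqlite3` module. It is not the Airflow schema: the table names (`ddrq`, `dataset_event`, `ddrq_dataset_event`), the column set, and the helper functions `create_schema`, `record_event`, and `consume` are all hypothetical and simplified for illustration. It also sidesteps the hard part, since SQLite serializes writers; on Postgres or MySQL the create and consume steps would presumably need row locks or upserts to handle the two races.

```python
import sqlite3


def create_schema(conn):
    """Create simplified, hypothetical tables for the sketch."""
    conn.executescript(
        """
        CREATE TABLE dataset_event (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            dataset_id INTEGER NOT NULL
        );
        -- Surrogate key on the queue table; AUTOINCREMENT means an id is
        -- never reused after the row is deleted.
        CREATE TABLE ddrq (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            dataset_id INTEGER NOT NULL,
            target_dag_id TEXT NOT NULL,
            UNIQUE (dataset_id, target_dag_id)
        );
        -- Mapping table: which events contributed to which queue record.
        CREATE TABLE ddrq_dataset_event (
            ddrq_id INTEGER NOT NULL REFERENCES ddrq (id),
            dataset_event_id INTEGER NOT NULL REFERENCES dataset_event (id),
            PRIMARY KEY (ddrq_id, dataset_event_id)
        );
        """
    )


def record_event(conn, dataset_id, target_dag_id):
    """Insert an event, get or create the queue record, and link the two."""
    with conn:  # one transaction: commit on success, roll back on error
        event_id = conn.execute(
            "INSERT INTO dataset_event (dataset_id) VALUES (?)", (dataset_id,)
        ).lastrowid
        # No-op if a queue record for this (dataset, dag) already exists.
        conn.execute(
            "INSERT OR IGNORE INTO ddrq (dataset_id, target_dag_id) VALUES (?, ?)",
            (dataset_id, target_dag_id),
        )
        ddrq_id = conn.execute(
            "SELECT id FROM ddrq WHERE dataset_id = ? AND target_dag_id = ?",
            (dataset_id, target_dag_id),
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO ddrq_dataset_event (ddrq_id, dataset_event_id) VALUES (?, ?)",
            (ddrq_id, event_id),
        )
    return event_id, ddrq_id


def consume(conn, ddrq_id):
    """Delete a queue record and return exactly the event ids mapped to it."""
    with conn:
        rows = conn.execute(
            "SELECT dataset_event_id FROM ddrq_dataset_event "
            "WHERE ddrq_id = ? ORDER BY dataset_event_id",
            (ddrq_id,),
        ).fetchall()
        conn.execute("DELETE FROM ddrq_dataset_event WHERE ddrq_id = ?", (ddrq_id,))
        conn.execute("DELETE FROM ddrq WHERE id = ?", (ddrq_id,))
    return [row[0] for row in rows]
```

With this shape, the events belonging to a queue record come from a join on ids rather than a timestamp window, and an event that arrives after consumption lands on a fresh queue record with a new `id`.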
