dstandish commented on PR #26376: URL: https://github.com/apache/airflow/pull/26376#issuecomment-1247241432
> My personal preference would be to pass around the DatasetModel and DatasetEvent objects directly ultimately, they are still private, so, if someone wants to change it back, well, there's no reason they couldn't. it's sort of a cosmetic decision. but really, the main reason this came up was.... and i forgot to mention this earlier... was because i was performanrce testing the insert logic, and, (1) doing it this way meant not having to call the sqlalchemy relationship attr repeatedly, and (2) it also meant that i didn't have to set up the upstream-downstream relationships on the dataset ... i could just provide whatever IDs i wanted. so it made it easier to test when structured this way, and that was the the real reason i went a head and changed it. ultimately it's not of great consequence. it's a difference of one argument in the signature. > At first glance it seems like it would be simpler to just add a foreign key to DDRQ that points to the dataset event row that caused it, but since we only really care about the timestamp on the DDRQ in the scheduler, we just store and use the dataset event timestamp (copied through to the DDRQ). Is that correct? right so... yeah ... keying DDRQ to the event that caused it... it does kindof make sense. but the thing is .... it could ultimately be _multiple_ events that end up getting mapped to the same dag run. so ultimately what we do is.... so if dag 1 depends on datasets A and B, then if A is updated 5 times then finally B is updated once.... all of those A updates and the one B update are associated with the dag run, and the way that we figure that out is ... all the A events came before the B one... you know... it's an ephemeral table so there isn't a need really for a persistent ongoing association between the two (event and queue) and... i guess including just the value (not the key) saves a join in a query that needs to be fast. ideally... i'd prefer it if we could avoid timestamp comparisons completely and have the event-dag-run association be determined even more precisely but when i was thinking about how to do it, it seemed a bit fraught due to race conditions but if you have ideas i'm happy to explore. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
