dstandish commented on PR #26376:
URL: https://github.com/apache/airflow/pull/26376#issuecomment-1247241432

   > My personal preference would be to pass around the DatasetModel and 
DatasetEvent objects directly
   
   ultimately, they are still private, so, if someone wants to change it back, 
well, there's no reason they couldn't.  it's sort of a cosmetic decision.
   
   but really, the main reason this came up was.... and i forgot to mention 
this earlier... was because i was performanrce testing the insert logic, and, 
(1) doing it this way meant not having to call the sqlalchemy relationship attr 
repeatedly, and (2) it also meant that i didn't have to set up the 
upstream-downstream relationships on the dataset ... i could just provide 
whatever IDs i wanted.  so it made it easier to test when structured this way, 
and that was the the real reason i went a head and changed it.  ultimately it's 
not of great consequence.  it's a difference of one argument in the signature.
   
   > At first glance it seems like it would be simpler to just add a foreign 
key to DDRQ that points to the dataset event row that caused it, but since we 
only really care about the timestamp on the DDRQ in the scheduler, we just 
store and use the dataset event timestamp (copied through to the DDRQ). Is that 
correct?
   
   right so... yeah ... keying DDRQ to the event that caused it... it does 
kindof make sense.  but the thing is .... it could ultimately be _multiple_ 
events that end up getting mapped to the same dag run.  so ultimately what we 
do is.... so if dag 1 depends on datasets A and B, then if A is updated 5 times 
then finally B is updated once.... all of those A updates and the one B update 
are associated with the dag run, and the way that we figure that out is ... all 
the A events came before the B one...
   you know... it's an ephemeral table so there isn't a need really for a 
persistent ongoing association between the two  (event and queue) and... i 
guess including just the value (not the key) saves a join in a query that needs 
to be fast.
   
   ideally... i'd prefer it if we could avoid timestamp comparisons completely 
and have the event-dag-run association be determined even more precisely but 
when i was thinking about how to do it, it seemed a bit fraught due to race 
conditions but if you have ideas i'm happy to explore.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to