uranusjr commented on pull request #17891:
URL: https://github.com/apache/airflow/pull/17891#issuecomment-908224414


   This logic feels hacky to me. Even if the numbers of tasks are not the same, 
we can’t be sure it’s a duplicated ID either; the user might have renamed the 
file *and* made some modifications. I think this is theoretically impossible to 
fix in the current structure.
   
   To really resolve this, we need a place to aggregate all the known DAG IDs 
during a DAG-parsing run. One possibility is to implement something like a 
standalone process exposing a message queue that every DAG-parsing process 
sends parsed DAG IDs to, and which raises an error when it sees duplication.
   
   Another possibility is, instead of storing parsed DAGs directly to 
`SerializedDagModel`, DAG-parsing processes should save things to a different 
table (say `ParsingDagModel`) while they are running. This table would be empty 
when a DAG-parsing round starts, so any duplicated IDs are guaranteed to be 
real duplication (barring some filesystem race condition edge cases which we 
don’t currently cover anyway). After all the parsing processes successfully 
finish parsing this round (without reporting duplication), this 
`ParsingDagModel` is dumped into `SerializedDagModel` and truncated for the 
next parsing round.
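   A rough illustration of that second idea with plain SQLite instead of the 
real Airflow models (the `parsing_dag` staging table and both helper functions 
are invented for the sketch): the staging table's primary key rejects a second 
insert of the same DAG ID within a round, and a clean round is then promoted 
into the serialized table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parsing_dag (dag_id TEXT PRIMARY KEY, fileloc TEXT);
    CREATE TABLE serialized_dag (dag_id TEXT PRIMARY KEY, fileloc TEXT);
""")


def record_parsed(dag_id, fileloc):
    # An IntegrityError here means two files produced the same DAG ID
    # in the same parsing round -- a genuine duplicate, since the
    # staging table started the round empty.
    conn.execute("INSERT INTO parsing_dag VALUES (?, ?)", (dag_id, fileloc))


def finish_round():
    # Every parser finished without reporting duplication: promote the
    # staging table into the serialized one and truncate it for the
    # next round.
    conn.executescript("""
        DELETE FROM serialized_dag;
        INSERT INTO serialized_dag SELECT * FROM parsing_dag;
        DELETE FROM parsing_dag;
    """)
```

   The same effect could presumably be had with a unique constraint on the 
real model, at the cost of coordinating the promote/truncate step across 
parsing processes.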
   
   Both would require some pretty involved changes, though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

