potiuk edited a comment on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799257444


   Hey @ashb @kaxil -> #14531 created a good opportunity to test some race 
conditions in some of the flaky scheduler tests, and I would like to fix 
those flaky tests before I merge the parallel change.
   
   I extracted this change out of the 'parallel run change'; it contains my 
attempts to make the tests non-flaky. 
   
   The attempts helped - I managed to run several test passes without 
problems, both on parallel self-hosted instances and on Public Runners - but 
I am not at all sure whether:
   
   a) this is a good fix
   b) I am masking some real problem that might appear under heavy 
load in a real Airflow production system
   
   Since you are the experts in this area, maybe you can try to answer those 
questions?
   
   I captured example flaky tests in #14778 #14773 #14772 (you can see stack 
traces there).
   
   I have several hypotheses about why those flaky tests could appear. Maybe 
you can review/check/verify my changes and the hypotheses:
   
   * Hypothesis 1): There is a delay between the scheduler actually parsing the 
DAG and saving it into SerializedDagModel. Under heavy load this might be 
slightly delayed, and hence the row is not yet saved there. My changes assume 
that this hypothesis is valid - but maybe the fix could be done better (i.e. 
by waiting for the scheduler to complete).
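
   If this hypothesis holds, one alternative to my fix would be a small polling helper, so the test waits for the serialized row instead of asserting on it immediately. A minimal sketch (`wait_for` is a hypothetical helper, not something already in the test suite; the `SerializedDagModel.get` usage in the comment is only illustrative):

```python
import time

def wait_for(condition, timeout=5.0, interval=0.1):
    """Poll `condition` until it returns a truthy value or the timeout expires.

    A generic helper a test could use to tolerate the delay between the
    scheduler parsing the DAG and the serialized row landing in the DB.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("condition not met within %.1fs" % timeout)

# In a scheduler test, instead of asserting immediately, one could write:
#     row = wait_for(lambda: SerializedDagModel.get(dag_id, session))
```

This trades a hard-coded `time.sleep` (which either wastes time or is still too short under load) for a bounded retry loop that fails loudly on timeout.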
   
   * Hypothesis 2): There is another background process, resulting from other 
tests, that cleans up the SerializedDagModel, and while the previous tests have 
already finished (they are run sequentially within the Core test type), some 
background process is still running and removes the model. If that's the 
case, then this is the wrong fix. We need to find a way to get rid of the 
cross-test side effects and find a way to kill all the 
background threads in the tearDown/fixture of the tests.
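
   To verify this hypothesis, the tearDown/fixture could snapshot the running threads before each test and fail if new ones survive it. A rough sketch of such a check (hypothetical helper names; this is not existing Airflow test infrastructure, and it only catches in-process threads, not separate cleanup processes):

```python
import threading

def leaked_threads(before):
    """Return threads that are alive now but were not in the `before` snapshot."""
    return [t for t in threading.enumerate() if t not in before and t.is_alive()]

def assert_no_thread_leaks(before, join_timeout=5.0):
    """Join any threads a test started and fail loudly if they survive.

    Meant to be called from tearDown (or the teardown half of a pytest
    fixture), with the `before` snapshot taken in setUp.
    """
    for t in leaked_threads(before):
        t.join(timeout=join_timeout)  # give stragglers a chance to finish
    still_alive = [t.name for t in leaked_threads(before)]
    assert not still_alive, f"test leaked background threads: {still_alive}"
```

Even if this isn't the final fix, wiring it in as an autouse fixture would tell us quickly whether any test leaves background workers behind.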
   
   * Hypothesis 3): There is a delay in the MySQL/Postgres DB between the actual 
commit/flush and the time when a subsequent transaction can read the data. 
This is, to be honest, the most interesting one, and if this hypothesis is true, 
my fix is not good either. We should find a way to be sure that the change is 
actually written to the database and available to other sessions, to make the 
tests stable.
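
   The shape of the check I have in mind is: commit in one connection, then read from a second, independent connection. A toy illustration with stdlib sqlite3 (file/table names are made up; SQLite makes the row visible right after commit, whereas on MySQL/Postgres the result would additionally depend on the isolation level and on when the reader's transaction snapshot was taken - which is exactly what we'd be probing):

```python
import os
import sqlite3
import tempfile

# Hypothetical file and table names, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "visibility.db")

writer = sqlite3.connect(path)
writer.execute("CREATE TABLE serialized_dag (dag_id TEXT)")
writer.execute("INSERT INTO serialized_dag VALUES ('example_dag')")
writer.commit()  # before this commit, other connections cannot see the row

reader = sqlite3.connect(path)  # a fresh connection == a fresh transaction
rows = reader.execute("SELECT dag_id FROM serialized_dag").fetchall()
# rows == [('example_dag',)] once the commit has completed

writer.close()
reader.close()
```

If the flakiness is really isolation-level related (e.g. a reader transaction that started before the writer committed), the fix would be to make sure the reading session/transaction in the test starts only after the commit, rather than to retry.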
   
   
   I would love to hear your opinion @kaxil @ashb - any other hypotheses? What 
do you think about the hypotheses I have and the fix? 
   
   I am happy to test it in my parallel-test change if you have other 
proposals. Those problems appear often enough to be reproducible there, and I 
would like to hold off merging the parallel change until those flaky scheduler 
tests are fixed. 
   
   I am also happy to follow a similar step-by-step process for the other 
scheduler tests (there are a few more marked as quarantined), but we can treat 
them one by one and fix all of them before I merge the parallel tests change.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

