damondouglas commented on issue #32144:
URL: https://github.com/apache/beam/issues/32144#issuecomment-2521820063

   Unassigning myself but relaying my research on this ticket.
   
   # Situation
   
   This workflow's test failed roughly every 2 to 3 days in the past two weeks.
   
   # Background
   
   This workflow is scheduled to run twice daily. Inspection of the latest 
failures shows a timeout (`Failed: Timeout >1800.0s`) even though the actual 
Dataflow Job for that execution succeeded. The stack trace differs from one 
failure to the next across the past two weeks' failures. In each build scan's 
timeline we see that 
`:sdks:python:test-suites:dataflow:py39:runPerformanceTest` takes `~30m`, 
cutting off at the configured timeout.
   
   Said timeout is set on the `runPerformanceTest` Gradle task per 
https://github.com/pytest-dev/pytest-timeout. Dataflow Jobs for these failed 
tests take `~10 to 13m`. Successful tests do not print out any information 
about the Dataflow Job, so there is nothing to compare against.
   
   There are additional tasks performed by the `_run_wordcount_it` method, such 
as cleanup and publishing metrics to BigQuery. On further analysis, the cleanup 
and metrics publishing only require information about artifacts and metadata 
generated during the test, such as the Job Id, Google Cloud Storage files, etc. 
Notably, the test reads from an InfluxDB and then writes to BigQuery.
   
   # Assessment
   
   We can rule out failing Dataflow Jobs as a root cause of the failures. 
Moreover, there seems to be `~15m` of extra work executed within the test code 
outside the Dataflow Job execution. The after-test functions seem unnecessarily 
coupled with running the test.
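
   To confirm where the extra time goes, each phase of the test could be timed 
separately. The following is only a minimal sketch using the standard library; 
`timed_phase`, `phase_durations`, `report`, and the phase names are 
hypothetical and are not part of the existing Beam test code.

```python
import time
from contextlib import contextmanager

# Hypothetical helper, not part of the Beam test code.
# Maps a phase name to the seconds spent in that phase.
phase_durations = {}


@contextmanager
def timed_phase(name, clock=time.monotonic):
    """Record the elapsed seconds of the wrapped block under `name`."""
    start = clock()
    try:
        yield
    finally:
        # Recorded even if the phase raises, so a failure still shows
        # how long the phase ran before failing.
        phase_durations[name] = clock() - start


def report(durations):
    """Return one line per phase, longest phase first."""
    ordered = sorted(durations.items(), key=lambda kv: kv[1], reverse=True)
    return [f"{name}: {seconds:.1f}s" for name, seconds in ordered]


if __name__ == "__main__":
    with timed_phase("dataflow_job"):
        pass  # placeholder for running the pipeline
    with timed_phase("cleanup"):
        pass  # placeholder for after-test cleanup
    with timed_phase("publish_metrics"):
        pass  # placeholder for publishing metrics
    print("\n".join(report(phase_durations)))
```

   Printing the report before the timeout fires would show whether cleanup or 
metrics publishing accounts for the time outside the Dataflow Job.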
   
   # Recommendations
   
   - Remove the after-test cleanup and consider using a Google Cloud Storage 
wildcard approach to schedule a deletion of test artifacts outside test 
execution.
   - Remove the InfluxDB read and write to BigQuery. Perhaps use a scheduled 
batch or streaming Pipeline to collect these results into BigQuery.
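
   One way to decouple this work, sketched here under assumptions: the test 
only records the artifacts and metadata it produced (Job Id, Google Cloud 
Storage paths, metrics) in a small manifest file, and a separately scheduled 
process reads the manifests to do the cleanup and publishing. `RunArtifacts`, 
`write_manifest`, `read_manifests`, and the example values are hypothetical 
names for illustration, not existing Beam code.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class RunArtifacts:
    """Hypothetical record of what a single test run produced."""
    job_id: str
    gcs_paths: list = field(default_factory=list)
    metrics: dict = field(default_factory=dict)


def write_manifest(artifacts, directory):
    """Called by the test: persist the record and return immediately."""
    path = Path(directory) / f"{artifacts.job_id}.json"
    path.write_text(json.dumps(asdict(artifacts), sort_keys=True))
    return path


def read_manifests(directory):
    """Called by a scheduled job outside test execution."""
    return [
        RunArtifacts(**json.loads(p.read_text()))
        for p in sorted(Path(directory).glob("*.json"))
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as manifest_dir:
        run = RunArtifacts(
            job_id="example-job-id",  # placeholder value
            gcs_paths=["gs://example-bucket/tmp/example-job-id/*"],
            metrics={"runtime_seconds": 700},
        )
        write_manifest(run, manifest_dir)
        for pending in read_manifests(manifest_dir):
            # The scheduled job would delete pending.gcs_paths and
            # publish pending.metrics here.
            print(pending.job_id, pending.gcs_paths)
```

   With this split, the test's runtime would be bounded by the Dataflow Job 
plus a single local file write.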


