Xiao-zhen-Liu commented on issue #3570:
URL: https://github.com/apache/texera/issues/3570#issuecomment-3353998455

   These CI failures were non-deterministic and happened occasionally in 
`DataProcessingSpec`. They occurred because the test workflows did not finish 
executing within the timeout:
   
   ```
   [info] Engine
   [info] - should execute jsonl workflow normally *** FAILED ***
   [info]   com.twitter.util.TimeoutException: 1.minutes
   [info]   at com.twitter.util.Promise.ready(Promise.scala:680)
   [info]   at com.twitter.util.Promise.result(Promise.scala:689)
   [info]   at com.twitter.util.Await$.$anonfun$result$1(Awaitable.scala:155)
   [info]   at com.twitter.concurrent.LocalScheduler$Activation.blocking(Scheduler.scala:189)
   [info]   at com.twitter.concurrent.LocalScheduler.blocking(Scheduler.scala:256)
   [info]   at com.twitter.concurrent.Scheduler$.blocking(Scheduler.scala:85)
   [info]   at com.twitter.util.Await$.result(Awaitable.scala:155)
   [info]   at edu.uci.ics.amber.engine.e2e.DataProcessingSpec.executeWorkflow(DataProcessingSpec.scala:124)
   [info]   at edu.uci.ics.amber.engine.e2e.DataProcessingSpec.$anonfun$new$3(DataProcessingSpec.scala:159)
   [info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
   [info]   ...
   ```
   
   I am not sure why these timeouts happened, as they could not be reproduced locally.
   
   However, I noticed that after #3711 the timeout issues stopped happening. 
That PR changed the default Iceberg catalog from Hadoop to PostgreSQL. 
`DataProcessingSpec` executes a workflow in each of its test cases, and the 
Iceberg catalog is accessed during each execution.
   
   Instead, in all the recent CI runs, the previous non-deterministic timeout 
issue has been replaced by another non-deterministic issue, where some test 
cases fail with `java.lang.Throwable: java.sql.SQLException: No suitable driver 
found for jdbc:postgresql://localhost:5432/texera_iceberg_catalog`, and it 
seems to happen more frequently. I tried this on a 
[PR](https://github.com/apache/texera/actions/runs/18113967249/job/51549275359) 
and it happened in 3 out of 6 runs.
   
   Affected test suites are `DataProcessingSpec` and `PauseSpec`. Both are e2e 
tests.
   
   Unlike the previous issue, the new failure only happens in the first test 
case that needs to access the PostgreSQL Iceberg catalog. There are 12 such 
test cases across the 2 test suites. The order in which these test cases run in 
CI is not deterministic, but when the error happens, it is always in the first 
of these 12 test cases to run.
   
   One potential fix I tried is to explicitly load the JDBC driver (in the 
code) at the initialization of these 2 test suites, and it seems to work. I 
ran the CI [multiple 
times](https://github.com/apache/texera/actions/runs/18118880091) and no random 
failures occurred.
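
   For illustration, explicitly loading the driver could look roughly like the 
sketch below. This is not the code from the PR: the object name 
`JdbcDriverPreload`, its methods, and the driver class name 
`org.postgresql.Driver` are assumptions made for the example. It relies on the 
usual JDBC convention that a driver registers itself with `DriverManager` when 
its class is initialized.

   ```scala
   import java.sql.DriverManager
   import scala.collection.mutable.ListBuffer
   import scala.util.Try

   // Hypothetical helper, not part of the Texera code base.
   object JdbcDriverPreload {
     // Assumed class name of the PostgreSQL JDBC driver.
     val PostgresDriverClass = "org.postgresql.Driver"

     // Forces the driver class to load and initialize, which by JDBC
     // convention registers it with DriverManager. Returns false if the
     // class is not on the classpath.
     def preload(driverClass: String = PostgresDriverClass): Boolean =
       Try(Class.forName(driverClass)).isSuccess

     // Class names of the drivers currently registered and visible to the
     // caller, useful for logging when diagnosing "No suitable driver" errors.
     def registeredDrivers(): List[String] = {
       val drivers = DriverManager.getDrivers
       val names = ListBuffer.empty[String]
       while (drivers.hasMoreElements) {
         names += drivers.nextElement().getClass.getName
       }
       names.toList
     }
   }
   ```

   A suite could call `JdbcDriverPreload.preload()` once before any test case 
runs, for example from a `beforeAll` hook if the suite mixes in ScalaTest's 
`BeforeAndAfterAll` (also an assumption here), and log 
`JdbcDriverPreload.registeredDrivers()` if the call returns `false`.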
   
   Although I am not sure of the root cause of this new non-deterministic JDBC 
driver loading issue, I suspect it might be related to `akka.Testkit`, as the 2 
affected e2e tests both need to access JDBC inside an actor system.


