Xiao-zhen-Liu commented on issue #3570: URL: https://github.com/apache/texera/issues/3570#issuecomment-3353998455
These CI failures were nondeterministic and occurred occasionally in `DataProcessingSpec`. They were caused by the test workflows not finishing execution within the timeout:

```
[info] Engine
[info] - should execute jsonl workflow normally *** FAILED ***
[info]   com.twitter.util.TimeoutException: 1.minutes
[info]   at com.twitter.util.Promise.ready(Promise.scala:680)
[info]   at com.twitter.util.Promise.result(Promise.scala:689)
[info]   at com.twitter.util.Await$.$anonfun$result$1(Awaitable.scala:155)
[info]   at com.twitter.concurrent.LocalScheduler$Activation.blocking(Scheduler.scala:189)
[info]   at com.twitter.concurrent.LocalScheduler.blocking(Scheduler.scala:256)
[info]   at com.twitter.concurrent.Scheduler$.blocking(Scheduler.scala:85)
[info]   at com.twitter.util.Await$.result(Awaitable.scala:155)
[info]   at edu.uci.ics.amber.engine.e2e.DataProcessingSpec.executeWorkflow(DataProcessingSpec.scala:124)
[info]   at edu.uci.ics.amber.engine.e2e.DataProcessingSpec.$anonfun$new$3(DataProcessingSpec.scala:159)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
[info]   ...
```

I'm not sure why these timeouts happened, since they could not be reproduced locally. However, I noticed that after #3711 the timeouts stopped. That PR changed the default Iceberg catalog from Hadoop to PostgreSQL. `DataProcessingSpec` executes a workflow in each of its test cases, and every execution needs to access the Iceberg catalog.

Since then, in all recent CI runs, the earlier nondeterministic timeout has been replaced by a different nondeterministic failure that seems to happen more frequently: some test cases fail with `java.lang.Throwable: java.sql.SQLException: No suitable driver found for jdbc:postgresql://localhost:5432/texera_iceberg_catalog`. I tried this on a [PR](https://github.com/apache/texera/actions/runs/18113967249/job/51549275359) and it happened in 3 out of 6 runs. The affected test suites are `DataProcessingSpec` and `PauseSpec`.
Both are e2e tests. Unlike the previous issue, the new failure only affects the first test case that accesses the PostgreSQL Iceberg catalog. There are 12 such test cases across the two suites; the order in which they run in CI is not deterministic, but when the error occurs, it is always on whichever of these 12 test cases runs first.

One potential fix I tried is to explicitly load the JDBC driver in code when these two test suites are initialized, and it seems to work: I ran CI [multiple times](https://github.com/apache/texera/actions/runs/18118880091) and no random failures occurred. Although I am not sure of the root cause of this new nondeterministic JDBC driver loading issue, I suspect it might be related to `akka.Testkit`, as both affected e2e tests need to access JDBC inside an actor system.
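A minimal sketch of the explicit driver loading described above, assuming the PostgreSQL JDBC driver class is `org.postgresql.Driver` (the pgJDBC driver class, whose static initializer registers it with `java.sql.DriverManager`) and Scala 2.13+. The `JdbcDriverPreloader` object and its methods are hypothetical helpers for illustration, not code from the Texera repository, and the actual fix may look different.

```scala
import java.sql.DriverManager
import scala.jdk.CollectionConverters._
import scala.util.Try

// Hypothetical helper (not from the Texera codebase): eagerly load a JDBC
// driver class so that its static initializer registers it with DriverManager.
object JdbcDriverPreloader {

  /** Loads `driverClassName` via Class.forName, which initializes the class.
    * Drivers such as org.postgresql.Driver register themselves with
    * DriverManager in that static initializer.
    * Returns false if the class cannot be found (ClassNotFoundException).
    */
  def preload(driverClassName: String): Boolean =
    Try(Class.forName(driverClassName)).isSuccess

  /** Class names of the registered drivers that the caller can access,
    * as reported by DriverManager.getDrivers.
    */
  def registeredDrivers(): List[String] =
    DriverManager.getDrivers.asScala.map(_.getClass.getName).toList
}
```

In a ScalaTest suite that mixes in `BeforeAndAfterAll`, the call would go in `beforeAll`, for example `JdbcDriverPreloader.preload("org.postgresql.Driver")`, so the driver is registered before the first test case touches the Iceberg catalog. Checking `registeredDrivers()` afterwards is one way to confirm the registration took effect.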
