andygrove opened a new pull request, #4360:
URL: https://github.com/apache/datafusion-comet/pull/4360

   ## Which issue does this PR close?
   
   Closes #4327.
   
   ## Rationale for this change
   
   The `spark-sql-auto-sql_core-1/spark-4.0.2-jdk21` job has been failing 
frequently with:
   
   ```
   [info] ParquetFileFormatV{1,2}Suite (or OrcSourceV{1,2}Suite) *** ABORTED ***
   [info]   \"There are 1 possibly leaked file streams.\" 
(SharedSparkSession.scala:189)
   ```
   
   This is Spark's `DebugFilesystem.assertNoOpenStreams()` in 
`SharedSparkSession.afterEach`, retried via `eventually` for ~10s.
   
   These four suites are flaky on JDK 21 even in apache/spark's own CI; the 
upstream workaround is `DEDICATED_JVM_SBT_TESTS`, which asks SBT to fork a 
dedicated JVM per listed suite. Comet's workflow already sets that env var for 
the Spark 4.0 row, but it has no effect because the workflow also 
unconditionally sets `SERIAL_SBT_TESTS=1` (added in #4285 to cap peak memory on 
standard runners). In `project/SparkBuild.scala`:
   
   ```scala
   if (!sys.env.contains(\"SERIAL_SBT_TESTS\")) {
     allProjects.foreach(enable(SparkParallelTestGrouping.settings))
   }
   ```
   
   `SparkParallelTestGrouping.settings` is the only consumer of 
`DEDICATED_JVM_SBT_TESTS`, so when `SERIAL_SBT_TESTS` is set the env var is 
read into an unused set and every suite runs in one shared forked JVM. 
Cross-suite state accumulation from ~11k prior tests is what trips the 
file-stream leak detection.
   
   Evidence from a failing log ([run 
26004020697](https://github.com/apache/datafusion-comet/actions/runs/26004020697/job/76441382377)):
   - thread-leak warning mentions 
`readingParquetFooters-ForkJoinPool-12260-worker-1` (high pool counter = many 
prior suites in this JVM)
   - unrelated suites (`AnalysisConfOverrideSuite`, 
`TPCDSModifiedPlanStabilityWithStatsSuite`, state-store suites) appear in the 
same JVM right before these Parquet/ORC suites
   - no SBT fork-restart markers between unrelated suites
   
   ## What changes are included in this PR?
   
   - Drop `SERIAL_SBT_TESTS=1` only for the Spark 4.0.2/JDK 21 matrix row, so 
Spark's `SparkParallelTestGrouping` is installed and `DEDICATED_JVM_SBT_TESTS` 
actually forks a separate JVM per listed suite. Other rows keep 
`SERIAL_SBT_TESTS=1` so their memory profile is unchanged.
   - Override `Global / concurrentRestrictions` to cap parallel forked test 
JVMs at 1. Spark's defaults would otherwise allow up to 4 in parallel (each 
`-Xmx2g`), which would exceed the 7 GB runner budget. The cap is a no-op for 
rows that still have `SERIAL_SBT_TESTS=1`.
   
   ## How are these changes tested?
   
   CI on this PR will exercise the new path. A successful run for 
`spark-sql-auto-sql_core-1/spark-4.0.2-jdk21` confirms that the dedicated-JVM 
workaround is now in effect and the four problem suites no longer abort.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to