[PR] [SPARK-57370][SQL] Add a JDK (javac) compiler backend for whole-stage codegen [spark]

via GitHub Wed, 10 Jun 2026 06:39:47 -0700


LuciferYang opened a new pull request, #56430:
URL: https://github.com/apache/spark/pull/56430


   ### What changes were proposed in this pull request?
   
   This PR adds an internal config `spark.sql.codegen.compiler` that selects 
the compiler used to turn generated Java source into bytecode:
   
   - `janino` (default): the current behavior. The existing compile path in 
`CodeGenerator.doCompile` is moved to `JaninoCodeCompiler` unchanged.
   - `jdk`: compiles with `javax.tools.JavaCompiler` from the running JDK.
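
For readers unfamiliar with `javax.tools`, the following is a minimal sketch of compiling a Java source string to bytecode in memory with the running JDK's compiler. It is not the code in this PR: the class names `InMemoryJavac`, `StringSource`, and `BytecodeSink` are hypothetical, and it uses the default classpath rather than resolving classes through a classloader.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.tools.DiagnosticCollector;
import javax.tools.FileObject;
import javax.tools.ForwardingJavaFileManager;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileManager;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.StandardJavaFileManager;
import javax.tools.ToolProvider;

// Hypothetical helper, not part of Spark.
final class InMemoryJavac {

    /** Java source held in a String instead of a file on disk. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className.replace('.', '/') + Kind.SOURCE.extension),
                  Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Class file output captured in a byte array. */
    static final class BytecodeSink extends SimpleJavaFileObject {
        final ByteArrayOutputStream bytes = new ByteArrayOutputStream();

        BytecodeSink(String className) {
            super(URI.create("mem:///" + className.replace('.', '/') + Kind.CLASS.extension),
                  Kind.CLASS);
        }

        @Override
        public OutputStream openOutputStream() {
            return bytes;
        }
    }

    /** Compiles one compilation unit and returns class name -> bytecode. */
    static Map<String, byte[]> compile(String className, String source) {
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        if (javac == null) {
            throw new IllegalStateException("No system Java compiler (running on a JRE?)");
        }
        Map<String, BytecodeSink> sinks = new HashMap<>();
        DiagnosticCollector<JavaFileObject> diags = new DiagnosticCollector<>();
        try (StandardJavaFileManager std = javac.getStandardFileManager(diags, null, null)) {
            // Redirect every generated class file into an in-memory sink.
            JavaFileManager fm = new ForwardingJavaFileManager<StandardJavaFileManager>(std) {
                @Override
                public JavaFileObject getJavaFileForOutput(JavaFileManager.Location location,
                        String name, JavaFileObject.Kind kind, FileObject sibling) {
                    BytecodeSink sink = new BytecodeSink(name);
                    sinks.put(name, sink);
                    return sink;
                }
            };
            boolean ok = javac.getTask(null, fm, diags, null, null,
                    List.of(new StringSource(className, source))).call();
            if (!ok) {
                throw new IllegalArgumentException("Compilation failed: " + diags.getDiagnostics());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, byte[]> out = new HashMap<>();
        sinks.forEach((name, sink) -> out.put(name, sink.bytes.toByteArray()));
        return out;
    }
}
```

The returned bytes can then be defined through a custom `ClassLoader`; how the PR loads the result is not shown here.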
   
   Generators keep emitting exactly the source they emit today; the JDK backend 
adapts it where javac is stricter than Janino. It wraps the class body into a 
real compilation unit, hoists class-body `import` lines, rewrites binary 
inner-class names (`Outer$Inner`) to the source form, and strips the explicit 
`Function1` bridge that Janino needs but javac synthesizes itself. Referenced 
classes are resolved through the task's context classloader (the same way 
Janino resolves them) rather than a `-classpath`, so classes that exist only on 
a runtime loader still work. javac runs on a dedicated thread so task 
interrupts cannot break its jar reads.
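
The source adaptations described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: `ClassBodyAdapter` and its regular expressions are hypothetical, the `Outer$Inner` rewrite here is naive (it does not skip string literals or comments, which the PR's tests say the real rewrite handles), and bridge stripping is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Spark.
final class ClassBodyAdapter {
    // A whole line of the form "import a.b.C;" (assumed shape of class-body imports).
    private static final Pattern IMPORT_LINE =
        Pattern.compile("\\s*import\\s+[\\w.*]+\\s*;\\s*");
    // A '$' between a word character and an uppercase letter, as in Outer$Inner.
    // Naive: this also matches inside string literals and comments.
    private static final Pattern BINARY_NAME_SEPARATOR =
        Pattern.compile("(?<=\\w)\\$(?=[A-Z])");

    /** Wraps a Janino-style class body into a full compilation unit for javac. */
    static String toCompilationUnit(String pkg, String className, String classBody) {
        List<String> imports = new ArrayList<>();
        List<String> body = new ArrayList<>();
        for (String line : classBody.split("\n", -1)) {
            if (IMPORT_LINE.matcher(line).matches()) {
                // Hoist: imports are illegal inside a class body under javac.
                imports.add(line.trim());
            } else {
                // Rewrite binary inner-class names to the source form.
                body.add(BINARY_NAME_SEPARATOR.matcher(line).replaceAll("."));
            }
        }
        StringBuilder unit = new StringBuilder();
        unit.append("package ").append(pkg).append(";\n");
        for (String imp : imports) {
            unit.append(imp).append('\n');
        }
        unit.append("public class ").append(className).append(" {\n");
        for (String line : body) {
            unit.append(line).append('\n');
        }
        unit.append("}\n");
        return unit.toString();
    }
}
```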
   
   A few shared codegen templates were adjusted to be legal under both 
compilers, e.g. explicit casts where Janino erases generics, and `catch 
(Throwable)` for invoked methods declaring `throws Throwable`. These are no-ops 
under Janino.
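
As an illustration of the kind of strictness involved, the sketch below shows two patterns that javac enforces: a cast on a value read through a raw type, and a `catch (Throwable)` around a call that declares `throws Throwable`. The class and method names are hypothetical, and whether these match the exact templates changed in this PR is an assumption.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.List;

// Hypothetical examples, not taken from Spark's codegen templates.
final class StrictTemplateExamples {

    /** Reading through a raw type yields Object, so javac requires the cast. */
    @SuppressWarnings("rawtypes")
    static int firstLength(List<String> values) {
        List raw = values;
        String first = (String) raw.get(0); // without the cast javac rejects this line
        return first.length();
    }

    /** MethodHandle.invokeExact declares `throws Throwable`, so javac requires a handler. */
    static int invokeLength(String s) {
        try {
            MethodHandle length = MethodHandles.lookup()
                .findVirtual(String.class, "length", MethodType.methodType(int.class));
            return (int) length.invokeExact(s);
        } catch (Throwable t) {
            // Catching a narrower type would not compile under javac.
            throw new RuntimeException(t);
        }
    }
}
```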
   
   Two kinds of code can never compile with javac and are always routed to 
Janino. The routing is decided up front rather than as a fallback after a 
failed compile. The first kind is codegen in REPL / Connect interactive 
sessions, because javac cannot resolve REPL-generated classes. The second is 
generated code referencing a class nested in a Scala `package object`, because 
`package` is a Java reserved word and such a class name cannot be spelled in 
Java source. A one-time INFO log records each routing so the choice is visible 
to operators.
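
The up-front routing can be pictured with a small sketch. Everything here is hypothetical: `BackendRouter`, the `interactiveSession` flag, and the string check for `.package$` / `.package.` are stand-ins for whatever detection the PR actually uses.

```java
// Hypothetical sketch of up-front backend routing, not the PR's code.
final class BackendRouter {
    enum Backend { JANINO, JDK }

    /**
     * Picks a backend before any compile is attempted.
     *
     * @param configured         value of spark.sql.codegen.compiler ("janino" or "jdk")
     * @param interactiveSession assumed flag for REPL / Connect interactive sessions
     * @param generatedSource    the generated Java source
     */
    static Backend choose(String configured, boolean interactiveSession, String generatedSource) {
        if (!"jdk".equalsIgnoreCase(configured)) {
            return Backend.JANINO;
        }
        if (interactiveSession) {
            // javac cannot resolve REPL-generated classes.
            return Backend.JANINO;
        }
        // Assumed heuristic: a class nested in a Scala package object shows up
        // with a `package` segment in its name, which Java source cannot spell.
        if (generatedSource.contains(".package$") || generatedSource.contains(".package.")) {
            return Backend.JANINO;
        }
        return Backend.JDK;
    }
}
```

Deciding before compilation avoids paying for a javac run that is known to fail.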
   
   Note: the default is temporarily flipped to `jdk` in this PR so that CI runs 
the entire test suite against the new backend. It will be reverted to `janino` 
before merge (marked with a TEMP comment in `SQLConf`).
   
   ### Why are the changes needed?
   
   Whole-stage codegen depends entirely on Janino, which is unmaintained 
upstream (last release 3.1.12, Feb 2024). This PR gives Spark an alternative 
that is maintained on the JDK release cadence, without changing the default. 
Janino stays preferred because it compiles small generated units 30-300x 
faster, but if a Janino bug or incompatibility ever blocks Spark, users can 
switch with one config instead of waiting on an upstream that may never 
release again.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. The config is internal and the default behavior is unchanged (once the 
temporary CI-only default flip above is reverted).
   
   ### How was this patch tested?
   
   - New `CodeCompilerSuite` with 40 tests covering backend selection and 
routing, the source adaptations (import hoisting, bridge stripping, 
inner-class-name rewriting including string/comment safety), class resolution 
through enumerable and non-enumerable classloaders, compile-error type parity 
between the backends, per-backend separation in the compile cache, and 
interrupt isolation.
   - The full GHA matrix passes with the default flipped to `jdk`, i.e. the 
whole Spark test suite compiles and runs on javac-produced bytecode.
   - The Janino path is a code move; it was validated unchanged with 
`SQLQueryTestSuite` and the codegen suites under `janino` locally.
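
To make the interrupt-isolation property concrete, here is a minimal sketch of running work on a dedicated thread so that an interrupt delivered to the calling task thread never reaches the worker. `IsolatedCompile` is hypothetical, and the policy of waiting for the result and restoring the caller's interrupt flag afterwards is an assumption, not necessarily what the PR does.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of Spark.
final class IsolatedCompile {
    // One dedicated daemon thread; task threads never run the work themselves.
    private static final ExecutorService COMPILER = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "codegen-javac");
        t.setDaemon(true);
        return t;
    });

    /** Runs `work` on the dedicated thread and waits for it, even if the caller is interrupted. */
    static <T> T run(Callable<T> work) {
        Future<T> future = COMPILER.submit(work);
        boolean interrupted = false;
        try {
            while (true) {
                try {
                    return future.get();
                } catch (InterruptedException e) {
                    // Remember the interrupt but keep waiting; the worker is untouched.
                    interrupted = true;
                } catch (ExecutionException e) {
                    throw new RuntimeException(e.getCause());
                }
            }
        } finally {
            if (interrupted) {
                // Restore the caller's interrupt status for upstream handling.
                Thread.currentThread().interrupt();
            }
        }
    }
}
```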
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code
   

