LuciferYang opened a new pull request, #56430: URL: https://github.com/apache/spark/pull/56430
### What changes were proposed in this pull request?

This PR adds an internal config `spark.sql.codegen.compiler` that selects the compiler used to turn generated Java source into bytecode:

- `janino` (default): the current behavior. The existing compile path in `CodeGenerator.doCompile` moved to `JaninoCodeCompiler` unchanged.
- `jdk`: compiles with `javax.tools.JavaCompiler` from the running JDK.

Generators keep emitting exactly the source they emit today; the JDK backend adapts it where javac is stricter than Janino. It wraps the class body into a real compilation unit, hoists class-body `import` lines, rewrites binary inner-class names (`Outer$Inner`) to the source form, and strips the explicit `Function1` bridge that Janino needs but javac synthesizes itself. Referenced classes are resolved through the task's context classloader (the same way Janino resolves them) rather than a `-classpath`, so classes that exist only on a runtime loader still work. javac runs on a dedicated thread so task interrupts cannot break its jar reads.

A few shared codegen templates were adjusted to be legal under both compilers, e.g. explicit casts where Janino erases generics, and `catch (Throwable)` for invoked methods declaring `throws Throwable`. These are no-ops under Janino.

Two kinds of code can never compile with javac and are always routed to Janino, decided up front rather than as a fallback after a failed compile:

- codegen in REPL / Connect interactive sessions (javac cannot resolve REPL-generated classes), and
- generated code referencing a class nested in a Scala `package object` (`package` is a Java reserved word that cannot be spelled in Java source).

A one-time INFO log records each routing so the choice is visible to operators.

Note: the default is temporarily flipped to `jdk` in this PR so that CI runs the entire test suite against the new backend. It will be reverted to `janino` before merge (marked with a TEMP comment in `SQLConf`).

### Why are the changes needed?
Whole-stage codegen depends entirely on Janino, which is unmaintained upstream (last release 3.1.12, Feb 2024). This gives Spark an alternative that is maintained on the JDK release cadence, without changing the default: Janino stays preferred since it compiles small generated units 30-300x faster, but if a Janino bug or incompatibility ever blocks Spark, users can switch with one config instead of waiting on an upstream that may never release again.

### Does this PR introduce _any_ user-facing change?

No. The config is internal and the default behavior is unchanged (once the temporary CI-only default flip above is reverted).

### How was this patch tested?

- New `CodeCompilerSuite` with 40 tests covering backend selection and routing, the source adaptations (import hoisting, bridge stripping, inner-class-name rewriting including string/comment safety), class resolution through enumerable and non-enumerable classloaders, compile-error type parity between the backends, per-backend separation in the compile cache, and interrupt isolation.
- The full GHA matrix passes with the default flipped to `jdk`, i.e. the whole Spark test suite compiles and runs on javac-produced bytecode.
- The Janino path is a code move; it was validated unchanged with `SQLQueryTestSuite` and the codegen suites under `janino` locally.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code
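
For readers unfamiliar with why a Janino-style class body needs adapting before javac accepts it, here is a minimal sketch of two of the adaptations described in the first section: hoisting class-body `import` lines and rewriting `Outer$Inner` binary names to source form. It is an illustration only, not the code in this PR. The class name `SourceAdapterSketch` and the method `toCompilationUnit` are hypothetical, and the rewrite is deliberately naive: unlike the behavior the PR's tests describe, it does not protect string literals or comments, and it does not strip the `Function1` bridge.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; NOT Spark's actual implementation.
final class SourceAdapterSketch {
    // A `$` sitting between two identifier characters, as in `Outer$Inner`.
    // Assumption: generated identifiers do not otherwise contain `$`.
    private static final Pattern BINARY_INNER =
        Pattern.compile("(?<=[A-Za-z0-9_])\\$(?=[A-Za-z_])");

    // Wraps a class body into a compilation unit: import lines found in the
    // body are moved above the class declaration, everything else stays inside.
    static String toCompilationUnit(String className, String classBody) {
        List<String> imports = new ArrayList<>();
        List<String> members = new ArrayList<>();
        for (String line : classBody.split("\n")) {
            // Rewrite binary inner-class names to their source form.
            String rewritten = BINARY_INNER.matcher(line).replaceAll(".");
            String trimmed = rewritten.trim();
            if (trimmed.startsWith("import ") && trimmed.endsWith(";")) {
                imports.add(trimmed);
            } else {
                members.add(rewritten);
            }
        }
        StringBuilder unit = new StringBuilder();
        for (String imp : imports) {
            unit.append(imp).append('\n');
        }
        unit.append("public class ").append(className).append(" {\n");
        for (String member : members) {
            unit.append(member).append('\n');
        }
        unit.append("}\n");
        return unit.toString();
    }
}
```

Calling `SourceAdapterSketch.toCompilationUnit("GeneratedClass", body)` on a body that starts with `import java.util.Map;` and declares a field of type `Outer$Inner` yields a unit with the import above `public class GeneratedClass {` and the field typed `Outer.Inner`. A line-based pass like this is the simplest thing that shows the idea; a real backend would need a tokenizer-aware rewrite to stay safe inside strings and comments.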
