sezruby opened a new pull request, #12245: URL: https://github.com/apache/gluten/pull/12245
**Draft.** Stacked on #12244. The diff below is what the unbundle looks like in pom-form; cross-distro testing (vanilla 3.5, DBR 16.4, Cloudera, 4.0/4.1) is still TODO and gates merge.

## What changes were proposed in this pull request?

Stop bundling `arrow-memory-*` and `arrow-vector` in `gluten-velox-bundle`. Mark them as `scope=provided` in `gluten-arrow/pom.xml` and rely on Spark's own Arrow distribution at runtime (`$SPARK_HOME/jars/` for Spark 3.x; declared in Spark 4.x's pom). `arrow-c-data` and `arrow-dataset` stay bundled, since Spark does not ship those.

## Why

Follow-up from #12226 discussion. The bundled-and-shaded-Arrow approach is the source of #12225 (and the similar #7423): when gluten's bundle wins classloader resolution, its class signatures collide with the user's vanilla Arrow. #12226 fixed the immediate `NoSuchMethodError` by un-shading, but as @zhztheplayer noted, the assumption that "Memory and vector APIs should be stable across minor versions" is a real risk worth eliminating; the cleanest fix is to not ship them at all.

Effects:

* `gluten-velox-bundle` no longer contains any `org.apache.arrow.memory.*` or `org.apache.arrow.vector.*` classes. Class-shadowing from #12225 disappears by construction.
* The `org.apache.arrow` shade-relocation block in `package/pom.xml` is removed (nothing to relocate, since memory/vector aren't bundled and c-data/dataset were already excluded).
* `arrow-c-data` / `arrow-dataset` remain bundled. With no relocation, their public API signatures bind to vanilla `BufferAllocator` / `VectorSchemaRoot`, which is exactly what every other Arrow C-Data caller on the classpath expects.
* `backends-velox/pom.xml` re-declares `arrow-memory-core` and `arrow-vector` at `provided` scope so its compile classpath still resolves them after the gluten-arrow scope flip. `gluten-ut/*` and `backends-clickhouse` already declare them locally.

## Open questions / why this is a Draft

1. **`arrow.version` pin per Spark distro.** `<arrow.version>15.0.0</arrow.version>` matches vanilla Spark 3.5.x. DBR 16.4 ships Spark 3.5 with Arrow 12.0.1, so gluten compiled against 15 might throw `NoSuchMethodError` on DBR. We need to either (a) downgrade to the lowest common denominator (12.0.1) for the Spark-3.5 profile, (b) add a DBR-specific profile, or (c) declare gluten as DBR-incompatible. Cloudera flavors need similar verification.
2. **Cross-distro test matrix.** I want to run the gluten test suite against vanilla 3.5, DBR 16.4, Cloudera CDS, and 4.0/4.1 before merging. CI here only covers vanilla.
3. **Velox C++ side still uses bundled Arrow.** The cpp side links its own Arrow (the C++ patches in `ep/build-velox/src/modify_arrow.patch`); this PR only changes JVM-side bundling. The JVM ↔ C++ exchange happens via Arrow C-Data's stable ABI, so the JVM-side Arrow version doesn't need to match the C++-side one. Worth noting in case anyone assumes the two versions must track each other.
4. `dev/check-arrow-c-shading.sh` from #12226 still passes: the bundle still has the `org/apache/arrow/c/*` classes, and their signatures now reference unshaded `memory.*` / `vector.*` types (which resolve from Spark's bundled Arrow at runtime).

## How was this patch tested?

Local build only. CI green needed before un-drafting.

## Closes / refs

* Follow-up from #12226 discussion (zhztheplayer + FelixYBW asked for this direction).
* Subsumes the immediate need for #12226's `dev/check-arrow-c-shading.sh` if the bundle fully stops containing memory/vector classes, but the script remains useful as a regression guard for the c-data classes still in the bundle.

Stacked on:

* #12244: drop the `15.0.0-gluten` Arrow version rename
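
The bundle-content claims above (no `org.apache.arrow.memory.*` / `org.apache.arrow.vector.*` classes, while `org/apache/arrow/c/*` stays) could be checked locally with something like the following. This is a minimal sketch, not the project's `dev/check-arrow-c-shading.sh`; the function name `check_bundle` and the constants are hypothetical, and it only looks at unrelocated class paths inside the jar.

```python
import zipfile

# Package prefixes taken from the PR text: these should NOT appear in the bundle...
FORBIDDEN_PREFIXES = ("org/apache/arrow/memory/", "org/apache/arrow/vector/")
# ...while the Arrow C-Data classes are expected to remain bundled.
REQUIRED_PREFIX = "org/apache/arrow/c/"


def check_bundle(jar):
    """Inspect a jar (path or binary file-like object).

    Returns (leaked, has_c_data):
      leaked     -- sorted .class entries found under FORBIDDEN_PREFIXES
      has_c_data -- True if any .class entry sits under REQUIRED_PREFIX
    """
    with zipfile.ZipFile(jar) as zf:
        classes = [name for name in zf.namelist() if name.endswith(".class")]
    leaked = sorted(name for name in classes if name.startswith(FORBIDDEN_PREFIXES))
    has_c_data = any(name.startswith(REQUIRED_PREFIX) for name in classes)
    return leaked, has_c_data
```

If the effects listed in this PR hold, calling `check_bundle` on the built bundle jar (its path depends on the local build) should return an empty list and `True`.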
