sezruby opened a new pull request, #12245:
URL: https://github.com/apache/gluten/pull/12245

   **Draft.** Stacked on #12244. The diff below is what the unbundle looks like 
in pom-form; cross-distro testing (vanilla 3.5, DBR 16.4, Cloudera, 4.0/4.1) is 
still TODO and gates merge.
   
   ## What changes were proposed in this pull request?
   
   Stop bundling `arrow-memory-*` and `arrow-vector` in 
`gluten-velox-bundle`. Mark them as `scope=provided` in 
`gluten-arrow/pom.xml` and rely on Spark's own Arrow distribution at runtime 
(`$SPARK_HOME/jars/` for Spark 3.x; declared in Spark 4.x's pom).
   
   `arrow-c-data` and `arrow-dataset` stay bundled because Spark does not 
ship those.
   
   ## Why
   
   Follow-up from the #12226 discussion. Bundling and shading Arrow is the 
source of #12225 (and the similar #7423): when gluten's bundle wins classloader 
resolution, its class signatures collide with the user's vanilla Arrow. #12226 
fixed the immediate `NoSuchMethodError` by un-shading, but as @zhztheplayer 
noted, depending on "Memory and vector APIs should be stable across minor 
versions" is a real risk worth eliminating. The cleanest fix is to not ship 
them at all.
   
   Effects:
   * gluten-velox-bundle no longer contains any `org.apache.arrow.memory.*` 
or `org.apache.arrow.vector.*` classes. Class-shadowing from #12225 
disappears by construction.
   * The `org.apache.arrow` shade-relocation block in `package/pom.xml` is 
removed (nothing to relocate, since memory/vector aren't bundled and 
c-data/dataset were already excluded).
   * `arrow-c-data` / `arrow-dataset` remain bundled. With no relocation, 
their public API signatures bind to vanilla `BufferAllocator` / 
`VectorSchemaRoot`, exactly what every other Arrow C-Data caller on the 
classpath expects.
   * `backends-velox/pom.xml` re-declares `arrow-memory-core` and 
`arrow-vector` at `provided` scope so its compile classpath still resolves 
them after the gluten-arrow scope flip. `gluten-ut/*` and 
`backends-clickhouse` already declare them locally.
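
The first and third effects can be checked mechanically against a built bundle. The following is a minimal sketch, not part of this PR: it assumes the bundle is an ordinary jar (zip) file, that c-data classes live under `org/apache/arrow/c/` (as the #12226 script checks) and dataset classes under `org/apache/arrow/dataset/`, and the function names `arrow_packages_in_jar` and `check_bundle` are hypothetical.

```python
import zipfile
from collections import Counter

# Package prefixes expected to be absent from the bundle after this change.
UNBUNDLED = ("org/apache/arrow/memory/", "org/apache/arrow/vector/")
# Package prefixes expected to remain in the bundle (assumed locations).
STILL_BUNDLED = ("org/apache/arrow/c/", "org/apache/arrow/dataset/")


def arrow_packages_in_jar(jar_path):
    """Count .class entries per watched Arrow package prefix in a jar."""
    counts = Counter()
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            if not name.endswith(".class"):
                continue
            for prefix in UNBUNDLED + STILL_BUNDLED:
                if name.startswith(prefix):
                    counts[prefix] += 1
    return counts


def check_bundle(jar_path):
    """Return a list of problems; an empty list means the jar looks as expected."""
    counts = arrow_packages_in_jar(jar_path)
    problems = []
    for prefix in UNBUNDLED:
        if counts[prefix]:
            problems.append(f"{counts[prefix]} classes still bundled under {prefix}")
    for prefix in STILL_BUNDLED:
        if not counts[prefix]:
            problems.append(f"no classes found under {prefix}")
    return problems
```

Calling `check_bundle` on the path of a locally built bundle jar would return an empty list if memory/vector classes are gone and c-data/dataset classes are still present. It only inspects entry names, so it says nothing about which jar wins classloader resolution at runtime.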
   
   ## Open questions / why this is a Draft
   
   1. **`arrow.version` pin per Spark distro.** 
`<arrow.version>15.0.0</arrow.version>` matches vanilla Spark 3.5.x. DBR 16.4 
ships Spark 3.5 with Arrow 12.0.1, so gluten compiled against 15 might throw 
`NoSuchMethodError` on DBR. Need to either (a) downgrade to the lowest common 
denominator, 12.0.1, for the Spark-3.5 profile, or (b) add a DBR-specific 
profile, or (c) declare gluten as DBR-incompatible. Cloudera flavors need 
similar verification.
   2. **Cross-distro test matrix.** Want to actually run the gluten test suite 
against vanilla 3.5, DBR 16.4, Cloudera CDS, and 4.0/4.1 before merging. CI 
here only covers vanilla.
   3. **Velox C++ side still uses bundled Arrow.** The cpp side links its own 
Arrow (the C++ patches in `ep/build-velox/src/modify_arrow.patch`); this PR 
only changes JVM-side bundling. The JVM ↔ C++ exchange happens via Arrow 
C-Data's stable ABI, so the JVM-side Arrow version doesn't need to match the 
C++-side one. Worth noting in case anyone assumes the two versions should 
track each other.
   4. `dev/check-arrow-c-shading.sh` from #12226 still passes: the bundle 
still has `org/apache/arrow/c/*`, and their signatures now reference unshaded 
`memory.*` / `vector.*` types (which resolve from Spark's bundled Arrow at 
runtime).
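
For open question 1, one cheap pre-flight check is to compare the Arrow jars a distro actually ships with the version gluten was compiled against. The following is a minimal sketch under stated assumptions, not part of this PR: it assumes jars follow the plain `artifact-x.y.z.jar` naming (a vendor-suffixed version would not match the pattern and would be silently skipped), and the function names `runtime_arrow_versions` and `mismatches` are hypothetical.

```python
import re
from pathlib import Path

# Matches e.g. "arrow-vector-12.0.1.jar" or "arrow-memory-core-15.0.0.jar".
_ARROW_JAR = re.compile(r"^(arrow-(?:vector|memory-[a-z]+))-(\d+)\.(\d+)\.(\d+)\.jar$")


def runtime_arrow_versions(jars_dir):
    """Map artifact name -> (major, minor, patch) for Arrow jars in jars_dir."""
    found = {}
    for path in Path(jars_dir).iterdir():
        m = _ARROW_JAR.match(path.name)
        if m:
            found[m.group(1)] = tuple(int(g) for g in m.groups()[1:])
    return found


def mismatches(jars_dir, compiled_against):
    """Return Arrow artifacts whose runtime version differs from the compile-time pin."""
    pinned = tuple(int(p) for p in compiled_against.split("."))
    return {
        artifact: version
        for artifact, version in runtime_arrow_versions(jars_dir).items()
        if version != pinned
    }
```

Pointing `mismatches` at a distro's `jars/` directory with `"15.0.0"` as the pin would list every memory/vector artifact at a different version. A non-empty result does not prove a `NoSuchMethodError` will occur; it only flags the distros where the cross-distro test run matters most.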
   
   ## How was this patch tested?
   
   Local build only. CI green needed before un-drafting.
   
   ## Closes / refs
   
   * Follow-up from #12226 discussion (zhztheplayer + FelixYBW asked for this 
direction).
   * Removes the immediate need for #12226's `dev/check-arrow-c-shading.sh` 
once the bundle fully stops containing memory/vector classes. The script 
remains useful as a regression guard for the c-data classes still in the bundle.
   
   Stacked on:
   * #12244 — drop the \`15.0.0-gluten\` Arrow version rename

