[I] Shaded ArrowArrayStream.allocateNew signature points at gluten-shaded BufferAllocator, breaking Arrow C-Data interop [gluten]

via GitHub Tue, 02 Jun 2026 10:01:12 -0700


sezruby opened a new issue, #12225:
URL: https://github.com/apache/gluten/issues/12225


   # `ArrowArrayStream.allocateNew(BufferAllocator)` in shaded bundle takes 
wrong (relocated) parameter type, breaking interop with vanilla Apache Arrow 
callers
   
   ## Summary
   
   The gluten-velox bundle's 
`org.apache.arrow.c.ArrowArrayStream.allocateNew(BufferAllocator)` method is 
compiled to take a **gluten-internal shaded** 
`org.apache.gluten.shaded.org.apache.arrow.memory.BufferAllocator`, not the 
public `org.apache.arrow.memory.BufferAllocator`. On clusters that load the 
bundle into the JVM `AppClassLoader` (e.g. as a wildcard `extraClassPath`), it 
shadows the user's vanilla Arrow `ArrowArrayStream` class. Any caller — Lance, 
Iceberg, Snowflake JDBC, anyone using Arrow C-Data — then fails with 
`NoSuchMethodError` because their public `BufferAllocator` doesn't match 
gluten's shaded one.
   
   This is a shading-config bug in `package/pom.xml`: `org.apache.arrow.c.*` is 
excluded from relocation (correct, because Arrow's native C-Data JNI hardcodes 
those class names), but `org.apache.arrow.memory.*` is relocated. Since 
`org.apache.arrow.c.ArrowArrayStream` references 
`org.apache.arrow.memory.BufferAllocator` in its public API signatures, the 
resulting class is internally inconsistent — public class with private 
parameter types.
   
   ## Repro
   
   Minimal standalone repro using only Apache Arrow Java (no other deps):
   
   ```scala
   import org.apache.arrow.c.ArrowArrayStream
   import org.apache.arrow.memory.RootAllocator
   
   object GlutenArrowConflictRepro {
     def main(args: Array[String]): Unit = {
       val allocator = new RootAllocator(Long.MaxValue)
       val stream = ArrowArrayStream.allocateNew(allocator)  // <-- fails here
       stream.close()
       allocator.close()
     }
   }
   ```
   
   Run as a Spark application on any cluster that has gluten's bundle on the 
wildcard `extraClassPath` (e.g. IBM CP4D `spark175` engine ships gluten 
1.7.0-WXD233RC1 at 
`/opt/gluten/jars/gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.7.0-...jar`).
   
   ### Expected
   
   ```
   === Probe: where does ArrowArrayStream resolve from? ===
   ArrowArrayStream class loaded from: file:/<path-to-vanilla-arrow-c-data>.jar
   declared methods:
     public static org.apache.arrow.c.ArrowArrayStream
     org.apache.arrow.c.ArrowArrayStream.allocateNew(
       org.apache.arrow.memory.BufferAllocator)
   === Attempt ===
   OK
   DONE
   ```
   
   ### Actual
   
   ```
   === Probe: where does ArrowArrayStream resolve from? ===
   ArrowArrayStream class loaded from: file:/opt/gluten/jars/gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.7.0-...jar
   declared methods:
     public static org.apache.arrow.c.ArrowArrayStream
     org.apache.arrow.c.ArrowArrayStream.allocateNew(
       org.apache.gluten.shaded.org.apache.arrow.memory.BufferAllocator)   <-- shaded type!
   === Attempt ===
   FAILED: java.lang.NoSuchMethodError: org/apache/arrow/c/ArrowArrayStream.allocateNew(
     Lorg/apache/arrow/memory/BufferAllocator;)Lorg/apache/arrow/c/ArrowArrayStream;
     (loaded from file:/opt/gluten/jars/gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.7.0-...jar
      by jdk.internal.loader.ClassLoaders$AppClassLoader)
   ```
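   
   The "Probe" lines in the output above are not produced by the repro as 
listed. A minimal sketch of how such a probe could be written with plain 
reflection follows. The class name `ArrowClassProbe` and the exact output 
format are assumptions for illustration, not the code used for this report. It 
looks the class up by name, so it compiles and runs even when Arrow is absent.
   
   ```java
   import java.lang.reflect.Method;
   import java.security.CodeSource;
   
   // Hypothetical helper: reports which jar a class resolves from and which
   // method signatures that copy of the class actually declares.
   public class ArrowClassProbe {
   
       static String describe(String className) {
           StringBuilder out = new StringBuilder();
           try {
               // initialize=false: resolve the class without running static
               // initializers (which may try to load native libraries).
               Class<?> cls = Class.forName(
                       className, false, ArrowClassProbe.class.getClassLoader());
               CodeSource src = cls.getProtectionDomain().getCodeSource();
               out.append(className).append(" class loaded from: ")
                  .append(src == null ? "<bootstrap or unknown>"
                                      : String.valueOf(src.getLocation()))
                  .append('\n');
               out.append("class loader: ").append(cls.getClassLoader()).append('\n');
               out.append("declared methods:\n");
               for (Method m : cls.getDeclaredMethods()) {
                   // Method.toString() prints fully qualified parameter types,
                   // which is where a relocated type becomes visible.
                   out.append("  ").append(m).append('\n');
               }
           } catch (ClassNotFoundException | LinkageError e) {
               out.append(className).append(" could not be loaded: ")
                  .append(e).append('\n');
           }
           return out.toString();
       }
   
       public static void main(String[] args) {
           String name = args.length > 0
                   ? args[0] : "org.apache.arrow.c.ArrowArrayStream";
           System.out.print(describe(name));
       }
   }
   ```
   
   Running this inside the Spark driver (rather than locally) matters, because 
the result depends on the classpath of the JVM that actually fails.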
   
   ## Root cause
   
   `package/pom.xml`, around line 121-130 (same on every release since v1.0.0):
   
   ```xml
   <relocation>
     <pattern>org.apache.arrow</pattern>
     <shadedPattern>${gluten.shade.packageName}.org.apache.arrow</shadedPattern>
     <!--arrow's C and dataset wrapper refers to the original class path,
         so we should not relocate here-->
     <excludes>
       <exclude>org.apache.arrow.c.*</exclude>
       <exclude>org.apache.arrow.c.jni.*</exclude>
       <exclude>org.apache.arrow.dataset.**</exclude>
     </excludes>
   </relocation>
   ```
   
   The intent is correct (don't relocate JNI-bound classes). But the public API 
of `org.apache.arrow.c.*` returns and accepts `org.apache.arrow.memory.*` 
types. When `ArrowArrayStream` is included in the bundle without relocation, 
but `BufferAllocator` IS relocated, the shade step rewrites every reference to 
`BufferAllocator` inside the bundled `ArrowArrayStream` bytecode, so its method 
signatures end up bound to the shaded `BufferAllocator` at packaging time (the 
Arrow sources themselves are unchanged). The bundled class is then incompatible 
with anyone passing a vanilla `BufferAllocator`.
   
   The same applies to other public `org.apache.arrow.c.*` ↔ 
`org.apache.arrow.memory.*` boundary methods: 
`Data.exportArrayStream(BufferAllocator, ...)`, 
`ArrowSchema.allocateNew(BufferAllocator)`, etc.
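   
   To find every affected boundary method rather than checking them one by one, 
something like the following reflection scan could be run against the bundle. 
This is a sketch under assumptions: the class name `ShadedSignatureScanner` is 
made up, and the prefix `org.apache.gluten.shaded.` is taken from the "Actual" 
output above.
   
   ```java
   import java.lang.reflect.Method;
   import java.lang.reflect.Modifier;
   import java.util.ArrayList;
   import java.util.List;
   
   // Hypothetical helper: lists public methods whose parameter or return
   // types live under a given (relocated) package prefix.
   public class ShadedSignatureScanner {
   
       // Prefix as it appears in the failing signature in this report.
       static final String SHADED_PREFIX = "org.apache.gluten.shaded.";
   
       // Unwrap array types so that Foo[] is checked as Foo.
       static String elementTypeName(Class<?> c) {
           while (c.isArray()) {
               c = c.getComponentType();
           }
           return c.getName();
       }
   
       static List<String> leaks(Class<?> cls, String prefix) {
           List<String> found = new ArrayList<>();
           for (Method m : cls.getDeclaredMethods()) {
               if (!Modifier.isPublic(m.getModifiers())) {
                   continue;
               }
               boolean leak = elementTypeName(m.getReturnType()).startsWith(prefix);
               for (Class<?> p : m.getParameterTypes()) {
                   leak |= elementTypeName(p).startsWith(prefix);
               }
               if (leak) {
                   found.add(m.toString());
               }
           }
           return found;
       }
   
       public static void main(String[] args) {
           String[] names = args.length > 0 ? args : new String[] {
               "org.apache.arrow.c.ArrowArrayStream",
               "org.apache.arrow.c.ArrowSchema",
               "org.apache.arrow.c.Data",
           };
           for (String name : names) {
               try {
                   Class<?> cls = Class.forName(
                           name, false, ShadedSignatureScanner.class.getClassLoader());
                   List<String> found = leaks(cls, SHADED_PREFIX);
                   System.out.println(name + ": " + found.size()
                           + " public method(s) mention " + SHADED_PREFIX);
                   found.forEach(s -> System.out.println("  " + s));
               } catch (ClassNotFoundException | LinkageError e) {
                   System.out.println(name + ": could not be inspected: " + e);
               }
           }
       }
   }
   ```
   
   With the bundle first on the classpath, any non-empty list is a method that 
vanilla Arrow callers cannot link against.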
   
   ## Why it's been latent
   
   The bug has been present since v1.0.0 but only fires when:
   1. Some other code on the same JVM calls 
`ArrowArrayStream.allocateNew(BufferAllocator)` with a vanilla 
`BufferAllocator`, AND
   2. The gluten bundle's class wins resolution in the AppClassLoader (it ships 
on `extraClassPath` wildcards)
   
   Most pure-gluten workloads don't hit it because gluten's own internal 
callers always use the shaded type. The bug becomes user-facing whenever a 
Spark app pulls in another library that uses Arrow C-Data (Iceberg's Arrow 
vector layer, Lance's Java writer, Snowflake JDBC's Arrow result decoder, etc.).
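   
   Whether condition 2 holds on a given cluster can be checked by listing every 
copy of the class that the application class loader can see. The sketch below 
uses the standard `ClassLoader.getResources` call; the class name 
`DuplicateClassFinder` is hypothetical. For typical URL-based class loaders the 
first entry is the copy that wins, but that ordering is an assumption about the 
loader in use, not a guarantee.
   
   ```java
   import java.io.IOException;
   import java.io.UncheckedIOException;
   import java.net.URL;
   import java.util.ArrayList;
   import java.util.Enumeration;
   import java.util.List;
   
   // Hypothetical helper: enumerates all classpath locations that provide
   // a given class, to spot a bundled copy shadowing the vanilla one.
   public class DuplicateClassFinder {
   
       static String resourcePath(String className) {
           return className.replace('.', '/') + ".class";
       }
   
       static List<URL> locate(String className, ClassLoader loader) {
           List<URL> hits = new ArrayList<>();
           try {
               Enumeration<URL> urls = loader.getResources(resourcePath(className));
               while (urls.hasMoreElements()) {
                   hits.add(urls.nextElement());
               }
           } catch (IOException e) {
               throw new UncheckedIOException(e);
           }
           return hits;
       }
   
       public static void main(String[] args) {
           String name = args.length > 0
                   ? args[0] : "org.apache.arrow.c.ArrowArrayStream";
           List<URL> hits = locate(name, ClassLoader.getSystemClassLoader());
           System.out.println(name + ": " + hits.size() + " copy/copies visible");
           for (URL u : hits) {
               System.out.println("  " + u);
           }
       }
   }
   ```
   
   More than one hit, with the gluten bundle listed before the vanilla 
`arrow-c-data` jar, would match the situation described in this issue.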
   
   In our case (Lance Java + IBM CP4D Spark cluster), it surfaces because 
Lance's `LanceDataWriter` calls 
`ArrowArrayStream.allocateNew(LanceRuntime.allocator())` — which is the public 
`BufferAllocator` — to hand off batches to the native Lance writer.
   
   ## Proposed fix
   
   Add `org.apache.arrow.memory.**` (and possibly `org.apache.arrow.vector.**` 
for symmetry — see Discussion below) to the relocation excludes, so the bundled 
`ArrowArrayStream` references the public `BufferAllocator` and matches everyone 
else's API:
   
   ```xml
   <relocation>
     <pattern>org.apache.arrow</pattern>
     <shadedPattern>${gluten.shade.packageName}.org.apache.arrow</shadedPattern>
     <!--arrow's C and dataset wrapper refers to the original class path,
         so we should not relocate here. Their public API takes
         org.apache.arrow.memory.* and returns org.apache.arrow.vector.*,
         which therefore must also be left unshaded so the bundled C-Data
         classes match the public Apache Arrow API.-->
     <excludes>
       <exclude>org.apache.arrow.c.*</exclude>
       <exclude>org.apache.arrow.c.jni.*</exclude>
       <exclude>org.apache.arrow.memory.**</exclude>
       <exclude>org.apache.arrow.vector.**</exclude>
       <exclude>org.apache.arrow.dataset.**</exclude>
     </excludes>
   </relocation>
   ```
   
   The smaller fix (adding only `org.apache.arrow.memory.**`) addresses the 
immediate `BufferAllocator` mismatch. Adding `vector.**` is necessary if any 
C-Data method returns or accepts vector types — which 
`Data.exportVectorSchemaRoot(...)` does.
   
   ## Discussion: why not just exclude all of Arrow from relocation?
   
   Gluten relocates Arrow precisely to avoid version conflicts with the user's 
Arrow. The C-Data exclusion was a partial walk-back of that strategy because 
the JNI native code can't be relocated. The fix here is just the consistent 
extension: **anything reachable through the unshaded API surface must be 
unshaded**.
   
   This means gluten's internal users of `BufferAllocator`/`vector.*` will see 
whatever Arrow version is on the user's classpath, not gluten's bundled one. 
That's fine if gluten's compiled-against version is API-compatible with the 
user's version — Arrow Java has been ABI/API stable from 7.x through 18.x for 
the common types.
   
   If gluten needs a *specific* `BufferAllocator` API that the user's Arrow 
doesn't provide, that's a hard incompatibility and needs a separate fix (e.g., 
gluten provides its own non-conflicting class name).
   
   ## Affected files
   
   - `package/pom.xml` (one block, ~3 lines added)
   
   ## Tests
   
   Add a small standalone Java test that asserts the public method signature 
on the bundled `ArrowArrayStream`. It would live under `package/src/test/...` 
so that it runs as part of the `package` module's tests.
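   
   One possible shape for that check, written as a plain `main` so it does not 
assume a particular test framework. The class name `PublicSignatureCheck` is 
hypothetical, and parameter types are compared by name so the check itself 
never needs to load `BufferAllocator`. A real test would fail, rather than 
print and return, when the class is missing.
   
   ```java
   import java.lang.reflect.Method;
   import java.lang.reflect.Modifier;
   import java.util.Arrays;
   
   // Hypothetical check: does a class declare a public static method with
   // exactly these fully qualified parameter type names?
   public class PublicSignatureCheck {
   
       static boolean hasPublicStaticMethod(
               Class<?> cls, String name, String... paramTypeNames) {
           for (Method m : cls.getDeclaredMethods()) {
               int mod = m.getModifiers();
               if (!m.getName().equals(name)
                       || !Modifier.isPublic(mod)
                       || !Modifier.isStatic(mod)) {
                   continue;
               }
               String[] actual = Arrays.stream(m.getParameterTypes())
                       .map(Class::getName)
                       .toArray(String[]::new);
               if (Arrays.equals(actual, paramTypeNames)) {
                   return true;
               }
           }
           return false;
       }
   
       public static void main(String[] args) {
           String target = "org.apache.arrow.c.ArrowArrayStream";
           Class<?> cls;
           try {
               cls = Class.forName(
                       target, false, PublicSignatureCheck.class.getClassLoader());
           } catch (ClassNotFoundException | LinkageError e) {
               System.out.println(target + " not inspectable here: " + e);
               return;
           }
           // The unshaded name is what vanilla Arrow callers link against.
           if (!hasPublicStaticMethod(
                   cls, "allocateNew", "org.apache.arrow.memory.BufferAllocator")) {
               throw new AssertionError(
                       target + ".allocateNew does not take the public BufferAllocator");
           }
           System.out.println("OK");
       }
   }
   ```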
   
   ## Severity / urgency
   
   - Medium-high: blocks any Spark workload that combines gluten with another 
library using Arrow C-Data
   - Has been latent for ~2 years (since v1.0.0); affecting users now as more 
libraries adopt Arrow C-Data for native interop
   - Workaround possible per-app (re-shading at fat-jar level, classpath 
ordering tricks) but fragile and doesn't help libraries that need to work 
without app-level fat-jar control
   
   ## References
   
   - Detailed analysis (with classloader debugging): 
https://github.com/sezruby/lance-spark/blob/knn-external-index/docs/cpd-gluten-arrow-conflict.md
   - Same bug visible since v1.0.0: confirmed by checking `package/pom.xml` on 
tags v1.0.0, v1.1.0, v1.2.0, v1.3.0, v1.4.0, v1.5.0, v1.6.0, main
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

