sezruby opened a new issue, #12225:
URL: https://github.com/apache/gluten/issues/12225
# `ArrowArrayStream.allocateNew(BufferAllocator)` in shaded bundle takes wrong (relocated) parameter type, breaking interop with vanilla Apache Arrow callers
## Summary
The gluten-velox bundle's
`org.apache.arrow.c.ArrowArrayStream.allocateNew(BufferAllocator)` method is
compiled to take a **gluten-internal shaded**
`org.apache.gluten.shaded.org.apache.arrow.memory.BufferAllocator`, not the
public `org.apache.arrow.memory.BufferAllocator`. On clusters that load the
bundle into the JVM `AppClassLoader` (e.g. as a wildcard `extraClassPath`), it
shadows the user's vanilla Arrow `ArrowArrayStream` class. Any caller — Lance,
Iceberg, Snowflake JDBC, anyone using Arrow C-Data — then fails with
`NoSuchMethodError` because their public `BufferAllocator` doesn't match
gluten's shaded one.
This is a shading-config bug in `package/pom.xml`: `org.apache.arrow.c.*` is
excluded from relocation (correct, because Arrow's native C-Data JNI hardcodes
those class names), but `org.apache.arrow.memory.*` is relocated. Since
`org.apache.arrow.c.ArrowArrayStream` references
`org.apache.arrow.memory.BufferAllocator` in its public API signatures, the
resulting class is internally inconsistent — public class with private
parameter types.
## Repro
Minimal standalone repro using only Apache Arrow Java (no other deps):
```scala
import org.apache.arrow.c.ArrowArrayStream
import org.apache.arrow.memory.RootAllocator
object GlutenArrowConflictRepro {
  def main(args: Array[String]): Unit = {
    val allocator = new RootAllocator(Long.MaxValue)
    val stream = ArrowArrayStream.allocateNew(allocator) // <-- fails here
    stream.close()
    allocator.close()
  }
}
```
Run as a Spark application on any cluster that has gluten's bundle on the
wildcard `extraClassPath` (e.g. IBM CP4D `spark175` engine ships gluten
1.7.0-WXD233RC1 at
`/opt/gluten/jars/gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.7.0-...jar`).
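The Expected and Actual output includes probe lines ("class loaded from", "declared methods") that the Scala snippet above does not print itself. A minimal Java sketch of such a probe follows; the class name `ClassOriginProbe` and its helper methods are hypothetical and are not taken from the original repro. It uses only JDK reflection, so it compiles without Arrow and reports "not found" when the class is absent.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

// Hypothetical probe: reports which jar a class resolves from and the
// exact signatures of its methods with a given name.
public class ClassOriginProbe {

    /** Jar or directory the class was loaded from; "<bootstrap>" for JDK classes. */
    public static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        return (src == null || src.getLocation() == null)
                ? "<bootstrap>"
                : src.getLocation().toString();
    }

    /** Describes where className resolves from and lists its methods named methodName. */
    public static String describe(String className, String methodName) {
        StringBuilder out = new StringBuilder();
        try {
            Class<?> cls = Class.forName(className);
            out.append(className).append(" class loaded from: ")
               .append(locationOf(cls)).append('\n');
            out.append("declared methods:\n");
            for (Method m : cls.getDeclaredMethods()) {
                if (m.getName().equals(methodName)) {
                    // Method.toString() prints fully qualified parameter types,
                    // which is what exposes a relocated BufferAllocator.
                    out.append("  ").append(m).append('\n');
                }
            }
        } catch (ClassNotFoundException e) {
            out.append(className).append(" not found on classpath\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe("org.apache.arrow.c.ArrowArrayStream", "allocateNew"));
    }
}
```

Running `main` on the driver or an executor with the same classpath as the failing job should show whether the bundle or the vanilla Arrow jar wins resolution.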
### Expected
```
=== Probe: where does ArrowArrayStream resolve from? ===
ArrowArrayStream class loaded from: file:/<path-to-vanilla-arrow-c-data>.jar
declared methods:
public static org.apache.arrow.c.ArrowArrayStream
org.apache.arrow.c.ArrowArrayStream.allocateNew(org.apache.arrow.memory.BufferAllocator)
=== Attempt ===
OK
DONE
```
### Actual
```
=== Probe: where does ArrowArrayStream resolve from? ===
ArrowArrayStream class loaded from:
file:/opt/gluten/jars/gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.7.0-...jar
declared methods:
public static org.apache.arrow.c.ArrowArrayStream
org.apache.arrow.c.ArrowArrayStream.allocateNew(
org.apache.gluten.shaded.org.apache.arrow.memory.BufferAllocator) <--
shaded type!
=== Attempt ===
FAILED: java.lang.NoSuchMethodError:
org/apache/arrow/c/ArrowArrayStream.allocateNew(
Lorg/apache/arrow/memory/BufferAllocator;)Lorg/apache/arrow/c/ArrowArrayStream;
(loaded from
file:/opt/gluten/jars/gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.7.0-...jar
by jdk.internal.loader.ClassLoaders$AppClassLoader)
```
## Root cause
`package/pom.xml`, around lines 121-130 (same on every release since v1.0.0):
```xml
<relocation>
<pattern>org.apache.arrow</pattern>
<shadedPattern>${gluten.shade.packageName}.org.apache.arrow</shadedPattern>
<!--arrow's C and dataset wrapper refers to the original class path,
so we should not relocate here-->
<excludes>
<exclude>org.apache.arrow.c.*</exclude>
<exclude>org.apache.arrow.c.jni.*</exclude>
<exclude>org.apache.arrow.dataset.**</exclude>
</excludes>
</relocation>
```
The intent is correct (don't relocate JNI-bound classes). But the public API
of `org.apache.arrow.c.*` returns and accepts `org.apache.arrow.memory.*`
types. When `ArrowArrayStream` is included in the bundle without relocation
but `BufferAllocator` IS relocated, the shade step rewrites the type
references inside the bundled `ArrowArrayStream` bytecode at package time, so
its method signatures end up bound to the shaded `BufferAllocator`. The JVM
resolves methods by name plus exact descriptor, so the bundled class is then
incompatible with any caller passing a vanilla `BufferAllocator`.
The same applies to other public `org.apache.arrow.c.*` ↔
`org.apache.arrow.memory.*` boundary methods:
`Data.exportArrayStream(BufferAllocator, ...)`,
`ArrowSchema.allocateNew(BufferAllocator)`, etc.
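To make the mismatch concrete, here is a small Java sketch that applies a simplified model of the relocation rule to a JVM method descriptor. It is not the shade plugin's actual matching code: the class `RelocationSketch` is hypothetical, and the excludes are approximated as plain package prefixes. The shaded prefix is the one visible in the Actual output above.

```java
import java.util.List;

// Simplified, hypothetical model of the relocation rule in package/pom.xml.
public class RelocationSketch {
    static final String PATTERN = "org/apache/arrow/";
    static final String SHADED = "org/apache/gluten/shaded/org/apache/arrow/";
    // Assumption: each <exclude> is approximated as a package prefix.
    static final List<String> EXCLUDED_PREFIXES =
            List.of("org/apache/arrow/c/", "org/apache/arrow/dataset/");

    /** Relocates one internal class name unless it falls under an exclude. */
    public static String relocate(String internalName) {
        if (!internalName.startsWith(PATTERN)) {
            return internalName;
        }
        for (String excluded : EXCLUDED_PREFIXES) {
            if (internalName.startsWith(excluded)) {
                return internalName;
            }
        }
        return SHADED + internalName.substring(PATTERN.length());
    }

    /** Rewrites every "L<class>;" reference inside a JVM method descriptor. */
    public static String relocateDescriptor(String descriptor) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < descriptor.length()) {
            char ch = descriptor.charAt(i);
            if (ch == 'L') {
                int end = descriptor.indexOf(';', i);
                out.append('L')
                   .append(relocate(descriptor.substring(i + 1, end)))
                   .append(';');
                i = end + 1;
            } else {
                out.append(ch);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Descriptor a vanilla caller links against (taken from the NoSuchMethodError).
        String wanted =
                "(Lorg/apache/arrow/memory/BufferAllocator;)Lorg/apache/arrow/c/ArrowArrayStream;";
        String bundled = relocateDescriptor(wanted);
        System.out.println("caller wants: " + wanted);
        System.out.println("bundle has:   " + bundled);
        System.out.println("match: " + wanted.equals(bundled));
    }
}
```

Under this model the return type stays `org/apache/arrow/c/ArrowArrayStream` while the parameter type moves under the shaded prefix, so the two descriptors differ and the lookup fails.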
## Why it's been latent
The bug has been present since v1.0.0 but only fires when:
1. Some other code on the same JVM calls
`ArrowArrayStream.allocateNew(BufferAllocator)` with a vanilla
`BufferAllocator`, AND
2. The gluten bundle's class wins resolution in the AppClassLoader (it ships
on `extraClassPath` wildcards)
Most pure-gluten workloads don't hit it because gluten's own internal
callers always use the shaded type. The bug becomes user-facing whenever a
Spark app pulls in another library that uses Arrow C-Data (Iceberg's Arrow
vector layer, Lance's Java writer, Snowflake JDBC's Arrow result decoder, etc.).
In our case (Lance Java + IBM CP4D Spark cluster), it surfaces because
Lance's `LanceDataWriter` calls
`ArrowArrayStream.allocateNew(LanceRuntime.allocator())` — which is the public
`BufferAllocator` — to hand off batches to the native Lance writer.
## Proposed fix
Add `org.apache.arrow.memory.**` (and possibly `org.apache.arrow.vector.**`
for symmetry — see Discussion below) to the relocation excludes, so the bundled
`ArrowArrayStream` references the public `BufferAllocator` and matches everyone
else's API:
```xml
<relocation>
<pattern>org.apache.arrow</pattern>
<shadedPattern>${gluten.shade.packageName}.org.apache.arrow</shadedPattern>
<!--arrow's C and dataset wrapper refers to the original class path,
so we should not relocate here. Their public API takes
org.apache.arrow.memory.* and returns org.apache.arrow.vector.*,
which therefore must also be left unshaded so the bundled C-Data
classes match the public Apache Arrow API.-->
<excludes>
<exclude>org.apache.arrow.c.*</exclude>
<exclude>org.apache.arrow.c.jni.*</exclude>
<exclude>org.apache.arrow.memory.**</exclude>
<exclude>org.apache.arrow.vector.**</exclude>
<exclude>org.apache.arrow.dataset.**</exclude>
</excludes>
</relocation>
```
The smaller fix (adding only `org.apache.arrow.memory.**`) addresses the
immediate `BufferAllocator` mismatch. Adding `vector.**` is necessary if any
C-Data method returns or accepts vector types — which
`Data.exportVectorSchemaRoot(...)` does.
## Discussion: why not just exclude `arrow.dataset` and the rest of arrow?
Gluten relocates Arrow precisely to avoid version conflicts with the user's
Arrow. The C-Data exclusion was a partial walk-back of that strategy because
the JNI native code can't be relocated. The fix here is just the consistent
extension: **anything reachable through the unshaded API surface must be
unshaded**.
This means gluten's internal users of `BufferAllocator`/`vector.*` will see
whatever Arrow version is on the user's classpath, not gluten's bundled one.
That's fine if gluten's compiled-against version is API-compatible with the
user's version — Arrow Java has been ABI/API stable from 7.x through 18.x for
the common types.
If gluten needs a *specific* `BufferAllocator` API that the user's Arrow
doesn't provide, that's a hard incompatibility and needs a separate fix (e.g.,
gluten provides its own non-conflicting class name).
## Affected files
- `package/pom.xml` (one block, ~3 lines added)
## Tests
Proposed: add a small standalone Java test that asserts the public method
signature on the bundled `ArrowArrayStream`. It would live under
`package/src/test/...` so that it runs as part of the `package` module's tests.
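A rough sketch of what that signature assertion could look like, using plain reflection with no test framework. The class name `BundledSignatureCheck` and its helper are hypothetical; a real test would presumably use whatever framework the `package` module already uses. Parameter types are compared by name, so the check does not need the vanilla `BufferAllocator` class on the test classpath.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Arrays;

// Hypothetical check: does the class expose a public static method whose
// parameter types have exactly the expected fully qualified names?
public class BundledSignatureCheck {

    public static boolean hasPublicStaticMethod(
            String className, String methodName, String... paramTypeNames) {
        Class<?> cls;
        try {
            cls = Class.forName(className);
        } catch (ClassNotFoundException e) {
            return false; // class missing entirely counts as a failure
        }
        for (Method m : cls.getDeclaredMethods()) {
            if (!m.getName().equals(methodName)) {
                continue;
            }
            int mods = m.getModifiers();
            if (!Modifier.isPublic(mods) || !Modifier.isStatic(mods)) {
                continue;
            }
            String[] actual = Arrays.stream(m.getParameterTypes())
                    .map(Class::getName)
                    .toArray(String[]::new);
            if (Arrays.equals(actual, paramTypeNames)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        boolean ok = hasPublicStaticMethod(
                "org.apache.arrow.c.ArrowArrayStream",
                "allocateNew",
                "org.apache.arrow.memory.BufferAllocator");
        if (!ok) {
            throw new AssertionError(
                    "ArrowArrayStream.allocateNew does not take the public BufferAllocator");
        }
        System.out.println("OK");
    }
}
```

Run against the bundle jar alone, this should fail with the current relocation config and pass once `org.apache.arrow.memory.**` is excluded.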
## Severity / urgency
- Medium-high: blocks any Spark workload that combines gluten with another
library using Arrow C-Data
- Has been latent for ~2 years (since v1.0.0); affecting users now as more
libraries adopt Arrow C-Data for native interop
- Workaround possible per-app (re-shading at fat-jar level, classpath
ordering tricks) but fragile and doesn't help libraries that need to work
without app-level fat-jar control
## References
- Detailed analysis (with classloader debugging):
https://github.com/sezruby/lance-spark/blob/knn-external-index/docs/cpd-gluten-arrow-conflict.md
- Same bug visible since v1.0.0: confirmed by checking `package/pom.xml` on
tags v1.0.0, v1.1.0, v1.2.0, v1.3.0, v1.4.0, v1.5.0, v1.6.0, main