sezruby commented on PR #12226:
URL: https://github.com/apache/gluten/pull/12226#issuecomment-4615112853

   > can we completely remove Arrow from the bundled Gluten Jar, and let user 
rely on Spark's bundled Arrow instead?  I assume we don't have any customized 
Arrow code now: #12130
   
   Worth doing, but a couple of things worth confirming first:
   
   #12130 removed the dead Arrow-CSV / Arrow-Dataset JVM paths. That's 
progress, but there are still ~34 files under `gluten-arrow/` and elsewhere 
that call `org.apache.arrow.*` (`ColumnarBatches`, `ArrowColumnVector`, 
`ArrowWritableColumnVector`, `ColumnarBatchSerializer`, JNI columnar batch 
bridges, `SparkArrowUtil`, etc.). Those code paths are alive on the Velox hot 
path and the bundled Arrow is what they currently resolve against.
   
   If we drop the bundled Arrow:
   
   1. **`arrow.version` in gluten's poms** would need to align with whatever 
the lowest-common-denominator target Spark ships — Arrow 12 for Spark 3.5 (DBR 
16.4 ships 12.0.1, vanilla Spark 3.5.x ships 15.0.0), Arrow 18 for Spark 4.x. 
Compile profile per Spark version, similar to how `<spark.version>` is already 
keyed.
   2. **No `15.0.0-gluten` source patches relied on** — needs confirmation. The 
`arrow-gluten.version = 15.0.0-gluten` in `pom.xml` suggests there's at least 
*some* customization, even if the dead code in #12130 was the main consumer. 
Worth a sweep for any remaining call site that depends on a non-public Arrow 
API.
   3. **Testing matrix** — minor: confirm no `NoSuchMethodError` on the Spark 
distros gluten claims support for, especially DBR / Cloudera flavors that ship 
older Arrow.
   
   So I think it's the right long-term direction. As an immediate fix this PR 
makes the current shading approach internally consistent (which is 
independently a valid bug fix, since the partial-shading is a latent bug 
regardless of whether Arrow gets unbundled later). Happy to take either path — 
let me know if you'd prefer I close this and pursue the Arrow-unbundling work 
instead, or merge this as the short-term fix and treat unbundling as a 
follow-up.
   
   > a Scala type mismatch caused by the Maven Shade Plugin not rewriting 
ScalaSignature annotations
   
   @philo-he good link — yes, the partial-shading also breaks downstream Scala 
consumers that pull the gluten jar as a dep, since their compile-time Arrow 
type doesn't match what gluten expects across the API boundary. This PR fixes 
that as a side effect by keeping the boundary types unshaded so they match the 
public Apache Arrow types every other Scala consumer compiles against.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to