Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/15821
## Dependency Info
This change adds Apache Arrow as a dependency, specifically the Java
arrow-vector artifact. For Python, usage is optional and tests are conditional
on the ability to import pyarrow. The Java Arrow dependency tree can be found
[here](https://github.com/apache/spark/files/798133/arrow-vector_deptree.txt).
It is relatively small, but does include Netty 4.0.41 (Spark currently uses
netty-all 4.0.42, and the two do not conflict).
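
To illustrate what "conditional on the ability to import pyarrow" can look
like, here is a minimal sketch using only the standard `unittest` module. The
names `HAVE_PYARROW`, `ArrowOptionalTests`, and `run_suite` are hypothetical
and are not taken from the PR; the actual Spark tests may detect pyarrow
differently.

```python
import importlib.util
import io
import unittest

# Assumption: pyarrow counts as "available" if its module spec can be found.
HAVE_PYARROW = importlib.util.find_spec("pyarrow") is not None


@unittest.skipIf(not HAVE_PYARROW, "pyarrow is not installed")
class ArrowOptionalTests(unittest.TestCase):
    """Hypothetical test case that only runs when pyarrow is importable."""

    def test_array_length(self):
        # Imported inside the test so the module loads without pyarrow.
        import pyarrow as pa

        arr = pa.array([1, 2, 3])
        self.assertEqual(len(arr), 3)


def run_suite():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ArrowOptionalTests)
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    return runner.run(suite)
```

Without pyarrow installed the test is reported as skipped rather than failed,
so the optional dependency never breaks the Python test run.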
Changes to Spark APIs have been kept to a minimum, and all Arrow classes
have been encapsulated within `o.a.s.sql.ArrowConverters`. On the Scala side,
a package-private method `toArrowPayloadBytes` has been added to perform the
conversion to an Arrow 'payload' on the executor JVM. This would also allow
other uses of the conversion, for instance with R. On the Python side, the
additions are a method `collectAsArrow`, which collects the Arrow payload and
serves it to Python, and a flag on `toPandas` that, when enabled, makes use of
`collectAsArrow`.
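
As a rough illustration of the Python-side idea, the sketch below serializes
columns into Arrow bytes and turns those bytes into a pandas DataFrame using
pyarrow alone. It does not use Spark, and it is not the code in this PR: the
function names `columns_to_arrow_payload` and `payload_to_pandas` are
hypothetical, and the use of the Arrow IPC stream format for the 'payload' is
an assumption.

```python
def columns_to_arrow_payload(names, columns):
    """Serialize columns into Arrow IPC stream bytes.

    Assumption: the 'payload' is modeled here as a single record batch in
    the Arrow IPC stream format; the PR may use a different layout.
    """
    # Imported lazily because pyarrow is an optional dependency.
    import pyarrow as pa

    batch = pa.RecordBatch.from_arrays(
        [pa.array(col) for col in columns], names=names
    )
    sink = pa.BufferOutputStream()
    writer = pa.ipc.new_stream(sink, batch.schema)
    writer.write_batch(batch)
    writer.close()
    return sink.getvalue().to_pybytes()


def payload_to_pandas(payload):
    """Read Arrow IPC stream bytes and convert them to a pandas DataFrame."""
    import pyarrow as pa

    table = pa.ipc.open_stream(payload).read_all()
    return table.to_pandas()
```

For example, `payload_to_pandas(columns_to_arrow_payload(["id"], [[1, 2, 3]]))`
should give a DataFrame with a single `id` column, provided both pyarrow and
pandas are installed.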
I know Spark has been burned by other dependencies before, like file
formats, so I'll just point out how this is different. Unlike a file format,
Arrow is an in-memory format and is not meant to persist on disk, so many
issues that might arise when choosing a file format are not applicable here. A
great deal of care has gone in up front to define the Arrow
[format](https://github.com/apache/arrow/tree/master/format) so that it can
remain as stable as possible. I have also heard from the Arrow community that
they are fully committed to ensuring success in projects like Spark and to
meeting compatibility needs. I can also attest from first-hand experience that
they have been incredibly responsive to issues related to this PR.