Wes McKinney commented on ARROW-288:

hi [~freiss] and [~jlaskowski] -- we've made pretty big progress on the C++ side 
and are now much closer to full interoperability with the Arrow Java library. 
We still need to do some integration testing, but it would be great to start 
exploring the technical plan for making this happen. I was just talking with 
[~rxin] about this the other day, so there may be someone on the Spark side who 
could help with this effort, too. 

The first step is to convert a Spark Dataset into one or more Arrow record 
batches, including metadata conversion, and then convert back. The Java <-> 
C++ data movement itself is a comparatively minor task because that is just 
sending a serialized byte buffer through the existing protocol. We can test 
this out in Python using the Arrow <-> pandas bridge, which has already been 
implemented.
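
For concreteness, here is a rough sketch of the kind of round-trip check I 
have in mind (in Python, assuming pyarrow's current pandas bridge and IPC 
stream APIs, so the exact calls are an assumption and nothing here is wired 
up for Spark yet): pandas -> Arrow record batch -> serialized byte buffer -> 
Arrow -> pandas.

{code:python}
import pandas as pd
import pyarrow as pa

# Hypothetical round-trip check; the pyarrow calls below are assumptions
# based on the current Python bindings, not Spark-specific APIs.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "value": [0.5, 1.5, 2.5],
    "label": ["a", "b", "c"],
})

# pandas -> Arrow (the Arrow <-> pandas bridge mentioned above)
batch = pa.RecordBatch.from_pandas(df)

# Serialize the record batch to a byte buffer, which is essentially what
# the Java side would ship through the existing protocol.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
buf = sink.getvalue()

# Read the buffer back and convert to pandas to verify the round trip.
with pa.ipc.open_stream(buf) as reader:
    table = reader.read_all()

pd.testing.assert_frame_equal(df, table.to_pandas())
{code}

A Spark-side adapter would do the analogous thing on the JVM: build Arrow 
record batches from a Dataset's rows plus a schema conversion, and write them 
through the same stream format so the C++ or Python side can read them.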

Let me know if anyone will have the bandwidth to work on this and we can 
coordinate. thanks!

> Implement Arrow adapter for Spark Datasets
> ------------------------------------------
>                 Key: ARROW-288
>                 URL: https://issues.apache.org/jira/browse/ARROW-288
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Java - Vectors
>            Reporter: Wes McKinney
> It would be valuable for applications that use Arrow to be able to: 
> * Convert between Spark DataFrames/Datasets and Java Arrow vectors
> * Send / Receive Arrow record batches / Arrow file format RPCs to / from Spark 
> * Allow PySpark to use Arrow for messaging in UDF evaluation
