Until we have integration tests proving otherwise, you can assume that
the IPC wire/file formats are currently incompatible (not on purpose).
Both Julien and I are working on corresponding efforts in Java and C++
this week (e.g. see https://github.com/apache/arrow/pull/201 and
ARROW-373), but it's going to be a little while until they're completed.
We would certainly welcome help (including code review). This is a
necessary step toward moving forward with any hybrid Java/C++ Arrow
applications.

Also, can you please update SPARK-13534 when you have a WIP branch, or
otherwise note there that you've started work on the project (the issue
isn't assigned to anyone)? I've also been looking into this with my
colleagues at Two Sigma and want to make sure we don't duplicate
efforts. The general approach you've described makes sense to me.

Thanks
Wes

On Tue, Nov 8, 2016 at 6:58 PM, Bryan Cutler <[email protected]> wrote:
> Hi Devs,
>
> I'm currently working on SPARK-13534 to use Arrow in Spark DataFrame
> toPandas conversion, and I'm getting stuck on an invalid metadata size error
> when trying to send a simple ArrowRecordBatch created in Java over a socket
> to Python.  The strategy so far is like this:
>
> Java side:
> - make a simple ArrowRecordBatch (1 field of Ints)
> - create an ArrowWriter using a ByteArrayOutputStream
> - call writer.writeRecordBatch() and writer.close() to write to a ByteArray
> - send the ByteArray (framed with size) over a socket
>
> Python side:
> - read the ByteArray over the socket
> - create an ArrowFileReader with the read bytes
> - call reader.get_record_batch(0) to convert the bytes to a RecordBatch
>
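> Roughly, the Python receiving side looks like this. The framing assumes a
> 4-byte big-endian length prefix (which is what Java's DataOutputStream.writeInt
> produces); the ArrowFileReader / get_record_batch calls at the end are just the
> names I used above and may not match the real pyarrow API, so they're left as
> comments, and the host/port are arbitrary:
>
>     import socket
>     import struct
>
>     def read_frame(sock):
>         # Read one length-prefixed frame: 4-byte big-endian size, then payload.
>         def read_exact(n):
>             buf = b''
>             while len(buf) < n:
>                 chunk = sock.recv(n - len(buf))
>                 if not chunk:
>                     raise EOFError('socket closed mid-frame')
>                 buf += chunk
>             return buf
>         (size,) = struct.unpack('>i', read_exact(4))
>         return read_exact(size)
>
>     sock = socket.create_connection(('localhost', 9000))
>     data = read_frame(sock)
>     # Hand the complete byte blob to the Arrow file reader, e.g.:
>     # reader = ArrowFileReader(data)
>     # batch = reader.get_record_batch(0)
>
> If the framing is right, len(data) here should match the 370 bytes the writer
> reports below, which is a quick way to rule out the socket layer before
> blaming the reader.
>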
> This results in "pyarrow.error.ArrowException: Invalid: metadata size
> invalid" and debugging shows the ArrowFileReader getting a metadata size of
> 269422093.  This is obviously way off, but it does seem to read some things
> correctly like number of batches and offset.  Here is some debug output
>
> ArrowWriter.java log output
> 16/11/07 12:04:18 DEBUG ArrowWriter: magic written, now at 6
> 16/11/07 12:04:18 DEBUG ArrowWriter: RecordBatch at 8, metadata: 104, body: 24
> 16/11/07 12:04:18 DEBUG ArrowWriter: Footer starts at 136, length: 224
> 16/11/07 12:04:18 DEBUG ArrowWriter: magic written, now at 370
>
> Arrow-cpp printouts
> read length 370
> num batches 1
> metadata size 269422093
> offset 136
>
> From what I can tell by looking through the code, it seems like Java uses
> Flatbuffers to write the metadata, but I don't see the Cython side using it
> to read back the metadata.
>
> Should this work with the classes I'm using on both sides, or am I way off
> with the above strategy?  I made a simplified example that mimics the Spark
> integration and reproduces the error; here is the gist:
> https://gist.github.com/BryanCutler/930ecd1de1c6d9484931505dcb6bb321
>
> Sorry for the barrage of info, but any help would be much appreciated!
>
> Thanks,
> Bryan
