Andrew,

It seems like it fails to read the schema; it hasn't even reached the data part yet. Can you share your reader/writer code?
On Tue, Sep 26, 2017 at 6:37 PM, Andrew Pham (BLOOMBERG/ 731 LEX) <[email protected]> wrote:
> Hello there, I've written something that behaves similarly to:
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala#L73
>
> Except that, for proof-of-concept purposes, it transforms Java objects
> with data into a byte[] payload. The ArrowFileWriter log statements
> indicate that data is getting written to the output stream:
>
> 17:53:16.759 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 6
> 17:53:16.759 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 2
> 17:53:16.766 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 4
> 17:53:16.766 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 288
> 17:53:16.766 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 4
> 17:53:16.769 [main] DEBUG org.apache.arrow.vector.schema.ArrowRecordBatch - Buffer in RecordBatch at 0, length: 1
> 17:53:16.769 [main] DEBUG org.apache.arrow.vector.schema.ArrowRecordBatch - Buffer in RecordBatch at 8, length: 24
> 17:53:16.770 [main] DEBUG org.apache.arrow.vector.schema.ArrowRecordBatch - Buffer in RecordBatch at 32, length: 1
> 17:53:16.770 [main] DEBUG org.apache.arrow.vector.schema.ArrowRecordBatch - Buffer in RecordBatch at 40, length: 12
> 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 4
> 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 216
> 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 4
> 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 1
> 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 7
> 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 24
> 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 1
> 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 7
> 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 12
> 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 4
> 17:53:16.772 [main] DEBUG org.apache.arrow.vector.file.ArrowWriter - RecordBatch at 304, metadata: 224, body: 56
>
> However, when I wrap that payload into a ByteArrayReadableSeekableByteChannel
> and use ArrowFileReader (along with a BufferAllocator) to read it,
> ArrowFileReader complains that it's reading an invalid format, right at
> the point where I call reader.getVectorSchemaRoot():
>
> Exception in thread "main" org.apache.arrow.vector.file.InvalidArrowFileException: missing Magic number [0, 0, 42, 0, 0, 0, 0, 0, 0, 0]
>     at org.apache.arrow.vector.file.ArrowFileReader.readSchema(ArrowFileReader.java:66)
>     at org.apache.arrow.vector.file.ArrowFileReader.readSchema(ArrowFileReader.java:37)
>     at org.apache.arrow.vector.file.ArrowReader.initialize(ArrowReader.java:162)
>     at org.apache.arrow.vector.file.ArrowReader.ensureInitialized(ArrowReader.java:153)
>     at org.apache.arrow.vector.file.ArrowReader.getVectorSchemaRoot(ArrowReader.java:67)
>     at com.bloomberg.andrew.sql.execution.arrow.ArrowConverters.byteArrayToBatch(ArrowConverters.java:89)
>     at com.bloomberg.andrew.sql.execution.arrow.ArrowPayload.loadBatch(ArrowPayload.java:18)
>     at com.bloomberg.andrew.test.arrow.ArrowPublisher.main(ArrowPublisher.java:28)
>
> I'm noticing that the number 42 is exactly the value of the very last
> field/member of the very last object in our list (or, equivalently, the
> very last column of the very last row of our table), and this holds
> across a bunch of different cases I've tried. Clearly, I'm writing stuff
> to the output stream... but any ideas as to why ArrowReader is
> struggling? There are some ideas regarding big-endian/little-endian
> behavior, but I'm not sure whether that was ever addressed. Thanks!
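For what it's worth, one common cause of this exact symptom is capturing the byte[] before the writer has been finished. The Arrow file format ends with a footer followed by a trailing magic number, and ArrowFileReader.readSchema() locates the schema through that footer at the *end* of the payload; if writer.end()/writer.close() is never reached before out.toByteArray(), the stream ends with raw record-batch data, which would explain why the reader finds your last column value (42) where the magic bytes should be. Below is a minimal round-trip sketch against the 0.7-era Java API (the org.apache.arrow.vector.file package from your stack trace; these classes moved in later releases). It's a sketch under assumptions, not your code: the single nullable int column is made up, and vector population is omitted since that API differs between versions.

import java.io.ByteArrayOutputStream;
import java.nio.channels.Channels;
import java.util.Collections;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.file.ArrowFileReader;
import org.apache.arrow.vector.file.ArrowFileWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;
import org.apache.arrow.vector.util.ByteArrayReadableSeekableByteChannel;

public class ArrowRoundTripSketch {
  public static void main(String[] args) throws Exception {
    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);

    // Made-up schema with one nullable 32-bit int column; substitute your own.
    Schema schema = new Schema(Collections.singletonList(
        new Field("f0", true, new ArrowType.Int(32, true), null)));
    VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
    // Real code would populate the vectors here; an empty batch is
    // enough to demonstrate the file framing.
    root.setRowCount(0);

    // --- write side ---
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    ArrowFileWriter writer =
        new ArrowFileWriter(root, null, Channels.newChannel(out));
    writer.start();
    writer.writeBatch();
    // end() writes the footer and the trailing magic number. A byte[]
    // captured before this point ends with raw batch data, matching the
    // "missing Magic number" error on the read side.
    writer.end();
    writer.close();
    byte[] payload = out.toByteArray(); // only now a complete Arrow file

    // --- read side ---
    ArrowFileReader reader = new ArrowFileReader(
        new ByteArrayReadableSeekableByteChannel(payload), allocator);
    // getVectorSchemaRoot() triggers readSchema(), which starts from the
    // footer at the end of the payload -- this is the call that was
    // throwing InvalidArrowFileException.
    VectorSchemaRoot readRoot = reader.getVectorSchemaRoot();
    while (reader.loadNextBatch()) {
      System.out.println("rows: " + readRoot.getRowCount());
    }
    reader.close();
    root.close();
    allocator.close();
  }
}

On the endianness question: the magic-number check is a raw byte comparison against the ASCII string "ARROW1", so byte order shouldn't come into play at this stage.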
