Also for reference, this is apparently the Arrow schema used by the ArrowFileWriter
to write to the output stream (as given by root.getSchema().toString() and
root.getSchema().toJson()); a rough construction sketch follows the JSON below:
Schema<price: FloatingPoint(DOUBLE), numShares: Int(32, true)>
{
"fields" : [ {
"name" : "price",
"nullable" : true,
"type" : {
"name" : "floatingpoint",
"precision" : "DOUBLE"
},
"children" : [ ],
"typeLayout" : {
"vectors" : [ {
"type" : "VALIDITY",
"typeBitWidth" : 1
}, {
"type" : "DATA",
"typeBitWidth" : 64
} ]
}
}, {
"name" : "numShares",
"nullable" : true,
"type" : {
"name" : "int",
"bitWidth" : 32,
"isSigned" : true
},
"children" : [ ],
"typeLayout" : {
"vectors" : [ {
"type" : "VALIDITY",
"typeBitWidth" : 1
}, {
"type" : "DATA",
"typeBitWidth" : 32
} ]
}
} ]
}
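Roughly, an equivalent schema could be built like this with the Arrow Java API
(a sketch only, not our exact code; written against the 0.7-era API, so class
and package names may differ between versions):

import java.util.Arrays;
import java.util.Collections;
import org.apache.arrow.vector.types.FloatingPointPrecision;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

public class SchemaSketch {
  public static void main(String[] args) {
    // Two nullable fields matching the printed schema:
    //   price: FloatingPoint(DOUBLE), numShares: Int(32, true)
    Field price = new Field("price",
        FieldType.nullable(new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE)),
        Collections.<Field>emptyList());
    Field numShares = new Field("numShares",
        FieldType.nullable(new ArrowType.Int(32, /* signed */ true)),
        Collections.<Field>emptyList());
    Schema schema = new Schema(Arrays.asList(price, numShares));
    System.out.println(schema);          // toString() form shown above
    System.out.println(schema.toJson()); // JSON form shown above
  }
}
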
Given our bytes (wrapped in a SeekableByteChannel), the reader is unable to
obtain the schema from them. Any ideas as to what could be happening? Cheers!
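P.S. One thing I still want to rule out on our side is whether the footer and
trailing magic actually made it into the byte[]. A quick sanity check along
these lines (a sketch only; the Arrow file format starts and ends with the
ASCII magic "ARROW1", and payload here stands for our byte[]):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

static void checkMagic(byte[] payload) {
  byte[] magic = "ARROW1".getBytes(StandardCharsets.US_ASCII);
  byte[] head = Arrays.copyOfRange(payload, 0, magic.length);
  byte[] tail = Arrays.copyOfRange(payload, payload.length - magic.length, payload.length);
  System.out.println("leading magic present:  " + Arrays.equals(head, magic));
  System.out.println("trailing magic present: " + Arrays.equals(tail, magic));
  // If the trailing magic is missing, the footer was never written to the
  // stream, e.g. the byte[] was captured before ArrowFileWriter.end() ran.
}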
From: [email protected] At: 09/26/17 18:59:18 To: Andrew Pham (BLOOMBERG/
731 LEX), [email protected]
Subject: Re: ArrowFileReader failing to read bytes written to Java output stream
Andrew,
Seems like it fails to read the schema. It hasn't reached the data part yet.
Can you share your reader/writer code?
On Tue, Sep 26, 2017 at 6:37 PM, Andrew Pham (BLOOMBERG/ 731 LEX) <
[email protected]> wrote:
> Hello there, I've written something that behaves similarly to:
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala#L73
>
> Except that, for proof-of-concept purposes, it transforms Java objects with
> data into a byte[] payload. The ArrowFileWriter log statements indicate
> that data is getting written to the output stream (a rough sketch of the
> writer code follows the log output below):
>
> 17:53:16.759 [main] DEBUG org.apache.arrow.vector.file.WriteChannel -
> Writing buffer with size: 6
> 17:53:16.759 [main] DEBUG org.apache.arrow.vector.file.WriteChannel -
> Writing buffer with size: 2
> 17:53:16.766 [main] DEBUG org.apache.arrow.vector.file.WriteChannel -
> Writing buffer with size: 4
> 17:53:16.766 [main] DEBUG org.apache.arrow.vector.file.WriteChannel -
> Writing buffer with size: 288
> 17:53:16.766 [main] DEBUG org.apache.arrow.vector.file.WriteChannel -
> Writing buffer with size: 4
> 17:53:16.769 [main] DEBUG org.apache.arrow.vector.schema.ArrowRecordBatch
> - Buffer in RecordBatch at 0, length: 1
> 17:53:16.769 [main] DEBUG org.apache.arrow.vector.schema.ArrowRecordBatch
> - Buffer in RecordBatch at 8, length: 24
> 17:53:16.770 [main] DEBUG org.apache.arrow.vector.schema.ArrowRecordBatch
> - Buffer in RecordBatch at 32, length: 1
> 17:53:16.770 [main] DEBUG org.apache.arrow.vector.schema.ArrowRecordBatch
> - Buffer in RecordBatch at 40, length: 12
> 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel -
> Writing buffer with size: 4
> 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel -
> Writing buffer with size: 216
> 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel -
> Writing buffer with size: 4
> 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel -
> Writing buffer with size: 1
> 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel -
> Writing buffer with size: 7
> 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel -
> Writing buffer with size: 24
> 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel -
> Writing buffer with size: 1
> 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel -
> Writing buffer with size: 7
> 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel -
> Writing buffer with size: 12
> 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel -
> Writing buffer with size: 4
> 17:53:16.772 [main] DEBUG org.apache.arrow.vector.file.ArrowWriter -
> RecordBatch at 304, metadata: 224, body: 56
>
>
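> For reference, the writer side is roughly the following (a simplified
> sketch rather than the exact code; it assumes a VectorSchemaRoot that has
> already been populated from our objects and had setRowCount() called, and
> it uses the 0.7-era package org.apache.arrow.vector.file):
>
> import java.io.ByteArrayOutputStream;
> import java.nio.channels.Channels;
> import org.apache.arrow.vector.VectorSchemaRoot;
> import org.apache.arrow.vector.dictionary.DictionaryProvider;
> import org.apache.arrow.vector.file.ArrowFileWriter;
>
> static byte[] toBytes(VectorSchemaRoot root) throws Exception {
>   ByteArrayOutputStream out = new ByteArrayOutputStream();
>   ArrowFileWriter writer = new ArrowFileWriter(
>       root,
>       new DictionaryProvider.MapDictionaryProvider(), // no dictionary-encoded vectors here
>       Channels.newChannel(out));
>   writer.start();
>   writer.writeBatch();
>   writer.end();   // writes the footer and the trailing magic bytes
>   writer.close();
>   return out.toByteArray();
> }
>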
> However, when I wrap that payload in a ByteArrayReadableSeekableByteChannel
> and use ArrowFileReader (along with a BufferAllocator) to read it,
> ArrowFileReader complains about an invalid format, right at the point where
> I call reader.getVectorSchemaRoot() (the reader code is sketched after the
> stack trace below):
>
> Exception in thread "main"
> org.apache.arrow.vector.file.InvalidArrowFileException:
> missing Magic number [0, 0, 42, 0, 0, 0, 0, 0, 0, 0]
> at org.apache.arrow.vector.file.ArrowFileReader.readSchema(
> ArrowFileReader.java:66)
> at org.apache.arrow.vector.file.ArrowFileReader.readSchema(
> ArrowFileReader.java:37)
> at org.apache.arrow.vector.file.ArrowReader.initialize(
> ArrowReader.java:162)
> at org.apache.arrow.vector.file.ArrowReader.ensureInitialized(
> ArrowReader.java:153)
> at org.apache.arrow.vector.file.ArrowReader.getVectorSchemaRoot(
> ArrowReader.java:67)
> at com.bloomberg.andrew.sql.execution.arrow.ArrowConverters.
> byteArrayToBatch(ArrowConverters.java:89)
> at com.bloomberg.andrew.sql.execution.arrow.ArrowPayload.
> loadBatch(ArrowPayload.java:18)
> at com.bloomberg.andrew.test.arrow.ArrowPublisher.main(
> ArrowPublisher.java:28)
>
>
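> For reference, the reader side is roughly the following (again a simplified
> sketch, not the exact code):
>
> import org.apache.arrow.memory.RootAllocator;
> import org.apache.arrow.vector.VectorSchemaRoot;
> import org.apache.arrow.vector.file.ArrowFileReader;
> import org.apache.arrow.vector.util.ByteArrayReadableSeekableByteChannel;
>
> static VectorSchemaRoot fromBytes(byte[] payload) throws Exception {
>   RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
>   ArrowFileReader reader = new ArrowFileReader(
>       new ByteArrayReadableSeekableByteChannel(payload), allocator);
>   VectorSchemaRoot root = reader.getVectorSchemaRoot(); // throws InvalidArrowFileException here
>   reader.loadNextBatch();
>   return root;
> }
>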
> I'm noticing that the number 42 is exactly the same as the value of the
> very last field/member in the very last object in our list (or equivalently,
> the very last column of the very last row of our table), and this holds
> across the different cases I've tried. Clearly, I'm writing stuff to the
> output stream...but any ideas as to why ArrowReader is struggling? There
> are some ideas floating around about big-endian/little-endian issues, but
> I'm not sure whether that has been addressed or not. Thanks!