Oh, I think I see the problem. You need to put "writer.close();" in a finally block instead of a catch block in ArrowPayloadIterator. close() is what ends the file: it writes the footer (which is where ArrowFileReader reads the schema from) and the trailing magic bytes, so until it runs the reader can't find the schema.
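
Roughly what I mean (just a sketch, not your actual code -- I'm assuming a VectorSchemaRoot named "root" that you've already populated the way ArrowPayloadIterator does, and the class/method names here are made up):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.file.ArrowFileWriter;

public class ArrowWriteSketch {
    // Serialize the current contents of "root" into a byte[] in the Arrow file format.
    public static byte[] rootToBytes(VectorSchemaRoot root) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // null DictionaryProvider should be fine here since nothing is dictionary-encoded.
        ArrowFileWriter writer = new ArrowFileWriter(root, null, Channels.newChannel(out));
        try {
            writer.start();       // writes the leading "ARROW1" magic + schema message
            writer.writeBatch();  // call once per record batch you load into root
        } finally {
            // close() ends the file: it writes the footer (schema + record batch offsets)
            // and the trailing magic bytes that ArrowFileReader.readSchema() looks for.
            writer.close();
        }
        return out.toByteArray();
    }
}

ArrowFileWriter is AutoCloseable, so a try-with-resources block works just as well if you'd rather not write the finally yourself. Once the footer is there, your ByteArrayReadableSeekableByteChannel + ArrowFileReader path should be able to locate the schema.
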
On Thu, Sep 28, 2017 at 9:27 AM, Andrew Pham (BLOOMBERG/ 731 LEX) <[email protected]> wrote:

> Ah, looks like it was stripped for some reason. Check out:
>
> https://pastebin.com/4abb2txs
> https://pastebin.com/53UnimQ6
> https://pastebin.com/KwmP7Ens
>
> My hunch is that I'm writing to the output stream with an incorrectly
> determined Arrow Schema, and so the reader can't pick it up. If that's the
> case, can anyone verify/let me know what to do for a fix? Currently, I'm
> just serializing a list of objects (each containing an Integer and Double)
> and writing that onto the output stream...that seems to be taken care of.
> Cheers
>
> From: [email protected] At: 09/27/17 16:56:41 To: Andrew Pham
> (BLOOMBERG/ 731 LEX), [email protected]
> Subject: Re: ArrowFileReader failing to read bytes written to Java output stream
>
> Hi Andrew,
>
> I do not see the attached code, maybe the attachments got stripped? Is it
> small enough to just inline in the message?
>
> Bryan
>
> On Wed, Sep 27, 2017 at 12:26 PM, Andrew Pham (BLOOMBERG/ 731 LEX) <[email protected]> wrote:
>
> > Also for reference, this is apparently the Arrow Schema used by the
> > ArrowFileWriter to write to the output stream (given by
> > root.getSchema().toString() and root.getSchema().toJson()):
> >
> > Schema<price: FloatingPoint(DOUBLE), numShares: Int(32, true)>
> > {
> >   "fields" : [ {
> >     "name" : "price",
> >     "nullable" : true,
> >     "type" : {
> >       "name" : "floatingpoint",
> >       "precision" : "DOUBLE"
> >     },
> >     "children" : [ ],
> >     "typeLayout" : {
> >       "vectors" : [ {
> >         "type" : "VALIDITY",
> >         "typeBitWidth" : 1
> >       }, {
> >         "type" : "DATA",
> >         "typeBitWidth" : 64
> >       } ]
> >     }
> >   }, {
> >     "name" : "numShares",
> >     "nullable" : true,
> >     "type" : {
> >       "name" : "int",
> >       "bitWidth" : 32,
> >       "isSigned" : true
> >     },
> >     "children" : [ ],
> >     "typeLayout" : {
> >       "vectors" : [ {
> >         "type" : "VALIDITY",
> >         "typeBitWidth" : 1
> >       }, {
> >         "type" : "DATA",
> >         "typeBitWidth" : 32
> >       } ]
> >     }
> >   } ]
> > }
> >
> > Given our bytes (wrapped by a SeekableByteChannel), the reader is unable
> > to obtain the schema from this. Any ideas as to what could be happening?
> > Cheers!
> >
> > From: [email protected] At: 09/26/17 18:59:18 To: Andrew Pham
> > (BLOOMBERG/ 731 LEX), [email protected]
> > Subject: Re: ArrowFileReader failing to read bytes written to Java output stream
> >
> > Andrew,
> >
> > Seems like it fails to read the schema. It hasn't reached the data part yet.
> > Can you share your reader/writer code?
> >
> > On Tue, Sep 26, 2017 at 6:37 PM, Andrew Pham (BLOOMBERG/ 731 LEX) <[email protected]> wrote:
> >
> > > Hello there, I've written something that behaves similarly to:
> > >
> > > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala#L73
> > >
> > > Except that for proof of concept purposes, it transforms Java objects
> > > with data into a byte[] payload. The ArrowFileWriter log statements
> > > indicate that data is getting written to the output stream:
> > >
> > > 17:53:16.759 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 6
> > > 17:53:16.759 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 2
> > > 17:53:16.766 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 4
> > > 17:53:16.766 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 288
> > > 17:53:16.766 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 4
> > > 17:53:16.769 [main] DEBUG org.apache.arrow.vector.schema.ArrowRecordBatch - Buffer in RecordBatch at 0, length: 1
> > > 17:53:16.769 [main] DEBUG org.apache.arrow.vector.schema.ArrowRecordBatch - Buffer in RecordBatch at 8, length: 24
> > > 17:53:16.770 [main] DEBUG org.apache.arrow.vector.schema.ArrowRecordBatch - Buffer in RecordBatch at 32, length: 1
> > > 17:53:16.770 [main] DEBUG org.apache.arrow.vector.schema.ArrowRecordBatch - Buffer in RecordBatch at 40, length: 12
> > > 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 4
> > > 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 216
> > > 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 4
> > > 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 1
> > > 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 7
> > > 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 24
> > > 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 1
> > > 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 7
> > > 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 12
> > > 17:53:16.771 [main] DEBUG org.apache.arrow.vector.file.WriteChannel - Writing buffer with size: 4
> > > 17:53:16.772 [main] DEBUG org.apache.arrow.vector.file.ArrowWriter - RecordBatch at 304, metadata: 224, body: 56
> > >
> > > However, when I wrap that payload into a ByteArrayReadableSeekableByteChannel
> > > and use ArrowFileReader (along with a BufferAllocator) to read it,
> > > ArrowFileReader is complaining that it's reading an invalid format, right
> > > at the point where I call reader.getVectorSchemaRoot():
> > >
> > > Exception in thread "main" org.apache.arrow.vector.file.InvalidArrowFileException: missing Magic number [0, 0, 42, 0, 0, 0, 0, 0, 0, 0]
> > >     at org.apache.arrow.vector.file.ArrowFileReader.readSchema(ArrowFileReader.java:66)
> > >     at org.apache.arrow.vector.file.ArrowFileReader.readSchema(ArrowFileReader.java:37)
> > >     at org.apache.arrow.vector.file.ArrowReader.initialize(ArrowReader.java:162)
> > >     at org.apache.arrow.vector.file.ArrowReader.ensureInitialized(ArrowReader.java:153)
> > >     at org.apache.arrow.vector.file.ArrowReader.getVectorSchemaRoot(ArrowReader.java:67)
> > >     at com.bloomberg.andrew.sql.execution.arrow.ArrowConverters.byteArrayToBatch(ArrowConverters.java:89)
> > >     at com.bloomberg.andrew.sql.execution.arrow.ArrowPayload.loadBatch(ArrowPayload.java:18)
> > >     at com.bloomberg.andrew.test.arrow.ArrowPublisher.main(ArrowPublisher.java:28)
> > >
> > > I'm noticing that the number 42 is exactly the same as the value of the
> > > very last field/member in the very last object in our list (or
> > > equivalently, the very last column of the very last row of our table),
> > > and when I try out a bunch of different cases, the same pattern holds.
> > > Clearly, I'm writing stuff to the output stream...but any ideas as to
> > > why ArrowReader is struggling? There are some ideas regarding big
> > > endian/little endian stuff, but I'm not sure if that was addressed or
> > > not. Thanks!
