Wes,

Check out reader.cpp.  It segfaults when it gets to the next
message-that-is-not-a-message, which is actually the footer.  But reader.cpp
has no way to know this because I'm piping the File in via stdin.

In seeker.cpp I seek to the end and figure out where the footer is (this is
a pyarrow-written file) and indeed it is at the offset where my "streamed
File" reader bombed out.  If EOS were mandatory at this location it would
have been fine... I would have said "oh, time for the footer!"
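
For reference, the seek-based lookup is roughly the following.  This is a
Python sketch rather than the actual seeker.cpp; locate_footer is a name made
up here, and it relies only on the File ending with
<FOOTER SIZE: int32><magic number "ARROW1">, with the int32 little-endian.

```python
import io
import struct

MAGIC = b"ARROW1"


def locate_footer(f):
    """Return (footer_offset, footer_size) for a seekable Arrow File.

    Hypothetical helper for illustration; it needs seek(), so it cannot
    work on a pipe.
    """
    # The trailer is <FOOTER SIZE: int32><magic "ARROW1">: the last 10 bytes.
    trailer_len = 4 + len(MAGIC)
    file_size = f.seek(0, io.SEEK_END)
    if file_size < 8 + trailer_len:
        raise ValueError("too short to be an Arrow File")

    f.seek(file_size - trailer_len)
    trailer = f.read(trailer_len)
    if trailer[4:] != MAGIC:
        raise ValueError("missing trailing magic number")

    (footer_size,) = struct.unpack("<i", trailer[:4])  # little-endian int32
    footer_offset = file_size - trailer_len - footer_size
    # The footer cannot start before the 8-byte header (magic + padding).
    if footer_size < 0 or footer_offset < 8:
        raise ValueError("implausible footer size")
    return footer_offset, footer_size
```

On a real file this would be called on a handle opened in binary mode, e.g.
open('/tmp/test.batch', 'rb').  None of it is possible on stdin, which is the
whole problem.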

Basically what I'm saying is that we can't assume that File won't be
processed as a stream.  In an actual non-file stream the messages end with
either EOS or the end of the byte stream.  But with a file-as-stream there is
more data after the last message, and we have to know it isn't part of the
stream anymore.

Otherwise we've locked the File use-cases into those where the File isn't
streamed -- i.e. is seekable.  See what I'm saying?  For reader.cpp to have
been functional it would have had to read the entire File into a buffer
before parsing, since it could not seek().  This could be easily avoided
with a mandatory EOS in the File format.  Basically:

<magic number "ARROW1">
<empty padding bytes [to 8 byte boundary]>
<STREAMING FORMAT>
*<EOS if not in stream>*
<FOOTER>
<FOOTER SIZE: int32>
<magic number "ARROW1">
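
To illustrate what a forward-only reader could do with that layout, here is a
toy Python sketch.  It is not the real IPC framing: each "message" is assumed
to be just an int32 length followed by that many bytes (real Arrow messages
also carry a body whose size comes from the Flatbuffer metadata), and
split_file_as_stream and read_exact are hypothetical names.  The point is
only that a zero length prefix (EOS) tells the reader the footer comes next.

```python
import io
import struct

MAGIC = b"ARROW1"


def read_exact(stream, n):
    """Read exactly n bytes from a forward-only stream or raise EOFError."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("stream ended in the middle of a field")
    return data


def split_file_as_stream(stream):
    """Read a File front to back without seek(); return (frames, footer).

    Assumes simplified framing: <int32 length><payload>, with length 0 as
    the mandatory EOS marker before the footer.
    """
    # Header: magic number plus padding to an 8-byte boundary.
    if read_exact(stream, 8)[:6] != MAGIC:
        raise ValueError("missing leading magic number")

    frames = []
    while True:
        (length,) = struct.unpack("<i", read_exact(stream, 4))
        if length == 0:  # EOS: the stream part is over
            break
        frames.append(read_exact(stream, length))

    # Everything after EOS is <FOOTER><FOOTER SIZE: int32><magic>.
    rest = stream.read()
    if len(rest) < 10 or rest[-6:] != MAGIC:
        raise ValueError("missing trailing magic number")
    (footer_size,) = struct.unpack("<i", rest[-10:-6])
    footer = rest[:-10]
    if footer_size != len(footer):
        raise ValueError("footer size does not match bytes after EOS")
    return frames, footer
```

Fed from a pipe, e.g. split_file_as_stream(sys.stdin.buffer), this never
calls seek().  Without the EOS, the loop would take the first four footer
bytes as a message length, which is the kind of failure described above.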

-John

On Tue, May 21, 2019 at 4:44 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi John,
>
> I'm not sure I follow. The EOS you're referring to is part of the
> streaming format. It's designed to be readable using an InputStream
> interface that does not support seeking at all. You can see the core
> logic where messages are popped off the InputStream here
>
>
> https://github.com/apache/arrow/blob/6f80ea4928f0d26ca175002f2e9f511962c8b012/cpp/src/arrow/ipc/message.cc#L281
>
> If the end of the byte stream is reached, or EOS (0) is encountered,
> then the stream reader stops iteration.
>
> - Wes
>
> On Tue, May 21, 2019 at 4:34 PM John Muehlhausen <j...@jgm.org> wrote:
> >
> > https://arrow.apache.org/docs/format/IPC.html#file-format
> >
> > <EOS [optional]: int32>
> >
> > If this stream marker is optional in the file format, doesn't this prevent
> > someone from reading the file without being able to seek() it, e.g. if it
> > is "piped in" to a program?  Or otherwise they'll have to stream in the
> > entire thing before they can start parsing?
> >
> > Any reason it can't be mandatory for a File?
> >
> > -John
>
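import pyarrow as pa

# One record batch with a single nullable int32 column.
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, None], type=pa.int32())],
    ['field1'],
)

# Write the batch in the Arrow File format.
with open('/tmp/test.batch', 'wb') as sink:
    writer = pa.RecordBatchFileWriter(sink, batch.schema)
    writer.write_batch(batch)
    writer.close()

# Read it back with the seek-based File reader.
df = pa.ipc.open_file('/tmp/test.batch').read_pandas()
print(df)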
