https://github.com/apache/arrow/pull/4372

First contribution attempt... sorry in advance if I'm not coloring inside
the lines!

On Wed, May 22, 2019 at 9:06 AM John Muehlhausen <j...@jgm.org> wrote:

> I will submit a patch once I get set up for that.  My crystal ball says
> that some people will rely on sequential rather than seeking access to File
> data.  E.g. a utility that can either write the same data to a File or to
> stdout, piped to another process that can read a File from stdin and
> therefore can't seek() it.  Or someone listening to a file with inotify()
> as it is being written and not wanting to trip over the footer once the
> writer writes it.  As a couple of examples.  In any case we have an
> end-of-stream marker so might as well use it. :)
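>
> For illustration only (not part of the patch): a minimal sketch of the
> "pipe a File into a program" case, written against current Arrow C++ API
> names (arrow::io::StdinStream, RecordBatchStreamReader::Open) -- treat the
> exact spellings as assumptions.  A reader like this never seeks, so without
> an EOS ahead of the footer it cannot tell where the record batches end.
>
>   // Sketch: read Arrow IPC data sequentially from stdin (no seeking).
>   #include <iostream>
>   #include <memory>
>
>   #include <arrow/io/stdio.h>
>   #include <arrow/ipc/reader.h>
>   #include <arrow/record_batch.h>
>   #include <arrow/result.h>
>   #include <arrow/status.h>
>
>   arrow::Status ReadFromStdin() {
>     auto input = std::make_shared<arrow::io::StdinStream>();
>     // Only InputStream semantics are needed here -- no seek().
>     ARROW_ASSIGN_OR_RAISE(
>         auto reader, arrow::ipc::RecordBatchStreamReader::Open(input));
>     std::shared_ptr<arrow::RecordBatch> batch;
>     while (true) {
>       ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
>       if (batch == nullptr) break;  // EOS (or end of input) reached
>       std::cout << "batch with " << batch->num_rows() << " rows\n";
>     }
>     return arrow::Status::OK();
>   }
>
>   int main() { return ReadFromStdin().ok() ? 0 : 1; }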
>
> On Wed, May 22, 2019 at 8:56 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> The file format isn't intended to be used in a streaming setting. Use
>> RecordBatchStreamWriter if you need to be able to read a dataset as a
>> stream.
>>
>> That being said I don't have a problem with writing the EOS in the
>> file, but the current implementation is not "wrong".
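>>
>> For illustration: a sketch of the two writers in question.  The factory
>> names below are from current Arrow C++ (older releases spell them
>> RecordBatchStreamWriter::Open and RecordBatchFileWriter::Open), so treat
>> the exact API as an assumption.  The point is only that Close() on the
>> stream writer emits the end-of-stream marker, while Close() on the file
>> writer emits the footer.
>>
>>   // Sketch: write the same batch in the streaming format and the file format.
>>   #include <memory>
>>
>>   #include <arrow/api.h>
>>   #include <arrow/io/api.h>
>>   #include <arrow/ipc/api.h>
>>   #include <arrow/result.h>
>>
>>   arrow::Status WriteBoth(const std::shared_ptr<arrow::RecordBatch>& batch) {
>>     // Streaming format: schema, batches, then EOS on Close().
>>     ARROW_ASSIGN_OR_RAISE(auto stream_sink,
>>                           arrow::io::FileOutputStream::Open("example.arrows"));
>>     ARROW_ASSIGN_OR_RAISE(auto stream_writer,
>>                           arrow::ipc::MakeStreamWriter(stream_sink, batch->schema()));
>>     ARROW_RETURN_NOT_OK(stream_writer->WriteRecordBatch(*batch));
>>     ARROW_RETURN_NOT_OK(stream_writer->Close());
>>
>>     // File format: same body, but Close() appends footer + footer length + magic.
>>     ARROW_ASSIGN_OR_RAISE(auto file_sink,
>>                           arrow::io::FileOutputStream::Open("example.arrow"));
>>     ARROW_ASSIGN_OR_RAISE(auto file_writer,
>>                           arrow::ipc::MakeFileWriter(file_sink, batch->schema()));
>>     ARROW_RETURN_NOT_OK(file_writer->WriteRecordBatch(*batch));
>>     return file_writer->Close();
>>   }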
>>
>> On Wed, May 22, 2019 at 8:37 AM John Muehlhausen <j...@jgm.org> wrote:
>> >
>> > I believe the change involves updating the File format notes as above, as
>> > well as something like the following.  The format also mentions "there is
>> > no requirement that dictionary keys should be defined in a DictionaryBatch
>> > before they are used in a RecordBatch, as long as the keys are defined
>> > somewhere in the file."  I'm not sure this is a good idea if the file is
>> > enormous.  Isn't the option to process a file as a stream how we keep
>> > memory usage reasonable?
>> >
>> > I will try to figure out a workflow for recommending changes to the code
>> > base.... merge requests etc.  Any tips appreciated.
>> >
>> >   Status Close() override {
>> >     // Close stream for compatibility with sequential readers
>> >     RETURN_NOT_OK(UpdatePosition());
>> >     int32_t end_of_stream_marker = 0;
>> >     RETURN_NOT_OK(Write(&end_of_stream_marker, sizeof(int32_t)));
>> >
>> >     // Write file footer
>> >     RETURN_NOT_OK(UpdatePosition());
>> >     int64_t initial_position = position_;
>> >     RETURN_NOT_OK(
>> >         WriteFileFooter(*schema_, dictionaries_, record_batches_, sink_));
>> >
>> >     // Write footer length
>> >     RETURN_NOT_OK(UpdatePosition());
>> >     int32_t footer_length = static_cast<int32_t>(position_ - initial_position);
>> >     if (footer_length <= 0) {
>> >       return Status::Invalid("Invalid file footer");
>> >     }
>> >
>> >     RETURN_NOT_OK(Write(&footer_length, sizeof(int32_t)));
>> >
>> >     // Write magic bytes to end file
>> >     return Write(kArrowMagicBytes, strlen(kArrowMagicBytes));
>> >   }
>> >
>> >
>> > On Tue, May 21, 2019 at 11:19 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >
>> >
>> > > This seems like a reasonable change.  Is there any reason that we
>> > > shouldn't always append EOS?
>> > >
>> > > On Tuesday, May 21, 2019, John Muehlhausen <j...@jgm.org> wrote:
>> > >
>> > > > Wes,
>> > > >
>> > > > Check out reader.cpp.  It seg faults when it gets to the next
>> > > > message-that-is-not-a-message... it is a footer.  But I have no way to
>> > > > know this in reader.cpp because I'm piping the File in via stdin.
>> > > >
>> > > > In seeker.cpp I seek to the end and figure out where the footer is (this
>> > > > is a py-arrow-written file) and indeed it is at the offset where my
>> > > > "streamed File" reader bombed out.  If EOS were mandatory at this location
>> > > > it would have been fine... I would have said "oh, time for the footer!"
>> > > >
>> > > > Basically what I'm saying is that we can't assume that a File won't be
>> > > > processed as a stream.  In an actual non-file stream the reader stops at
>> > > > either an explicit EOS or the physical end of the stream.  But with a
>> > > > file-as-stream there is more data after the record batches, and we have
>> > > > to know we aren't in the stream portion anymore.
>> > > >
>> > > > Otherwise we've locked the File use-cases into those where the File isn't
>> > > > streamed -- i.e. is seekable.  See what I'm saying?  For reader.cpp to
>> > > > have been functional it would have had to read the entire File into a
>> > > > buffer before parsing, since it could not seek().  This could be easily
>> > > > avoided with a mandatory EOS in the File format.  Basically:
>> > > >
>> > > > <magic number "ARROW1">
>> > > > <empty padding bytes [to 8 byte boundary]>
>> > > > <STREAMING FORMAT>
>> > > > <EOS if not in stream>
>> > > > <FOOTER>
>> > > > <FOOTER SIZE: int32>
>> > > > <magic number "ARROW1">
>> > > >
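>> > > >
>> > > > For illustration: with a mandatory EOS in that position, a File piped
>> > > > into a program can be consumed as "8-byte preamble + stream", and the
>> > > > stream reader stops at EOS on its own, never touching the footer.  This
>> > > > sketch assumes current Arrow C++ API names and that dictionaries are
>> > > > written before the batches that use them (as the writer does today);
>> > > > treat both as assumptions.
>> > > >
>> > > >   // Sketch: read a File arriving on stdin as preamble + stream.
>> > > >   #include <memory>
>> > > >
>> > > >   #include <arrow/buffer.h>
>> > > >   #include <arrow/io/stdio.h>
>> > > >   #include <arrow/ipc/reader.h>
>> > > >   #include <arrow/record_batch.h>
>> > > >   #include <arrow/result.h>
>> > > >   #include <arrow/status.h>
>> > > >
>> > > >   arrow::Status ReadFileFromPipe() {
>> > > >     auto input = std::make_shared<arrow::io::StdinStream>();
>> > > >     // Consume <magic "ARROW1"> plus padding to the 8-byte boundary.
>> > > >     ARROW_ASSIGN_OR_RAISE(auto preamble, input->Read(8));
>> > > >     if (preamble->size() != 8) {
>> > > >       return arrow::Status::Invalid("truncated preamble");
>> > > >     }
>> > > >     // Everything up to EOS is the streaming format.
>> > > >     ARROW_ASSIGN_OR_RAISE(
>> > > >         auto reader, arrow::ipc::RecordBatchStreamReader::Open(input));
>> > > >     std::shared_ptr<arrow::RecordBatch> batch;
>> > > >     do {
>> > > >       ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
>> > > >     } while (batch != nullptr);  // nullptr once EOS is hit
>> > > >     // The footer and trailing magic are simply left unread.
>> > > >     return arrow::Status::OK();
>> > > >   }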
>> > > > -John
>> > > >
>> > > > On Tue, May 21, 2019 at 4:44 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> > > >
>> > > >> hi John,
>> > > >>
>> > > >> I'm not sure I follow. The EOS you're referring to is part of the
>> > > >> streaming format. It's designed to be readable using an InputStream
>> > > >> interface that does not support seeking at all. You can see the core
>> > > >> logic where messages are popped off the InputStream here
>> > > >>
>> > > >> https://github.com/apache/arrow/blob/6f80ea4928f0d26ca175002f2e9f511962c8b012/cpp/src/arrow/ipc/message.cc#L281
>> > > >>
>> > > >> If the end of the byte stream is reached, or EOS (0) is encountered,
>> > > >> then the stream reader stops iteration.
>> > > >>
>> > > >> - Wes
>> > > >>
>> > > >> On Tue, May 21, 2019 at 4:34 PM John Muehlhausen <j...@jgm.org> wrote:
>> > > >> >
>> > > >> > https://arrow.apache.org/docs/format/IPC.html#file-format
>> > > >> >
>> > > >> > <EOS [optional]: int32>
>> > > >> >
>> > > >> > If this stream marker is optional in the file format, doesn't this
>> > > >> > prevent someone from reading the file without being able to seek() it,
>> > > >> > e.g. if it is "piped in" to a program?  Or otherwise they'll have to
>> > > >> > stream in the entire thing before they can start parsing?
>> > > >> >
>> > > >> > Any reason it can't be mandatory for a File?
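>> > > >> >
>> > > >> > For illustration, the workaround being described: with no EOS to stop
>> > > >> > at, a piped consumer has to slurp the entire File into memory first and
>> > > >> > only then use the seekable file reader.  Current Arrow C++ API names
>> > > >> > (Buffer::FromString, io::BufferReader, RecordBatchFileReader::Open) are
>> > > >> > assumed.
>> > > >> >
>> > > >> >   // Sketch: buffer all of stdin, then open it with the file reader.
>> > > >> >   #include <iostream>
>> > > >> >   #include <iterator>
>> > > >> >   #include <memory>
>> > > >> >   #include <string>
>> > > >> >
>> > > >> >   #include <arrow/buffer.h>
>> > > >> >   #include <arrow/io/memory.h>
>> > > >> >   #include <arrow/ipc/reader.h>
>> > > >> >   #include <arrow/record_batch.h>
>> > > >> >   #include <arrow/result.h>
>> > > >> >   #include <arrow/status.h>
>> > > >> >
>> > > >> >   arrow::Status BufferEverythingThenRead() {
>> > > >> >     // Read all of stdin up front -- unavoidable if EOS cannot be detected.
>> > > >> >     std::string bytes((std::istreambuf_iterator<char>(std::cin)),
>> > > >> >                       std::istreambuf_iterator<char>());
>> > > >> >     auto buffer = arrow::Buffer::FromString(std::move(bytes));
>> > > >> >     auto source = std::make_shared<arrow::io::BufferReader>(buffer);
>> > > >> >     // Now the footer is reachable, because BufferReader can seek.
>> > > >> >     ARROW_ASSIGN_OR_RAISE(
>> > > >> >         auto reader, arrow::ipc::RecordBatchFileReader::Open(source));
>> > > >> >     for (int i = 0; i < reader->num_record_batches(); ++i) {
>> > > >> >       ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(i));
>> > > >> >       std::cout << "batch " << i << ": " << batch->num_rows() << " rows\n";
>> > > >> >     }
>> > > >> >     return arrow::Status::OK();
>> > > >> >   }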
>> > > >> >
>> > > >> > -John
>> > > >>
>> > > >
>> > >
>>
>
