Yes, I think separate JIRA issues for Java and JS would be best. I'd
recommend having one patch for each, so maybe we can just sort out C++
for now

On Wed, May 22, 2019 at 3:03 PM John Muehlhausen <j...@jgm.org> wrote:
>
> I added this to https://github.com/apache/arrow/pull/4372 and am hoping CI
> will test it for me.  Do Java/JS require separate JIRA entries?
>
> On Wed, May 22, 2019 at 2:53 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > It doesn't appear that Java writes an EOS (4-byte 0) after the last
> > record batch in ArrowFileWriter.java [1] so may want to open a JIRA to
> > make this simple change.
> >
> > Note that this addition (not even a change) is backward compatible,
> > old readers will be unaffected
> >
> > [1]:
> > https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowFileWriter.java#L67
> >
> > On Wed, May 22, 2019 at 12:24 PM John Muehlhausen <j...@jgm.org> wrote:
> > >
> > > https://github.com/apache/arrow/pull/4372
> > >
> > > First contribution attempt... sorry in advance if I'm not coloring inside
> > > the lines!
> > >
> > > On Wed, May 22, 2019 at 9:06 AM John Muehlhausen <j...@jgm.org> wrote:
> > >
> > > > I will submit a patch once I get set up for that.  My crystal ball says
> > > > that some people will rely on sequential rather than seeking access to
> > File
> > > > data.  E.g. a utility that can either write the same data to a File or
> > to
> > > > stdout, piped to another process that can read a File from stdin and
> > > > therefore can't seek() it.  Or someone listening to a file with
> > inotify()
> > > > as it is being written and not wanting to trip over the footer once the
> > > > writer writes it.  As a couple of examples.  In any case we have an
> > > > end-of-stream marker so might as well use it. :)
> > > >
> > > > On Wed, May 22, 2019 at 8:56 AM Wes McKinney <wesmck...@gmail.com>
> > wrote:
> > > >
> > > >> The file format isn't intended to be used in a streaming setting. Use
> > > >> RecordBatchStreamWriter if you need to be able to read a dataset as a
> > > >> stream
> > > >>
> > > >> That being said I don't have a problem with writing the EOS in the
> > > >> file, but the current implementation is not "wrong".
> > > >>
> > > >> On Wed, May 22, 2019 at 8:37 AM John Muehlhausen <j...@jgm.org> wrote:
> > > >> >
> > > >> > I believe the change involves updating the File format notes as
> > above,
> > > >> as
> > > >> > well as something like the following.  The format also mentions
> > "there
> > > >> is
> > > >> > no requirement that dictionary keys should be defined in a
> > > >> DictionaryBatch
> > > >> > before they are used in a RecordBatch, as long as the keys are
> > defined
> > > >> > somewhere in the file."  I'm not sure this is a good idea if the
> > file is
> > > >> > enormous.  The option to process a file as a stream is how we keep
> > > >> memory
> > > >> > usage reasonable?
> > > >> >
> > > >> > I will try to figure out a workflow for recommending changes to the
> > code
> > > >> > base.... merge requests etc.  Any tips appreciated.
> > > >> >
> > > >> >   Status Close() override {
> > > >> > *    // Close stream for compatibility with sequential readers*
> > > >> > *    RETURN_NOT_OK(UpdatePosition());*
> > > >> > *    int32_t end_of_stream_marker = 0;*
> > > >> > *    RETURN_NOT_OK(Write(&end_of_stream_marker, sizeof(int32_t)));*
> > > >> >
> > > >> >     // Write file footer
> > > >> >     RETURN_NOT_OK(UpdatePosition());
> > > >> >     int64_t initial_position = position_;
> > > >> >     RETURN_NOT_OK(WriteFileFooter(*schema_, dictionaries_,
> > > >> record_batches_,
> > > >> > sink_));
> > > >> >
> > > >> >     // Write footer length
> > > >> >     RETURN_NOT_OK(UpdatePosition());
> > > >> >     int32_t footer_length = static_cast<int32_t>(position_ -
> > > >> > initial_position);
> > > >> >     if (footer_length <= 0) {
> > > >> >       return Status::Invalid("Invalid file footer");
> > > >> >     }
> > > >> >
> > > >> >     RETURN_NOT_OK(Write(&footer_length, sizeof(int32_t)));
> > > >> >
> > > >> >     // Write magic bytes to end file
> > > >> >     return Write(kArrowMagicBytes, strlen(kArrowMagicBytes));
> > > >> >   }
> > > >> >
> > > >> >
> > > >> > On Tue, May 21, 2019 at 11:19 PM Micah Kornfield <
> > emkornfi...@gmail.com
> > > >> >
> > > >> > wrote:
> > > >> >
> > > >> > > This seems like a reasonable change.  Is there any reason that we
> > > >> shouldnt
> > > >> > > always append EOS?
> > > >> > >
> > > >> > > On Tuesday, May 21, 2019, John Muehlhausen <j...@jgm.org> wrote:
> > > >> > >
> > > >> > > > Wes,
> > > >> > > >
> > > >> > > > Check out reader.cpp.  It seg faults when it gets to the next
> > > >> > > > message-that-is-not-a-message... it is a footer.  But I have no
> > way
> > > >> to
> > > >> > > > know this in reader.cpp because I'm piping the File in via
> > stdin.
> > > >> > > >
> > > >> > > > In seeker.cpp I seek to the end and figure out where the footer
> > is
> > > >> (this
> > > >> > > > is a py-arrow-written file) and indeed it is at the offset
> > where my
> > > >> > > > "streamed File" reader bombed out.  If EOS were mandatory at
> > this
> > > >> > > location
> > > >> > > > it would have been fine... I would have said "oh, time for the
> > > >> footer!"
> > > >> > > >
> > > >> > > > Basically what I'm saying is that we can't assume that File
> > won't be
> > > >> > > > processed as a stream.  In an actual non-file stream it is
> > either
> > > >> EOS or
> > > >> > > > end-of-stream.  But with a file-as-stream there is more data
> > and we
> > > >> have
> > > >> > > to
> > > >> > > > know it isn't the stream anymore.
> > > >> > > >
> > > >> > > > Otherwise we've locked the File use-cases into those where the
> > File
> > > >> isn't
> > > >> > > > streamed -- i.e. is seekable.  See what I'm saying?  For
> > reader.cpp
> > > >> to
> > > >> > > have
> > > >> > > > been functional it would have had to read the entire File into a
> > > >> buffer
> > > >> > > > before parsing, since it could not seek().  This could be easily
> > > >> avoided
> > > >> > > > with a mandatory EOS in the File format.  Basically:
> > > >> > > >
> > > >> > > > <magic number "ARROW1">
> > > >> > > > <empty padding bytes [to 8 byte boundary]>
> > > >> > > > <STREAMING FORMAT>
> > > >> > > > *<EOS if not in stream>*
> > > >> > > > <FOOTER>
> > > >> > > > <FOOTER SIZE: int32>
> > > >> > > > <magic number "ARROW1">
> > > >> > > >
> > > >> > > > -John
> > > >> > > >
> > > >> > > > On Tue, May 21, 2019 at 4:44 PM Wes McKinney <
> > wesmck...@gmail.com>
> > > >> > > wrote:
> > > >> > > >
> > > >> > > >> hi John,
> > > >> > > >>
> > > >> > > >> I'm not sure I follow. The EOS you're referring to is part of
> > the
> > > >> > > >> streaming format. It's designed to be readable using an
> > InputStream
> > > >> > > >> interface that does not support seeking at all. You can see the
> > > >> core
> > > >> > > >> logic where messages are popped off the InputStream here
> > > >> > > >>
> > > >> > > >>
> > > >> https://github.com/apache/arrow/blob/6f80ea4928f0d26ca175002f2e9f51
> > > >> > > >> 1962c8b012/cpp/src/arrow/ipc/message.cc#L281
> > > >> > > >>
> > > >> > > >> If the end of the byte stream is reached, or EOS (0) is
> > > >> encountered,
> > > >> > > >> then the stream reader stops iteration.
> > > >> > > >>
> > > >> > > >> - Wes
> > > >> > > >>
> > > >> > > >> On Tue, May 21, 2019 at 4:34 PM John Muehlhausen <j...@jgm.org>
> > > >> wrote:
> > > >> > > >> >
> > > >> > > >> > https://arrow.apache.org/docs/format/IPC.html#file-format
> > > >> > > >> >
> > > >> > > >> > <EOS [optional]: int32>
> > > >> > > >> >
> > > >> > > >> > If this stream marker is optional in the file format, doesn't
> > > >> this
> > > >> > > >> prevent
> > > >> > > >> > someone from reading the file without being able to seek()
> > it,
> > > >> e.g. if
> > > >> > > >> it
> > > >> > > >> > is "piped in" to a program?  Or otherwise they'll have to
> > stream
> > > >> in
> > > >> > > the
> > > >> > > >> > entire thing before they can start parsing?
> > > >> > > >> >
> > > >> > > >> > Any reason it can't be mandatory for a File?
> > > >> > > >> >
> > > >> > > >> > -John
> > > >> > > >>
> > > >> > > >
> > > >> > >
> > > >>
> > > >
> >

Reply via email to