I believe the change involves updating the File format notes as above, as
well as something like the following.  The format also mentions "there is
no requirement that dictionary keys should be defined in a DictionaryBatch
before they are used in a RecordBatch, as long as the keys are defined
somewhere in the file."  I'm not sure this is a good idea if the file is
enormous.  The option to process a file as a stream is how we keep memory
usage reasonable?

I will try to figure out a workflow for recommending changes to the code
base.... merge requests etc.  Any tips appreciated.

  Status Close() override {
*    // Close stream for compatibility with sequential readers*
*    RETURN_NOT_OK(UpdatePosition());*
*    int32_t end_of_stream_marker = 0;*
*    RETURN_NOT_OK(Write(&end_of_stream_marker, sizeof(int32_t)));*

    // Write file footer
    RETURN_NOT_OK(UpdatePosition());
    int64_t initial_position = position_;
    RETURN_NOT_OK(WriteFileFooter(*schema_, dictionaries_, record_batches_,
sink_));

    // Write footer length
    RETURN_NOT_OK(UpdatePosition());
    int32_t footer_length = static_cast<int32_t>(position_ -
initial_position);
    if (footer_length <= 0) {
      return Status::Invalid("Invalid file footer");
    }

    RETURN_NOT_OK(Write(&footer_length, sizeof(int32_t)));

    // Write magic bytes to end file
    return Write(kArrowMagicBytes, strlen(kArrowMagicBytes));
  }


On Tue, May 21, 2019 at 11:19 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> This seems like a reasonable change.  Is there any reason that we shouldnt
> always append EOS?
>
> On Tuesday, May 21, 2019, John Muehlhausen <j...@jgm.org> wrote:
>
> > Wes,
> >
> > Check out reader.cpp.  It seg faults when it gets to the next
> > message-that-is-not-a-message... it is a footer.  But I have no way to
> > know this in reader.cpp because I'm piping the File in via stdin.
> >
> > In seeker.cpp I seek to the end and figure out where the footer is (this
> > is a py-arrow-written file) and indeed it is at the offset where my
> > "streamed File" reader bombed out.  If EOS were mandatory at this
> location
> > it would have been fine... I would have said "oh, time for the footer!"
> >
> > Basically what I'm saying is that we can't assume that File won't be
> > processed as a stream.  In an actual non-file stream it is either EOS or
> > end-of-stream.  But with a file-as-stream there is more data and we have
> to
> > know it isn't the stream anymore.
> >
> > Otherwise we've locked the File use-cases into those where the File isn't
> > streamed -- i.e. is seekable.  See what I'm saying?  For reader.cpp to
> have
> > been functional it would have had to read the entire File into a buffer
> > before parsing, since it could not seek().  This could be easily avoided
> > with a mandatory EOS in the File format.  Basically:
> >
> > <magic number "ARROW1">
> > <empty padding bytes [to 8 byte boundary]>
> > <STREAMING FORMAT>
> > *<EOS if not in stream>*
> > <FOOTER>
> > <FOOTER SIZE: int32>
> > <magic number "ARROW1">
> >
> > -John
> >
> > On Tue, May 21, 2019 at 4:44 PM Wes McKinney <wesmck...@gmail.com>
> wrote:
> >
> >> hi John,
> >>
> >> I'm not sure I follow. The EOS you're referring to is part of the
> >> streaming format. It's designed to be readable using an InputStream
> >> interface that does not support seeking at all. You can see the core
> >> logic where messages are popped off the InputStream here
> >>
> >> https://github.com/apache/arrow/blob/6f80ea4928f0d26ca175002f2e9f51
> >> 1962c8b012/cpp/src/arrow/ipc/message.cc#L281
> >>
> >> If the end of the byte stream is reached, or EOS (0) is encountered,
> >> then the stream reader stops iteration.
> >>
> >> - Wes
> >>
> >> On Tue, May 21, 2019 at 4:34 PM John Muehlhausen <j...@jgm.org> wrote:
> >> >
> >> > https://arrow.apache.org/docs/format/IPC.html#file-format
> >> >
> >> > <EOS [optional]: int32>
> >> >
> >> > If this stream marker is optional in the file format, doesn't this
> >> prevent
> >> > someone from reading the file without being able to seek() it, e.g. if
> >> it
> >> > is "piped in" to a program?  Or otherwise they'll have to stream in
> the
> >> > entire thing before they can start parsing?
> >> >
> >> > Any reason it can't be mandatory for a File?
> >> >
> >> > -John
> >>
> >
>

Reply via email to