https://github.com/apache/arrow/pull/4372
First contribution attempt... sorry in advance if I'm not coloring inside the lines! On Wed, May 22, 2019 at 9:06 AM John Muehlhausen <j...@jgm.org> wrote: > I will submit a patch once I get set up for that. My crystal ball says > that some people will rely on sequential rather than seeking access to File > data. E.g. a utility that can either write the same data to a File or to > stdout, piped to another process that can read a File from stdin and > therefore can't seek() it. Or someone listening to a file with inotify() > as it is being written and not wanting to trip over the footer once the > writer writes it. As a couple of examples. In any case we have an > end-of-stream marker so might as well use it. :) > > On Wed, May 22, 2019 at 8:56 AM Wes McKinney <wesmck...@gmail.com> wrote: > >> The file format isn't intended to be used in a streaming setting. Use >> RecordBatchStreamWriter if you need to be able to read a dataset as a >> stream >> >> That being said I don't have a problem with writing the EOS in the >> file, but the current implementation is not "wrong". >> >> On Wed, May 22, 2019 at 8:37 AM John Muehlhausen <j...@jgm.org> wrote: >> > >> > I believe the change involves updating the File format notes as above, >> as >> > well as something like the following. The format also mentions "there >> is >> > no requirement that dictionary keys should be defined in a >> DictionaryBatch >> > before they are used in a RecordBatch, as long as the keys are defined >> > somewhere in the file." I'm not sure this is a good idea if the file is >> > enormous. The option to process a file as a stream is how we keep >> memory >> > usage reasonable? >> > >> > I will try to figure out a workflow for recommending changes to the code >> > base.... merge requests etc. Any tips appreciated. >> > >> > Status Close() override { >> > * // Close stream for compatibility with sequential readers* >> > * RETURN_NOT_OK(UpdatePosition());* >> > * int32_t end_of_stream_marker = 0;* >> > * RETURN_NOT_OK(Write(&end_of_stream_marker, sizeof(int32_t)));* >> > >> > // Write file footer >> > RETURN_NOT_OK(UpdatePosition()); >> > int64_t initial_position = position_; >> > RETURN_NOT_OK(WriteFileFooter(*schema_, dictionaries_, >> record_batches_, >> > sink_)); >> > >> > // Write footer length >> > RETURN_NOT_OK(UpdatePosition()); >> > int32_t footer_length = static_cast<int32_t>(position_ - >> > initial_position); >> > if (footer_length <= 0) { >> > return Status::Invalid("Invalid file footer"); >> > } >> > >> > RETURN_NOT_OK(Write(&footer_length, sizeof(int32_t))); >> > >> > // Write magic bytes to end file >> > return Write(kArrowMagicBytes, strlen(kArrowMagicBytes)); >> > } >> > >> > >> > On Tue, May 21, 2019 at 11:19 PM Micah Kornfield <emkornfi...@gmail.com >> > >> > wrote: >> > >> > > This seems like a reasonable change. Is there any reason that we >> shouldnt >> > > always append EOS? >> > > >> > > On Tuesday, May 21, 2019, John Muehlhausen <j...@jgm.org> wrote: >> > > >> > > > Wes, >> > > > >> > > > Check out reader.cpp. It seg faults when it gets to the next >> > > > message-that-is-not-a-message... it is a footer. But I have no way >> to >> > > > know this in reader.cpp because I'm piping the File in via stdin. >> > > > >> > > > In seeker.cpp I seek to the end and figure out where the footer is >> (this >> > > > is a py-arrow-written file) and indeed it is at the offset where my >> > > > "streamed File" reader bombed out. If EOS were mandatory at this >> > > location >> > > > it would have been fine... I would have said "oh, time for the >> footer!" >> > > > >> > > > Basically what I'm saying is that we can't assume that File won't be >> > > > processed as a stream. In an actual non-file stream it is either >> EOS or >> > > > end-of-stream. But with a file-as-stream there is more data and we >> have >> > > to >> > > > know it isn't the stream anymore. >> > > > >> > > > Otherwise we've locked the File use-cases into those where the File >> isn't >> > > > streamed -- i.e. is seekable. See what I'm saying? For reader.cpp >> to >> > > have >> > > > been functional it would have had to read the entire File into a >> buffer >> > > > before parsing, since it could not seek(). This could be easily >> avoided >> > > > with a mandatory EOS in the File format. Basically: >> > > > >> > > > <magic number "ARROW1"> >> > > > <empty padding bytes [to 8 byte boundary]> >> > > > <STREAMING FORMAT> >> > > > *<EOS if not in stream>* >> > > > <FOOTER> >> > > > <FOOTER SIZE: int32> >> > > > <magic number "ARROW1"> >> > > > >> > > > -John >> > > > >> > > > On Tue, May 21, 2019 at 4:44 PM Wes McKinney <wesmck...@gmail.com> >> > > wrote: >> > > > >> > > >> hi John, >> > > >> >> > > >> I'm not sure I follow. The EOS you're referring to is part of the >> > > >> streaming format. It's designed to be readable using an InputStream >> > > >> interface that does not support seeking at all. You can see the >> core >> > > >> logic where messages are popped off the InputStream here >> > > >> >> > > >> >> https://github.com/apache/arrow/blob/6f80ea4928f0d26ca175002f2e9f51 >> > > >> 1962c8b012/cpp/src/arrow/ipc/message.cc#L281 >> > > >> >> > > >> If the end of the byte stream is reached, or EOS (0) is >> encountered, >> > > >> then the stream reader stops iteration. >> > > >> >> > > >> - Wes >> > > >> >> > > >> On Tue, May 21, 2019 at 4:34 PM John Muehlhausen <j...@jgm.org> >> wrote: >> > > >> > >> > > >> > https://arrow.apache.org/docs/format/IPC.html#file-format >> > > >> > >> > > >> > <EOS [optional]: int32> >> > > >> > >> > > >> > If this stream marker is optional in the file format, doesn't >> this >> > > >> prevent >> > > >> > someone from reading the file without being able to seek() it, >> e.g. if >> > > >> it >> > > >> > is "piped in" to a program? Or otherwise they'll have to stream >> in >> > > the >> > > >> > entire thing before they can start parsing? >> > > >> > >> > > >> > Any reason it can't be mandatory for a File? >> > > >> > >> > > >> > -John >> > > >> >> > > > >> > > >> >