I will restructure the existing PR and create new ones (with JIRA) for JS/Java. Just haven't gotten around to it yet.
On Wed, May 22, 2019 at 9:10 PM Wes McKinney <wesmck...@gmail.com> wrote: > Yes, I think separate JIRA issues for Java and JS would be best. I'd > recommend having one patch for each, so maybe we can just sort out C++ > for now > > On Wed, May 22, 2019 at 3:03 PM John Muehlhausen <j...@jgm.org> wrote: > > > > I added this to https://github.com/apache/arrow/pull/4372 and am hoping > CI > > will test it for me. Do Java/JS require separate JIRA entries? > > > > On Wed, May 22, 2019 at 2:53 PM Wes McKinney <wesmck...@gmail.com> > wrote: > > > > > It doesn't appear that Java writes an EOS (4-byte 0) after the last > > > record batch in ArrowFileWriter.java [1] so may want to open a JIRA to > > > make this simple change. > > > > > > Note that this addition (not even a change) is backward compatible, > > > old readers will be unaffected > > > > > > [1]: > > > > https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowFileWriter.java#L67 > > > > > > On Wed, May 22, 2019 at 12:24 PM John Muehlhausen <j...@jgm.org> wrote: > > > > > > > > https://github.com/apache/arrow/pull/4372 > > > > > > > > First contribution attempt... sorry in advance if I'm not coloring > inside > > > > the lines! > > > > > > > > On Wed, May 22, 2019 at 9:06 AM John Muehlhausen <j...@jgm.org> > wrote: > > > > > > > > > I will submit a patch once I get set up for that. My crystal ball > says > > > > > that some people will rely on sequential rather than seeking > access to > > > File > > > > > data. E.g. a utility that can either write the same data to a > File or > > > to > > > > > stdout, piped to another process that can read a File from stdin > and > > > > > therefore can't seek() it. Or someone listening to a file with > > > inotify() > > > > > as it is being written and not wanting to trip over the footer > once the > > > > > writer writes it. As a couple of examples. In any case we have an > > > > > end-of-stream marker so might as well use it. :) > > > > > > > > > > On Wed, May 22, 2019 at 8:56 AM Wes McKinney <wesmck...@gmail.com> > > > wrote: > > > > > > > > > >> The file format isn't intended to be used in a streaming setting. > Use > > > > >> RecordBatchStreamWriter if you need to be able to read a dataset > as a > > > > >> stream > > > > >> > > > > >> That being said I don't have a problem with writing the EOS in the > > > > >> file, but the current implementation is not "wrong". > > > > >> > > > > >> On Wed, May 22, 2019 at 8:37 AM John Muehlhausen <j...@jgm.org> > wrote: > > > > >> > > > > > >> > I believe the change involves updating the File format notes as > > > above, > > > > >> as > > > > >> > well as something like the following. The format also mentions > > > "there > > > > >> is > > > > >> > no requirement that dictionary keys should be defined in a > > > > >> DictionaryBatch > > > > >> > before they are used in a RecordBatch, as long as the keys are > > > defined > > > > >> > somewhere in the file." I'm not sure this is a good idea if the > > > file is > > > > >> > enormous. The option to process a file as a stream is how we > keep > > > > >> memory > > > > >> > usage reasonable? > > > > >> > > > > > >> > I will try to figure out a workflow for recommending changes to > the > > > code > > > > >> > base.... merge requests etc. Any tips appreciated. > > > > >> > > > > > >> > Status Close() override { > > > > >> > * // Close stream for compatibility with sequential readers* > > > > >> > * RETURN_NOT_OK(UpdatePosition());* > > > > >> > * int32_t end_of_stream_marker = 0;* > > > > >> > * RETURN_NOT_OK(Write(&end_of_stream_marker, > sizeof(int32_t)));* > > > > >> > > > > > >> > // Write file footer > > > > >> > RETURN_NOT_OK(UpdatePosition()); > > > > >> > int64_t initial_position = position_; > > > > >> > RETURN_NOT_OK(WriteFileFooter(*schema_, dictionaries_, > > > > >> record_batches_, > > > > >> > sink_)); > > > > >> > > > > > >> > // Write footer length > > > > >> > RETURN_NOT_OK(UpdatePosition()); > > > > >> > int32_t footer_length = static_cast<int32_t>(position_ - > > > > >> > initial_position); > > > > >> > if (footer_length <= 0) { > > > > >> > return Status::Invalid("Invalid file footer"); > > > > >> > } > > > > >> > > > > > >> > RETURN_NOT_OK(Write(&footer_length, sizeof(int32_t))); > > > > >> > > > > > >> > // Write magic bytes to end file > > > > >> > return Write(kArrowMagicBytes, strlen(kArrowMagicBytes)); > > > > >> > } > > > > >> > > > > > >> > > > > > >> > On Tue, May 21, 2019 at 11:19 PM Micah Kornfield < > > > emkornfi...@gmail.com > > > > >> > > > > > >> > wrote: > > > > >> > > > > > >> > > This seems like a reasonable change. Is there any reason > that we > > > > >> shouldnt > > > > >> > > always append EOS? > > > > >> > > > > > > >> > > On Tuesday, May 21, 2019, John Muehlhausen <j...@jgm.org> > wrote: > > > > >> > > > > > > >> > > > Wes, > > > > >> > > > > > > > >> > > > Check out reader.cpp. It seg faults when it gets to the > next > > > > >> > > > message-that-is-not-a-message... it is a footer. But I > have no > > > way > > > > >> to > > > > >> > > > know this in reader.cpp because I'm piping the File in via > > > stdin. > > > > >> > > > > > > > >> > > > In seeker.cpp I seek to the end and figure out where the > footer > > > is > > > > >> (this > > > > >> > > > is a py-arrow-written file) and indeed it is at the offset > > > where my > > > > >> > > > "streamed File" reader bombed out. If EOS were mandatory at > > > this > > > > >> > > location > > > > >> > > > it would have been fine... I would have said "oh, time for > the > > > > >> footer!" > > > > >> > > > > > > > >> > > > Basically what I'm saying is that we can't assume that File > > > won't be > > > > >> > > > processed as a stream. In an actual non-file stream it is > > > either > > > > >> EOS or > > > > >> > > > end-of-stream. But with a file-as-stream there is more data > > > and we > > > > >> have > > > > >> > > to > > > > >> > > > know it isn't the stream anymore. > > > > >> > > > > > > > >> > > > Otherwise we've locked the File use-cases into those where > the > > > File > > > > >> isn't > > > > >> > > > streamed -- i.e. is seekable. See what I'm saying? For > > > reader.cpp > > > > >> to > > > > >> > > have > > > > >> > > > been functional it would have had to read the entire File > into a > > > > >> buffer > > > > >> > > > before parsing, since it could not seek(). This could be > easily > > > > >> avoided > > > > >> > > > with a mandatory EOS in the File format. Basically: > > > > >> > > > > > > > >> > > > <magic number "ARROW1"> > > > > >> > > > <empty padding bytes [to 8 byte boundary]> > > > > >> > > > <STREAMING FORMAT> > > > > >> > > > *<EOS if not in stream>* > > > > >> > > > <FOOTER> > > > > >> > > > <FOOTER SIZE: int32> > > > > >> > > > <magic number "ARROW1"> > > > > >> > > > > > > > >> > > > -John > > > > >> > > > > > > > >> > > > On Tue, May 21, 2019 at 4:44 PM Wes McKinney < > > > wesmck...@gmail.com> > > > > >> > > wrote: > > > > >> > > > > > > > >> > > >> hi John, > > > > >> > > >> > > > > >> > > >> I'm not sure I follow. The EOS you're referring to is part > of > > > the > > > > >> > > >> streaming format. It's designed to be readable using an > > > InputStream > > > > >> > > >> interface that does not support seeking at all. You can > see the > > > > >> core > > > > >> > > >> logic where messages are popped off the InputStream here > > > > >> > > >> > > > > >> > > >> > > > > >> > https://github.com/apache/arrow/blob/6f80ea4928f0d26ca175002f2e9f51 > > > > >> > > >> 1962c8b012/cpp/src/arrow/ipc/message.cc#L281 > > > > >> > > >> > > > > >> > > >> If the end of the byte stream is reached, or EOS (0) is > > > > >> encountered, > > > > >> > > >> then the stream reader stops iteration. > > > > >> > > >> > > > > >> > > >> - Wes > > > > >> > > >> > > > > >> > > >> On Tue, May 21, 2019 at 4:34 PM John Muehlhausen < > j...@jgm.org> > > > > >> wrote: > > > > >> > > >> > > > > > >> > > >> > > https://arrow.apache.org/docs/format/IPC.html#file-format > > > > >> > > >> > > > > > >> > > >> > <EOS [optional]: int32> > > > > >> > > >> > > > > > >> > > >> > If this stream marker is optional in the file format, > doesn't > > > > >> this > > > > >> > > >> prevent > > > > >> > > >> > someone from reading the file without being able to > seek() > > > it, > > > > >> e.g. if > > > > >> > > >> it > > > > >> > > >> > is "piped in" to a program? Or otherwise they'll have to > > > stream > > > > >> in > > > > >> > > the > > > > >> > > >> > entire thing before they can start parsing? > > > > >> > > >> > > > > > >> > > >> > Any reason it can't be mandatory for a File? > > > > >> > > >> > > > > > >> > > >> > -John > > > > >> > > >> > > > > >> > > > > > > > >> > > > > > > >> > > > > > > > > >