I am also under the impression that the file format is supposed to support deltas, but not replacements. Is this not implemented in C++?
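To make the distinction concrete, here is a minimal toy model of why deltas are compatible with random access while replacements are not. It is not the Arrow API: `DictionaryMessage`, `RecordBatch`, `replay` and `random_access` are hypothetical names, and the sketch assumes a file reader that applies every dictionary message up front before decoding any single batch.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class DictionaryMessage:
    values: List[str]
    is_delta: bool  # True: append to the current dictionary; False: replace it


@dataclass
class RecordBatch:
    indices: List[int]  # indices into the dictionary in effect for this batch


Message = Union[DictionaryMessage, RecordBatch]


def apply_dictionary(current: List[str], msg: DictionaryMessage) -> List[str]:
    """Return the dictionary state after applying one dictionary message."""
    return current + msg.values if msg.is_delta else list(msg.values)


def replay(messages: List[Message]) -> List[List[str]]:
    """Stream-style read: decode each batch with the dictionary seen so far."""
    dictionary: List[str] = []
    decoded = []
    for msg in messages:
        if isinstance(msg, DictionaryMessage):
            dictionary = apply_dictionary(dictionary, msg)
        else:
            decoded.append([dictionary[i] for i in msg.indices])
    return decoded


def random_access(messages: List[Message], batch_number: int) -> List[str]:
    """File-style read (assumed): apply all dictionaries, then decode one batch."""
    dictionary: List[str] = []
    for msg in messages:
        if isinstance(msg, DictionaryMessage):
            dictionary = apply_dictionary(dictionary, msg)
    batches = [m for m in messages if isinstance(m, RecordBatch)]
    return [dictionary[i] for i in batches[batch_number].indices]


with_delta = [
    DictionaryMessage(["a", "b"], is_delta=False),
    RecordBatch([0, 1]),
    DictionaryMessage(["c"], is_delta=True),
    RecordBatch([2, 0]),
]
with_replacement = [
    DictionaryMessage(["a", "b"], is_delta=False),
    RecordBatch([0, 1]),
    DictionaryMessage(["x", "y"], is_delta=False),
    RecordBatch([1, 0]),
]

# Deltas only append, so old indices stay valid under the final dictionary.
delta_ok = random_access(with_delta, 0) == replay(with_delta)[0]
# A replacement changes what old indices mean, so batch 0 decodes differently.
replacement_ok = random_access(with_replacement, 0) == replay(with_replacement)[0]
```

In this model `delta_ok` is true and `replacement_ok` is false, which matches the argument that replacements would need extra information in the file to say which dictionary applies to which batch.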
On Thu, Mar 18, 2021 at 9:57 PM Nate Bauernfeind <nate.bauernfe...@gmail.com> wrote:

> If dictionary replacements were supported, then the IPC file format
> couldn't guarantee random access reads.
>
> Personally, I would like to support a stream-based file format that is a
> series of the Flight protobufs. In my extension of arrow flight, by
> stuffing our state-based data into the app_metadata field on the FlightData
> object, we can't write down a stream natively in the IPC based file format
> (for testing, or sharing the reproduction of an error). In particular, the
> IPC format is based around the flatbuffer payloads instead of the Flight
> protobuf payloads. It might be nice to support an additional type of IPC
> file for stateful streams. If interested, it would be easy to integrate
> with the existing code using a different magic field in the footer (such as
> 'FLGHT1', instead of 'ARROW1'). In addition to the offsets and sizes of
> payloads, it might be nice to indicate the type of payload (RecordBatch vs
> DictionaryBatch, etc). We wouldn't have O(1) random access, but I think in
> the "replay of a stream" scenario, one probably isn't looking for random
> access anyways.
>
> On Thu, Mar 18, 2021 at 9:07 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > Hmm, I noticed this "The IPC file format doesn't support dictionary
> > replacements or deltas." I was under the impression we aimed to support
> > dictionary deltas in the file format. If not we should remove "Delta
> > dictionaries are applied in the order they appear in the file footer." from
> > the specification.
> >
> > On Thu, Mar 18, 2021 at 8:48 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> > >
> > > It's a bit more configurable, but basically yes. See the IPC write
> > > options:
> > >
> > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/options.h#L73
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On 18/03/2021 at 16:37, Jacob Quinn wrote:
> > > > Ah, interesting. So to make sure I understand correctly, the C++ write
> > > > implementation will scan all "batches" and unify all dictionary values
> > > > before writing out the schema + dictionary messages? But only when writing
> > > > the file format? In the streaming case, it would still write
> > > > replacement/delta dictionary messages as needed.
> > > >
> > > > -Jacob
> > > >
> > > > On Thu, Mar 18, 2021 at 9:10 AM Neal Richardson <neal.p.richard...@gmail.com>
> > > > wrote:
> > > >
> > > >> Somewhat related issue: https://issues.apache.org/jira/browse/ARROW-10406
> > > >>
> > > >> On Wed, Mar 17, 2021 at 11:22 PM Micah Kornfield <emkornfi...@gmail.com>
> > > >> wrote:
> > > >>
> > > >>> BTW, this nuance always felt a little strange to me, but would have
> > > >>> required adding additional information to the file format, to disambiguate
> > > >>> when exactly a dictionary was intended to be replaced.
> > > >>>
> > > >>> On Wed, Mar 17, 2021 at 11:19 PM Micah Kornfield <emkornfi...@gmail.com>
> > > >>> wrote:
> > > >>>
> > > >>>> Hi Jacob,
> > > >>>> There is nuance. The file format does not support dictionary replacement;
> > > >>>> the specification [1] explains why that is currently the case. Only the
> > > >>>> "stream format" supports replacement (i.e. no magic number, only schema
> > > >>>> followed by one or more dictionary/record-batch messages).
> > > >>>>
> > > >>>> -Micah
> > > >>>>
> > > >>>> [1] https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
> > > >>>>
> > > >>>> On Wed, Mar 17, 2021 at 11:04 PM Jacob Quinn <quinn.jac...@gmail.com>
> > > >>>> wrote:
> > > >>>>
> > > >>>>> Had an issue come up here:
> > > >>>>> https://github.com/JuliaData/Arrow.jl/issues/129#issuecomment-777350450
> > > >>>>> .
> > > >>>>> From the implementation status page, it says C++ supports replacement
> > > >>>>> dictionaries and that python tracks the C++ implementation. Is this just a
> > > >>>>> pyarrow issue where it specifically doesn't support replacement
> > > >>>>> dictionaries? Or it's not "hooked in" properly?
> > > >>>>>
> > > >>>>> -Jacob
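
For the "replay of a stream" file idea raised earlier in the thread (a footer with a different magic, plus the offset, size and type of each payload), a rough sketch could look like the following. Only the 'FLGHT1' magic and the idea of recording payload type come from the thread; the JSON footer, the 4-byte little-endian footer length, and the function names are assumptions made purely for illustration, and payloads are treated as opaque bytes instead of real Flight protobufs.

```python
import io
import json
import struct
from typing import List, Tuple

MAGIC = b"FLGHT1"  # magic suggested in the thread; the layout below is hypothetical

Payload = Tuple[str, bytes]  # (payload type, opaque serialized bytes)


def write_stream_file(payloads: List[Payload]) -> bytes:
    """Write payloads back to back, then a footer indexing each one."""
    buf = io.BytesIO()
    buf.write(MAGIC)
    index = []
    for kind, body in payloads:
        # Record type, offset and size so a reader can locate every payload.
        index.append({"kind": kind, "offset": buf.tell(), "size": len(body)})
        buf.write(body)
    footer = json.dumps(index).encode("utf-8")
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))  # footer length, little-endian
    buf.write(MAGIC)
    return buf.getvalue()


def replay_stream_file(data: bytes) -> List[Payload]:
    """Read the footer, then return payloads in their original stream order."""
    if not (data.startswith(MAGIC) and data.endswith(MAGIC)):
        raise ValueError("missing FLGHT1 magic")
    length_end = len(data) - len(MAGIC)
    (footer_len,) = struct.unpack("<I", data[length_end - 4:length_end])
    footer_start = length_end - 4 - footer_len
    index = json.loads(data[footer_start:length_end - 4].decode("utf-8"))
    return [
        (entry["kind"], data[entry["offset"]:entry["offset"] + entry["size"]])
        for entry in index
    ]


example = [
    ("DictionaryBatch", b"dict-0"),
    ("RecordBatch", b"batch-0"),
    ("DictionaryBatch", b"dict-1"),  # could be a replacement; order is preserved
    ("RecordBatch", b"batch-1"),
]
round_tripped = replay_stream_file(write_stream_file(example))
```

Because the reader walks the index in order, dictionary replacements are replayed exactly where they occurred, at the cost of the O(1) random access that the existing file format provides.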