If dictionary replacements were supported, then the IPC file format couldn't guarantee random access reads.
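To make the random-access point concrete, here is a rough Python sketch. It does not use the Arrow libraries: the message tuples and the helper names (`decode_batch_stream`, `decode_batch_file`) are made up for illustration, and dictionaries are modeled as plain lists.

```python
# Toy model of an IPC stream: each message is ("dictionary", values) or
# ("batch", indices). All names here are hypothetical, not Arrow APIs.
messages = [
    ("dictionary", ["a", "b"]),
    ("batch", [0, 1, 1]),
    ("dictionary", ["x", "y"]),  # replacement: overrides the first dictionary
    ("batch", [0, 1, 1]),
]


def decode_batch_stream(messages, position):
    """Decode the batch at `position` when replacements are allowed.

    The dictionary in effect depends on every message before `position`,
    so we have to scan from the start: O(n), not random access.
    """
    current = None
    for kind, payload in messages[:position]:
        if kind == "dictionary":
            current = payload
    kind, indices = messages[position]
    assert kind == "batch"
    return [current[i] for i in indices]


def decode_batch_file(dictionary, batches, i):
    """Decode batch `i` when one dictionary is fixed for the whole file.

    Nothing before batch `i` has to be inspected, so any batch can be
    read directly.
    """
    return [dictionary[j] for j in batches[i]]


first = decode_batch_stream(messages, 1)
second = decode_batch_stream(messages, 3)
# With a unified dictionary, the second batch's indices are rewritten up front.
unified = decode_batch_file(["a", "b", "x", "y"], [[0, 1, 1], [2, 3, 3]], 1)
```

In the toy stream the same indices [0, 1, 1] decode to different values depending on which dictionary message came last, and a footer that only lists offsets cannot tell a reader that.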
Personally, I would like to support a stream-based file format that is a series of Flight protobufs. My extension of Arrow Flight stuffs our state-based data into the app_metadata field of the FlightData object, so we can't write a stream down natively in the IPC file format (for testing, or for sharing the reproduction of an error). In particular, the IPC format is built around the flatbuffer payloads rather than the Flight protobuf payloads.

It might be nice to support an additional type of IPC file for stateful streams. If there is interest, it would be easy to integrate with the existing code by using a different magic field in the footer (such as 'FLGHT1' instead of 'ARROW1'). In addition to the offsets and sizes of the payloads, it might be nice to record the type of each payload (RecordBatch vs. DictionaryBatch, etc.). We wouldn't have O(1) random access, but in the "replay of a stream" scenario one probably isn't looking for random access anyway.

On Thu, Mar 18, 2021 at 9:07 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hmm, I noticed this "The IPC file format doesn't support dictionary
> replacements or deltas." I was under the impression we aimed to support
> dictionary deltas in the file format. If not we should remove "Delta
> dictionaries are applied in the order they appear in the file footer." from
> the specification.
>
> On Thu, Mar 18, 2021 at 8:48 AM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > It's a bit more configurable, but basically yes. See the IPC write
> > options:
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/options.h#L73
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 18/03/2021 à 16:37, Jacob Quinn a écrit :
> > > Ah, interesting. So to make sure I understand correctly, the C++ write
> > > implementation will scan all "batches" and unify all dictionary values
> > > before writing out the schema + dictionary messages? But only when writing
> > > the file format? In the streaming case, it would still write
> > > replacement/delta dictionary messages as needed.
> > >
> > > -Jacob
> > >
> > > On Thu, Mar 18, 2021 at 9:10 AM Neal Richardson <neal.p.richard...@gmail.com>
> > > wrote:
> > >
> > >> Somewhat related issue: https://issues.apache.org/jira/browse/ARROW-10406
> > >>
> > >> On Wed, Mar 17, 2021 at 11:22 PM Micah Kornfield <emkornfi...@gmail.com>
> > >> wrote:
> > >>
> > >>> BTW, this nuance always felt a little strange to me, but would have
> > >>> required adding additional information to the file format, to disambiguate
> > >>> when exactly a dictionary was intended to be replaced.
> > >>>
> > >>> On Wed, Mar 17, 2021 at 11:19 PM Micah Kornfield <emkornfi...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> Hi Jacob,
> > >>>> There is nuance. The file format does not support dictionary replacement,
> > >>>> the specification [1] why that is currently the case. Only the "stream
> > >>>> format" supports replacement (i.e. no magic number, only schema followed by
> > >>>> one or more dictionary/record-batch messages).
> > >>>>
> > >>>> -Micah
> > >>>>
> > >>>> [1] https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
> > >>>>
> > >>>> On Wed, Mar 17, 2021 at 11:04 PM Jacob Quinn <quinn.jac...@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> Had an issue come up here:
> > >>>>> https://github.com/JuliaData/Arrow.jl/issues/129#issuecomment-777350450 .
> > >>>>> From the implementation status page, it says C++ supports replacement
> > >>>>> dictionaries and that python tracks the C++ implementation. Is this just a
> > >>>>> pyarrow issue where it specifically doesn't support replacement
> > >>>>> dictionaries? Or it's not "hooked in" properly?
> > >>>>>
> > >>>>> -Jacob