IIRC, the main issue the community did not want to tackle at the time was
determining, under random access, which dictionary applies to which
particular record batch. The IPC File Footer [1] does not contain enough
information to do this without resorting to heuristics.
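
To illustrate, here is a rough Python sketch of the problem, not Arrow's
actual reader code. The Block fields mirror the Block struct in File.fbs
(offset, metaDataLength, bodyLength); the Footer class is trimmed to the two
block lists, and guess_dictionary_blocks is a hypothetical helper showing
one possible offset-based heuristic, which is my assumption and not
something the format defines.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    # Mirrors the Block struct in File.fbs: location and sizes only.
    # Note that there is no dictionary ID and no isDelta flag here.
    offset: int
    meta_data_length: int
    body_length: int


@dataclass
class Footer:
    # Trimmed model of the footer: schema, version and custom metadata omitted.
    dictionaries: List[Block]
    record_batches: List[Block]


def guess_dictionary_blocks(footer: Footer, batch_index: int) -> List[Block]:
    """Hypothetical heuristic: treat every dictionary block written at a
    lower file offset than the record batch as a candidate for it."""
    batch = footer.record_batches[batch_index]
    return [d for d in footer.dictionaries if d.offset < batch.offset]
```

With two dictionary blocks at offsets 8 and 1000 and record batches at
offsets 200 and 1200, the second batch gets both dictionary blocks as
candidates. The footer alone cannot say whether the second block replaces
the first or belongs to a different dictionary ID; a reader would have to
load and decode each dictionary message to find out.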

Thanks,
Micah

[1] https://github.com/apache/arrow/blob/main/format/File.fbs#L26

On Wed, Oct 11, 2023 at 4:52 PM Chris Larsen <[email protected]>
wrote:

> Hi folks,
>
> The IPC file format notes that it is "invalid to have more than one
> non-delta dictionary batch per dictionary ID (i.e. dictionary replacement
> is not supported)" but there is the "isDelta" flag that indicates
> replacement dictionaries are supported. However it isn't clear that this
> only applies to streams.
>
> I've tried finding context around this [2] [3] [4], but I think there is
> another use case: I want to be able to stream data in blocks to a file
> system and then, on read, process each data block and its associated
> dictionary in parallel. Dictionary replacement helps with the parallel read
> case in that each data block can load its associated dictionary blocks
> without having to read every dictionary that precedes that data block.
>
> Given the delta flag, is there any reason not to support replacement
> dictionaries in the file format?
>
> [1] https://github.com/apache/arrow/pull/5585/files#diff-8b68cf6859e881f2357f5df64bb073135d7ff6eeb51f116418660b3856564c60R1027-R1030
> [2] https://github.com/apache/arrow/issues/22842
> [3] https://lists.apache.org/thread/2h3o1kbk0t9l16wxp51wdtnz16yqg03d
> [4] https://lists.apache.org/thread/31910z7g64np3dmblokbh1llmxgt74y7
>
