My patch for this is finally up:

https://github.com/apache/arrow/pull/4316

It was kind of a bloodbath, but I think this puts us on a sustainable
path and unlocks a lot of efforts that we've been blocked on.
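
For readers skimming the thread, here is a minimal sketch of the idea
discussed in the quoted message below: the type carries only schema-level
information, the dictionary values live with the array, and checking that
several arrays share one dictionary is a cheap comparison. Everything here is
a hypothetical pure-Python illustration (the field names, the string type
tags, and the same_dictionary helper are assumptions), not the actual Arrow
C++ or Python API from the patch.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class DictionaryType:
    # Schema-level information only: no dictionary values live here,
    # so the type stays small no matter how large the dictionary gets.
    index_type: str
    value_type: str
    ordered: bool = False


@dataclass
class DictionaryArray:
    # The dictionary values travel with the array data, not with the type.
    type: DictionaryType
    indices: List[int]
    dictionary: List[str]

    def decode(self) -> List[str]:
        # Map each index back to its dictionary value.
        return [self.dictionary[i] for i in self.indices]


def same_dictionary(arrays: Sequence[DictionaryArray]) -> bool:
    # Hypothetical helper: True when every array carries an equal dictionary.
    if not arrays:
        return True
    first = arrays[0].dictionary
    return all(a.dictionary is first or a.dictionary == first
               for a in arrays[1:])


dict_type = DictionaryType(index_type="int8", value_type="string")
chunk1 = DictionaryArray(dict_type, [0, 1, 0], ["a", "b"])
chunk2 = DictionaryArray(dict_type, [1, 1], ["a", "b"])
# A later chunk whose dictionary has grown; its type is unchanged.
chunk3 = DictionaryArray(dict_type, [2, 0], ["a", "b", "c"])
```

In this sketch all three chunks share one type even though the third has a
larger dictionary, which is the "variable dictionary" case; the fixed case is
just the situation where same_dictionary returns True.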

On Mon, May 13, 2019 at 10:01 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> As I've ventured further in working on this I've realized that it's
> not practical (or even a good idea) to continue to maintain the "fixed
> dictionary" path. Since the IPC protocol can have evolving
> dictionaries, nearly all code paths in the codebase have to change to
> work for the variable case, which leaves little to no code left that
> is specialized for the "fixed" dictionary case. Additionally,
> verifying that a collection of arrays all have the same dictionary is
> an inexpensive operation.
>
> Taking a step back, though, the fundamental design flaw with the
> current DictionaryType is that it puts "data in the schema". This
> means that the schema can have essentially unbounded size on the wire,
> and we've seen issues with Flight already where potentially large
> schemas with dictionaries can be serialized more often than they need
> to be. In general, I believe we should not put any data-dependent data
> in the schema -- data belongs to the Array / RecordBatch data structures.
>
> Unfortunately, the consequences of this will be an unavoidable hard
> API break in 0.14 (and in the bindings also) because constructors for
> DictionaryArray and DictionaryType have to change. I wouldn't propose
> a disruptive change like this unless I strongly believed it to be the
> correct way forward. Otherwise, behaviors (e.g. conversions to/from
> pandas, etc.) should remain unchanged.
>
> In any case, I hope to have a patch up with the C++ and Python changes
> later today or sometime tomorrow for further discussion.
>
> Thanks
> Wes
>
