My patch for this is finally up https://github.com/apache/arrow/pull/4316
It was kind of a bloodbath, but I think this puts us on a sustainable
path and unlocks a lot of efforts that we've been blocked on.

On Mon, May 13, 2019 at 10:01 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> As I've ventured further in working on this I've realized that it's
> not practical (or even a good idea) to continue to maintain the "fixed
> dictionary" path. Since the IPC protocol can have evolving
> dictionaries, nearly all code paths in the codebase have to change to
> work for the variable case, which leaves little to no code left that
> is specialized for the "fixed" dictionary case. Additionally,
> verifying that a collection of arrays all have the same dictionary is
> an inexpensive operation.
>
> Taking a step back, though, the fundamental design flaw with the
> current DictionaryType is that it puts "data in the schema". This
> means that the schema can have essentially unbounded size on the wire,
> and we've seen issues with Flight already where potentially large
> schemas with dictionaries can be serialized more than they need to be.
> In general, I believe we should not put any data-dependent data in the
> schema -- data belongs to the Array / RecordBatch data structures.
>
> Unfortunately, the consequences of this will be an unavoidable hard
> API break in 0.14 (and in the bindings also) because constructors for
> DictionaryArray and DictionaryType have to change. I wouldn't propose
> a disruptive change like this unless I strongly believed it to be the
> correct way forward. Otherwise behaviors (e.g. conversions to/from
> pandas, etc.) should remain unchanged.
>
> In any case, I hope to have a patch up with the C++ and Python changes
> later today or sometime tomorrow for further discussion.
>
> Thanks
> Wes