hi Paul,

Dictionary-encoded is not a nested type, so there shouldn't be any
children -- the IPC layout of a dictionary encoded field is that same
as the type of the indices (probably want to change the terminology in
the Rust library from "keys" to "indices" which is what's used in the
specification). And the dictionaries are sent separately in record
batches

> The Rust implementation of DictionaryArray keeps the values as a child to the 
> keys array.

What we do in C++ is that the dictionary is a separate member of the
ArrayData data structure. I don't know enough about the Rust library
to know whether that's the right design choice or not.

Implementing Rust support for integration tests would make it much
easier to validate that things are implemented correctly. Does anyone
know how far away the Rust library might be from having them? It's not
the most fun thing to implement, but it's very important.

- Wes

On Thu, Apr 9, 2020 at 5:08 PM Paul Dix <p...@influxdata.com> wrote:
>
> I managed to get something up and running. I ended up creating a
> dictionary_batch.rs and adding that to convert.rs to translate dictionary
> fields in a schema over to the correct fb thing. I also added a method to
> writer.rs to convert that to bytes so it can be sent via ipc. However, when
> writing out the subsequent record batches that contain a dictionary field
> that references the dictionary sent in the dictionary batch, it runs into a
> problem. The Rust implementation of DictionaryArray keeps the values as a
> child to the keys array.
>
> You can see the chain of calls when calling finish on the
> StringDictionaryBuilder
> https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L1312-L1316
> which calls out to finish the dictionary
> https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L464-L482
> which then calls from to create the DictionaryArray
> https://github.com/apache/arrow/blob/master/rust/arrow/src/array/array.rs#L1900-L1927
> And that last part validates that the child is there.
>
> And here in the writer is where it blindly writes out any children:
> https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/writer.rs#L366-L378
>
> I found that when I sent this out, the Python reader on the other side
> would choke on it when I tried to read_pandas. However, if I skip
> serializing the children for DictionaryArray only, then everything works
> exactly as expected.
>
> So I'm not sure what the right thing here is because I don't know if
> DictionaryArrays should not include the children (i.e. values) when being
> written out or if there's some other thing that should be happening. Should
> the raw arrow data have the values in that dictionary? Or does it get
> reconstituted manually on the other side from the DictionaryBatch?
>
> On Wed, Apr 8, 2020 at 12:05 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > As another item for consideration -- in C++ at least, the dictionary
> > id is dealt with as an internal detail of the IPC message production
> > process. When serializing the Schema, id's are assigned to each
> > dictionary-encoded field in the DictionaryMemo object, see
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/dictionary.h
> >
> > When record batches are reconstructed, the dictionary corresponding to
> > an id at the time of reconstruction is set in the Array's internal
> > data -- that's the "dictionary" member of the ArrayData object
> > (https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L231).
> >
> > On Tue, Apr 7, 2020 at 1:22 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > hey Paul,
> > >
> > > Take a look at how dictionaries work in the IPC protocol
> > >
> > >
> > https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#serialization-and-interprocess-communication-ipc
> > >
> > > Dictionaries are sent as separate messages. When a field is tagged as
> > > dictionary encoded in the schema, the IPC reader must keep track of
> > > the dictionaries it's seen come across the protocol and then set them
> > > in the reconstructed record batches when a record batch comes through.
> > >
> > > Note that the protocol now supports dictionary deltas (dictionaries
> > > can be appended to by subsequent messages for the same dictionary id)
> > > and replacements (new dictionary for an id).
> > >
> > > I don't know what the status of handling dictionaries in the Rust IPC,
> > > but it would be a good idea to take time to take into account the
> > > above details.
> > >
> > > Finally, note that Rust is not participating in either the regular IPC
> > > nor Flight integration tests. This is an important milestone to being
> > > able to depend on the Rust library in production.
> > >
> > > Thanks
> > > Wes
> > >
> > > On Tue, Apr 7, 2020 at 10:36 AM Paul Dix <p...@influxdata.com> wrote:
> > > >
> > > > Hello,
> > > > I'm trying to build a Rust based Flight server and I'd like to use
> > > > Dictionary encoding for a number of string columns in my data. I've
> > seen
> > > > that StringDictionary was recently added to Rust here:
> > > >
> > https://github.com/apache/arrow/commit/c7a7d2dcc46ed06593b994cb54c5eaf9ccd1d21d#diff-72812e30873455dcee2ce2d1ee26e4ab
> > .
> > > >
> > > > However, that doesn't seem to reach down into Flight. When I attempt to
> > > > send a schema through flight that has a Dictionary<UInt8, Utf8> it
> > throws
> > > > an error when attempting to convert from the Rust type to the
> > Flatbuffer
> > > > field type. I figured I'd take a swing at adding that to convert.rs
> > here:
> > > >
> > https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/convert.rs#L319
> > > >
> > > > However, when I look at the definitions in Schema.fbs and the related
> > > > generated Rust file, Dictionary isn't a type there. Should I be sending
> > > > this down as some other composed type? And if so, how does this look
> > at the
> > > > client side of things? In my test I'm connecting to the Flight server
> > via
> > > > PyArrow and working with it in Pandas so I'm hoping that it will be
> > able to
> > > > consume Dictionary fields.
> > > >
> > > > Separately, the Rust field type doesn't have a spot for the dictionary
> > ID,
> > > > which I assume I'll need to send down so it can be consumed on the
> > client.
> > > > Would appreciate any thoughts on that. A little push in the right
> > direction
> > > > and I'll be happy to submit a PR to help push the Rust Flight
> > > > implementation farther along.
> > > >
> > > > Thanks,
> > > > Paul
> >
>
>
> --
>
> Paul Dix
>
> Founder & CTO • InfluxData
>
> m 646.283.6377     t @pauldix     w influxdata.com
> <https://www.influxdata.com/>

Reply via email to