I managed to get something up and running. I ended up creating a
dictionary_batch.rs and adding that to convert.rs to translate dictionary
fields in a schema over to the correct fb thing. I also added a method to
writer.rs to convert that to bytes so it can be sent via ipc. However, when
writing out the subsequent record batches that contain a dictionary field
that references the dictionary sent in the dictionary batch, it runs into a
problem. The Rust implementation of DictionaryArray keeps the values as a
child to the keys array.

You can see the chain of calls when calling finish on the
StringDictionaryBuilder
https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L1312-L1316
which calls out to finish the dictionary
https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L464-L482
which then calls from to create the DictionaryArray
https://github.com/apache/arrow/blob/master/rust/arrow/src/array/array.rs#L1900-L1927
And that last part validates that the child is there.

And here in the writer is where it blindly writes out any children:
https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/writer.rs#L366-L378

I found that when I sent this out, the Python reader on the other side
would choke on it when I tried to read_pandas. However, if I skip
serializing the children for DictionaryArray only, then everything works
exactly as expected.

So I'm not sure what the right thing here is because I don't know if
DictionaryArrays should not include the children (i.e. values) when being
written out or if there's some other thing that should be happening. Should
the raw arrow data have the values in that dictionary? Or does it get
reconstituted manually on the other side from the DictionaryBatch?

On Wed, Apr 8, 2020 at 12:05 AM Wes McKinney <wesmck...@gmail.com> wrote:

> As another item for consideration -- in C++ at least, the dictionary
> id is dealt with as an internal detail of the IPC message production
> process. When serializing the Schema, id's are assigned to each
> dictionary-encoded field in the DictionaryMemo object, see
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/dictionary.h
>
> When record batches are reconstructed, the dictionary corresponding to
> an id at the time of reconstruction is set in the Array's internal
> data -- that's the "dictionary" member of the ArrayData object
> (https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L231).
>
> On Tue, Apr 7, 2020 at 1:22 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > hey Paul,
> >
> > Take a look at how dictionaries work in the IPC protocol
> >
> >
> https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#serialization-and-interprocess-communication-ipc
> >
> > Dictionaries are sent as separate messages. When a field is tagged as
> > dictionary encoded in the schema, the IPC reader must keep track of
> > the dictionaries it's seen come across the protocol and then set them
> > in the reconstructed record batches when a record batch comes through.
> >
> > Note that the protocol now supports dictionary deltas (dictionaries
> > can be appended to by subsequent messages for the same dictionary id)
> > and replacements (new dictionary for an id).
> >
> > I don't know what the status of handling dictionaries in the Rust IPC,
> > but it would be a good idea to take time to take into account the
> > above details.
> >
> > Finally, note that Rust is not participating in either the regular IPC
> > nor Flight integration tests. This is an important milestone to being
> > able to depend on the Rust library in production.
> >
> > Thanks
> > Wes
> >
> > On Tue, Apr 7, 2020 at 10:36 AM Paul Dix <p...@influxdata.com> wrote:
> > >
> > > Hello,
> > > I'm trying to build a Rust based Flight server and I'd like to use
> > > Dictionary encoding for a number of string columns in my data. I've
> seen
> > > that StringDictionary was recently added to Rust here:
> > >
> https://github.com/apache/arrow/commit/c7a7d2dcc46ed06593b994cb54c5eaf9ccd1d21d#diff-72812e30873455dcee2ce2d1ee26e4ab
> .
> > >
> > > However, that doesn't seem to reach down into Flight. When I attempt to
> > > send a schema through flight that has a Dictionary<UInt8, Utf8> it
> throws
> > > an error when attempting to convert from the Rust type to the
> Flatbuffer
> > > field type. I figured I'd take a swing at adding that to convert.rs
> here:
> > >
> https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/convert.rs#L319
> > >
> > > However, when I look at the definitions in Schema.fbs and the related
> > > generated Rust file, Dictionary isn't a type there. Should I be sending
> > > this down as some other composed type? And if so, how does this look
> at the
> > > client side of things? In my test I'm connecting to the Flight server
> via
> > > PyArrow and working with it in Pandas so I'm hoping that it will be
> able to
> > > consume Dictionary fields.
> > >
> > > Separately, the Rust field type doesn't have a spot for the dictionary
> ID,
> > > which I assume I'll need to send down so it can be consumed on the
> client.
> > > Would appreciate any thoughts on that. A little push in the right
> direction
> > > and I'll be happy to submit a PR to help push the Rust Flight
> > > implementation farther along.
> > >
> > > Thanks,
> > > Paul
>


-- 

Paul Dix

Founder & CTO • InfluxData

m 646.283.6377     t @pauldix     w influxdata.com
<https://www.influxdata.com/>

Reply via email to