On Tue, Apr 21, 2020 at 6:34 AM Yue Ni <niyue....@gmail.com> wrote:
>
> Hi there,
>
> I am currently using gandiva C++ library doing projection/selection for
> Arrow record batch, in my record batch, I have some fields encoded with
> dictionary encoding, I wonder how I can apply gandiva functions for these
> dictionary encoded fields.
>
> Currently, there is no gandiva function having signature supporting
> dictionary array, and if I tried using the dictionary array's value type to
> compose a gandiva function expression and create a projector, it will
> report "Field definition in schema my_field dictionary<values=string,
> indices=int8, ordered=0> different from field in expression
> my_field:string", which is expected.
>
> I would like to know how to solve this problem in arrow/gandiva, more
> specifically:
> 1) Do I need to convert a dictionary array into a non dictionary
> encoded array for applying such a projection?

Currently, yes.

> 2) Is there any API in Arrow that allows me to convert a dictionary array
> into a non dictionary encoded array easily?

Yes, use arrow::compute::Cast with the dense type (the dictionary's value
type) as the target type.

> 3) Initially I thought Dictionary Array could be accessed with similar API
> like other arrays since dictionary encoding seems to me a mechanism for
> organizing the data internally in the array, and I expect I can access the
> value in the dictionary array like other normal arrays for example,
> dict_array->Value(i), but it turns out users need to use a different API to
> access the values in dictionary (get the indices/dictionaries and then
> retrieve the value). Because of this API difference, other clients for the
> arrow API have to handle dictionary array/normal array differently, is
> there any approach/plan to make this transparent to the API clients?

There's no plan that I'm aware of, but you are welcome to propose one.

> Thanks.