[
https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984836#comment-16984836
]
Andy Thomason edited comment on ARROW-5949 at 11/29/19 9:27 AM:
----------------------------------------------------------------
We should discuss the design for a dictionary type and the necessary
serialisation.
For example, start by adding
{code:java}
Dictionary((Box<DataType>, Box<DataType>)),{code}
to DataType (key and value types).
We may not need the extra Schema dictionary field, as this is integral to the
DataType.
{code:java}
pub struct DictionaryArray {
    keys: ArrayRef,
    values: Vec<ArrayDataRef>,
}
{code}
Note that to support multiple dictionary batches, we need a vector of values,
although in the majority of our use cases we have only used a single
dictionary. An option to concatenate dictionaries might be useful.
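As a rough illustration of what concatenation could involve, here is a minimal standalone sketch. It does not use the Arrow crate: dictionaries are modelled as plain Vec<String>, and the names concat_dictionaries and remap_keys are hypothetical, not existing Arrow APIs. The idea is to merge the per-batch dictionaries into one deduplicated dictionary and keep a remapping table so that each batch's keys can be rewritten.

```rust
use std::collections::HashMap;

/// Merge several dictionaries (one per dictionary batch) into a single
/// deduplicated dictionary. Returns the merged values and, for each input
/// dictionary, a table mapping its old keys to keys in the merged dictionary.
/// Hypothetical helper for illustration only.
pub fn concat_dictionaries(dicts: &[Vec<String>]) -> (Vec<String>, Vec<Vec<u32>>) {
    let mut merged: Vec<String> = Vec::new();
    let mut index: HashMap<String, u32> = HashMap::new();
    let mut remaps = Vec::with_capacity(dicts.len());
    for dict in dicts {
        let mut remap = Vec::with_capacity(dict.len());
        for value in dict {
            let key = match index.get(value) {
                // Value already present in the merged dictionary.
                Some(&k) => k,
                // New value: append it and remember its position.
                None => {
                    let k = merged.len() as u32;
                    merged.push(value.clone());
                    index.insert(value.clone(), k);
                    k
                }
            };
            remap.push(key);
        }
        remaps.push(remap);
    }
    (merged, remaps)
}

/// Rewrite one batch's keys so that they index into the merged dictionary.
pub fn remap_keys(keys: &[u32], remap: &[u32]) -> Vec<u32> {
    keys.iter().map(|&k| remap[k as usize]).collect()
}
```

With dictionaries ["1", "2"] and ["2", "X"], the merged dictionary would be ["1", "2", "X"], and keys from the second batch would be rewritten through its remapping table [1, 2].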
Access is similar to ListArray, except that the index (key) type is variable.
For example, we often have a "chromosome" column whose values are "1", .. "X"
and whose keys therefore fit in a byte.
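To make the chromosome example concrete, here is a small sketch of dictionary encoding with byte-sized keys. ByteDictionary and its methods are assumptions made up for this illustration (plain Vec storage rather than Arrow buffers), not a proposal for the final API.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for a dictionary array with u8 keys and a single
/// dictionary of string values.
pub struct ByteDictionary {
    pub keys: Vec<u8>,
    pub values: Vec<String>,
}

impl ByteDictionary {
    /// Dictionary-encode a string column. Returns None if the column has
    /// more than 256 distinct values, since the keys would not fit in a u8.
    pub fn encode(column: &[&str]) -> Option<Self> {
        let mut values: Vec<String> = Vec::new();
        let mut keys = Vec::with_capacity(column.len());
        for &item in column {
            // Linear search is fine for small dictionaries such as chromosomes.
            let pos = match values.iter().position(|v| v.as_str() == item) {
                Some(p) => p,
                None => {
                    values.push(item.to_string());
                    values.len() - 1
                }
            };
            keys.push(u8::try_from(pos).ok()?);
        }
        Some(ByteDictionary { keys, values })
    }

    /// Look up the decoded value for a row by going through its key.
    pub fn get(&self, row: usize) -> Option<&str> {
        let key = *self.keys.get(row)?;
        self.values.get(key as usize).map(String::as_str)
    }
}
```

Encoding the column ["1", "2", "X", "1", "X"] this way stores three dictionary values and one byte per row.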
Fast access to dictionary components is essential: returning slices for keys
and values per record batch. It would be very useful for all types to have a
rb.get_slice<T>("name") function to get a named, typed slice for an array.
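One possible shape for such an accessor, sketched outside the Arrow crate: the Batch type below is a hypothetical, simplified record batch that stores each column as a type-erased Vec, and get_slice returns a typed slice only when the requested element type matches. A real implementation would work against Arrow buffers and the schema instead.

```rust
use std::any::Any;
use std::collections::HashMap;

/// Hypothetical, simplified record batch: named columns stored as
/// type-erased vectors. Not the Arrow RecordBatch.
pub struct Batch {
    columns: HashMap<String, Box<dyn Any>>,
}

impl Batch {
    pub fn new() -> Self {
        Batch { columns: HashMap::new() }
    }

    /// Store a column under the given name, replacing any existing one.
    pub fn add_column<T: 'static>(&mut self, name: &str, data: Vec<T>) {
        self.columns.insert(name.to_string(), Box::new(data));
    }

    /// Return the named column as a typed slice. Returns None if the column
    /// is missing or holds a different element type.
    pub fn get_slice<T: 'static>(&self, name: &str) -> Option<&[T]> {
        self.columns
            .get(name)?
            .downcast_ref::<Vec<T>>()
            .map(|v| v.as_slice())
    }
}
```

In Rust the call would be written with a turbofish, for example rb.get_slice::<u8>("chromosome"), and asking for the wrong type yields None rather than a panic.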
Andy.
> [Rust] Implement DictionaryArray
> --------------------------------
>
> Key: ARROW-5949
> URL: https://issues.apache.org/jira/browse/ARROW-5949
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Rust
> Reporter: David Atienza
> Priority: Major
>
> I am pretty new to the codebase, but I have seen that DictionaryArray is not
> implemented in the Rust implementation.
> I went through the list of issues and I could not see any work on this. Is
> there any blocker?
>
> The specification is a bit
> [short|https://arrow.apache.org/docs/format/Layout.html#dictionary-encoding]
> or even
> [non-existent|https://arrow.apache.org/docs/format/Metadata.html#dictionary-encoding],
> so I am not sure how to implement it myself.