We are currently implementing the C Data Interface in Java and have some questions regarding dictionary-encoded arrays. We would appreciate some help and guidance, especially from an API perspective.

In Java, the dictionary vector is completely separate from the encoded vector. Typically, a DictionaryProvider is available alongside a dictionary-encoded vector (to provide dictionaries for the vector and its children). The C Data Interface, on the other hand, bundles the dictionary into the array. This means that an API that imports an ArrowSchema (C) into a Field/Schema (Java) is not suitable for dictionary-encoded arrays, because information is lost. Specifically, as far as we can tell, there is nothing in Field/Schema to indicate the value type.

Even if that were solved, importing dictionary-encoded arrays is too complex from a user's point of view. We would need to import both the vector and a dictionary provider (i.e. multiple return values in some cases), and the user would be responsible for taking ownership of every vector in the dictionary provider and eventually closing it. This adds a lot of complexity for cases like importing an ArrowArray (C) into an existing VectorSchemaRoot (when importing in batches).

We tried to follow the same API as the C++ implementation, but with dictionaries we cannot keep the same API. The complexity is mostly in the imports. Is there any pattern that we should follow? Our proposed API, without dictionary support, is at:
https://github.com/roee88/arrow-java-ffi/blob/main/src/main/java/org/apache/arrow/ffi/FFI.java

Furthermore, in Java it seems that exporting/importing dictionaries independently of the vectors would avoid passing the same dictionary values multiple times (e.g., when sending batches). What was the motivation for flattening the dictionaries?

We would like to submit a PR without dictionary support first and mark the API as experimental. We would then like to address dictionary support separately, with the help of the community. Is that acceptable?

Thank you very much.
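
P.S. To make the ownership concern concrete, here is a rough, self-contained sketch of what an import that returns both a vector and its dictionaries might look like. It deliberately does not use any Arrow classes: `Resource`, `ImportResult`, and `importArray` are hypothetical names invented for this illustration and are not part of our proposed API or of Arrow. `Resource` stands in for anything that owns off-heap memory, such as a vector.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DictionaryImportSketch {

    /** Hypothetical stand-in for an object that owns off-heap memory (e.g. a vector). */
    static final class Resource implements AutoCloseable {
        final String name;
        boolean closed = false;

        Resource(String name) {
            this.name = name;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Hypothetical result type: the encoded vector plus every dictionary it refers to. */
    static final class ImportResult implements AutoCloseable {
        final Resource vector;
        final Map<Long, Resource> dictionaries; // keyed by dictionary id

        ImportResult(Resource vector, Map<Long, Resource> dictionaries) {
            this.vector = vector;
            this.dictionaries = dictionaries;
        }

        /** Closing the result releases the vector and all dictionaries it came with. */
        @Override
        public void close() {
            vector.close();
            for (Resource dictionary : dictionaries.values()) {
                dictionary.close();
            }
        }
    }

    /** Hypothetical import: one call yields several objects the caller must own. */
    static ImportResult importArray() {
        Map<Long, Resource> dictionaries = new LinkedHashMap<>();
        dictionaries.put(0L, new Resource("dictionary-0"));
        return new ImportResult(new Resource("encoded-vector"), dictionaries);
    }

    public static void main(String[] args) {
        ImportResult result = importArray();
        // try-with-resources closes the vector and its dictionaries together.
        try (ImportResult r = result) {
            System.out.println("imported " + r.vector.name
                    + " with " + r.dictionaries.size() + " dictionary(ies)");
        }
        System.out.println("vector closed: " + result.vector.closed);
    }
}
```

Bundling everything into one closeable result keeps the single-call case manageable, but it does not obviously extend to importing batches into an existing VectorSchemaRoot, where the dictionaries would outlive any one import call. That is the part we are unsure how to design.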