We are currently implementing the C Data Interface in Java and have some
questions regarding dictionary-encoded arrays. We would appreciate some
help and guidance, especially from an API perspective.

In Java, the dictionary vector is completely separate from the encoded
vector. Typically, a DictionaryProvider is available alongside a
dictionary-encoded vector (to provide dictionaries for the vector and its
children). In contrast, the C Data Interface bundles the dictionary into
the array.
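
To make the mismatch concrete, here is a minimal sketch of the two layouts.
All types below (EncodedVector, Provider, BundledArray) are hypothetical
stand-ins written for this email, not the Arrow Java classes, and String
values are assumed only to keep the example short.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in types for illustration only; NOT the Arrow Java API.
class DictionaryLayoutSketch {

    // Java-style layout: the encoded vector holds indices and a dictionary id.
    static final class EncodedVector {
        final long dictionaryId;
        final int[] indices;

        EncodedVector(long dictionaryId, int[] indices) {
            this.dictionaryId = dictionaryId;
            this.indices = indices;
        }
    }

    // Java-style provider: dictionaries live outside the vector, keyed by id.
    static final class Provider {
        private final Map<Long, List<String>> dictionaries = new HashMap<>();

        void put(long id, List<String> values) {
            dictionaries.put(id, values);
        }

        List<String> lookup(long id) {
            return dictionaries.get(id);
        }
    }

    // C-Data-Interface-style layout: the array carries its own dictionary.
    static final class BundledArray {
        final int[] indices;
        final List<String> dictionary;

        BundledArray(int[] indices, List<String> dictionary) {
            this.indices = indices;
            this.dictionary = dictionary;
        }
    }

    // Going from the Java layout to the bundled layout requires the provider.
    static BundledArray bundle(EncodedVector vector, Provider provider) {
        List<String> values = provider.lookup(vector.dictionaryId);
        if (values == null) {
            throw new IllegalStateException(
                "no dictionary with id " + vector.dictionaryId);
        }
        return new BundledArray(vector.indices, values);
    }

    // Decoding a bundled array needs nothing besides the array itself.
    static String decode(BundledArray array, int position) {
        return array.dictionary.get(array.indices[position]);
    }
}
```

Going in the other direction (bundled layout to Java layout) is where the
trouble starts, because one incoming array has to be split into a vector and
a separately owned dictionary.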

This means that an API to import an ArrowSchema (in C) into a Field/Schema
(in Java) is not suitable for dictionary-encoded arrays because information
is lost. Specifically, as far as we can tell, there is nothing in
Field/Schema to indicate the dictionary's value type.

Even if that were solved, importing dictionary-encoded arrays would still
be too complex from a user's point of view. We would need to return both
the vector and a dictionary provider (i.e. multiple return values in some
cases), and the user would be responsible for taking ownership of every
vector in the dictionary provider and eventually closing it. This adds a
lot of complexity for cases like importing an ArrowArray (C) into an
existing VectorSchemaRoot (when importing in batches).
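
As a rough illustration of that burden from the caller's side, consider the
sketch below. The types and the importArray method are hypothetical
placeholders invented for this email (they are not part of our proposed API
or of Arrow Java); Vec simply stands in for anything that owns off-heap
memory.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in types for illustration only; NOT the Arrow Java API.
class ImportOwnershipSketch {

    // Stand-in for a vector that owns off-heap memory and must be closed.
    static final class Vec implements AutoCloseable {
        private boolean closed = false;

        boolean isClosed() {
            return closed;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // What importing a dictionary-encoded array would have to hand back:
    // the encoded vector plus every dictionary it (or its children) needs.
    static final class Imported {
        final Vec encoded;
        final List<Vec> dictionaries;

        Imported(Vec encoded, List<Vec> dictionaries) {
            this.encoded = encoded;
            this.dictionaries = dictionaries;
        }
    }

    // Hypothetical import call; a real one would read the C structures.
    static Imported importArray(int dictionaryCount) {
        List<Vec> dictionaries = new ArrayList<>();
        for (int i = 0; i < dictionaryCount; i++) {
            dictionaries.add(new Vec());
        }
        return new Imported(new Vec(), dictionaries);
    }

    // The caller's burden: release the vector and each dictionary in turn.
    static void release(Imported imported) {
        imported.encoded.close();
        for (Vec dictionary : imported.dictionaries) {
            dictionary.close();
        }
    }

    public static void main(String[] args) {
        Imported imported = importArray(2);
        release(imported);
        System.out.println("encoded closed: " + imported.encoded.isClosed());
    }
}
```

Forgetting any one of those close calls would leak memory, and the
bookkeeping repeats for every imported batch.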

We tried to follow the same API as the C++ implementation, but with
dictionaries we cannot keep the same API. The complexity is mostly in the
imports. Is there any pattern that we should follow? Help would be much
appreciated. Our proposed API, without dictionary support, is at
https://github.com/roee88/arrow-java-ffi/blob/main/src/main/java/org/apache/arrow/ffi/FFI.java

Furthermore, in Java it seems that exporting and importing dictionaries
independently of the vectors would avoid passing the same dictionary values
multiple times (e.g., when sending batches). What was the motivation for
flattening the dictionaries?

We would like to submit a PR without dictionary support first and mark the
API as experimental. We would like to address dictionary support
separately, with the help of the community. Is that acceptable?

Thank you very much.
