On Wed, 2021-08-25 at 21:02 +0300, roee shlomo wrote:
> This means that an API to import an ArrowSchema (in C) into a
> Field/Schema (in Java) is not suitable for dictionary encoded arrays
> because there is an information loss. Specifically, there is nothing
> in Field/Schema to indicate the value type as far as we can tell.
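For concreteness, here is a minimal sketch of that information loss (assuming the arrow-vector artifact is on the classpath; the class name `DictFieldSketch` and column name "col" are made up for illustration). The in-memory Java Field for a dictionary-encoded column holds only the index type plus a DictionaryEncoding (id, ordered flag, index type); the value type is only reachable through the Dictionary registered under that id with a DictionaryProvider, which is not part of the Field/Schema:

```java
// Hypothetical illustration -- requires org.apache.arrow:arrow-vector
// on the classpath; names are made up for the example.
import java.util.Collections;

import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.DictionaryEncoding;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

public class DictFieldSketch {
  public static void main(String[] args) {
    // int32 indices into the dictionary.
    ArrowType.Int indexType = new ArrowType.Int(32, true);

    // The encoding carries only the dictionary id, orderedness, and the
    // index type -- there is no slot for the dictionary's value type.
    DictionaryEncoding encoding = new DictionaryEncoding(1L, false, indexType);

    // The Field's own type is the *index* type; the value type
    // (e.g. Utf8) lives only in the Dictionary registered under id 1
    // with a DictionaryProvider, outside the Field/Schema.
    Field field =
        new Field("col", new FieldType(true, indexType, encoding), Collections.emptyList());

    System.out.println(field.getType());               // the index type
    System.out.println(field.getDictionary().getId()); // 1
  }
}
```

So a consumer importing such a Field from the C Data Interface would still need the dictionary itself to recover the value type.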
I think maybe the IPC code can be referenced here:

1. (C++) Serialization of a field with a dictionary:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata_internal.cc#L696-L735
2. (Java) Deserialization of a field with a dictionary:
https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java#L133-L177

And this piece of code shows how a Java Arrow schema organizes the dictionary index type and value type:
https://github.com/apache/arrow/blob/5003278ded77f1ab385425143aafd085fda1f701/java/vector/src/test/java/org/apache/arrow/vector/ipc/MessageSerializerTest.java#L143-L155

> Even if that were solved, importing dictionary encoded arrays is too
> complex from a user point of view. We would need to import both the
> vector and a dictionary provider (i.e. multiple return values in some
> cases) and the user would be responsible for taking ownership of
> every vector in the dictionary provider and eventually closing it.
> This adds a lot of complexity for cases like importing ArrowArray (C)
> into an existing VectorSchemaRoot (when importing in batches).

If VectorSchemaRoot doesn't cooperate here, would it be an option to have another API that exports/imports via Java ArrowRecordBatch/ArrowDictionaryBatch, or some sort of composite buffer-based structure that doesn't use the Java Vector facilities at all? Users would always be able to load these buffers via VectorLoader/VectorSchemaRoot/ArrowReader themselves, using an already-imported schema object.

Best,
Hongze
