Hi Roee,

It seems that the Java implementation keeps both the raw value type and the
encoded (index) type, so is there really an information loss?
In particular, we have
org.apache.arrow.vector.types.pojo.FieldType#type for the raw type and
org.apache.arrow.vector.types.pojo.FieldType#dictionary#indexType for the
encoded type.

Best,
Liya Fan

On Thu, Aug 26, 2021 at 10:09 AM Hongze Zhang <notify...@126.com> wrote:

> On Wed, 2021-08-25 at 21:02 +0300, roee shlomo wrote:
> >
> > This means that an API to import an ArrowSchema (in C) into a
> > Field/Schema (in Java) is not suitable for dictionary encoded arrays
> > because there is an information loss. Specifically, there is nothing
> > in Field/Schema to indicate the value type as far as we can tell.
>
> I think the IPC code may serve as a reference here:
>
> 1. (C++) Serialization of field with dictionary:
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata_internal.cc#L696-L735
>
> 2. (Java) Deserialization of field with dictionary:
>
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java#L133-L177
>
> And this piece of code shows how the Java Arrow schema organizes the
> dict index type and value type:
>
> https://github.com/apache/arrow/blob/5003278ded77f1ab385425143aafd085fda1f701/java/vector/src/test/java/org/apache/arrow/vector/ipc/MessageSerializerTest.java#L143-L155
>
> >
> > Even if that were solved, importing dictionary encoded arrays is too
> > complex from a user point of view. We would need to import both the
> > vector and a dictionary provider (i.e. multiple return values in some
> > cases) and the user would be responsible for taking ownership of
> > every vector in the dictionary provider and eventually closing it.
> > This adds a lot of complexity for cases like importing ArrowArray (C)
> > into an existing VectorSchemaRoot (when importing in batches).
>
> If VectorSchemaRoot doesn't cooperate here, would it be an option to
> have another API to export/import via Java
> ArrowRecordBatch/ArrowDictionaryBatch or some sort of composite
> buffer-based structure, which doesn't utilize Java Vector facilities
> at all? Users would always be able to have these buffers loaded via
> VectorLoader/VectorSchemaRoot/ArrowReader by themselves with an
> already imported schema object.
>
> Best,
> Hongze
>
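
P.S. To make the point at the top of this message concrete, here is a
minimal, self-contained sketch of a field descriptor that carries both the
raw value type and the dictionary index type. The class names
(SketchFieldType, SketchDictionaryEncoding, DictionaryFieldSketch) and the
string-typed fields are hypothetical stand-ins chosen for illustration; they
only mirror the FieldType#type and FieldType#dictionary#indexType layout
mentioned above and are not the Arrow classes themselves.

```java
import java.util.Objects;

// Hypothetical stand-in (NOT the Arrow class): models a dictionary encoding.
final class SketchDictionaryEncoding {
    final long id;           // which dictionary the indices refer to
    final String indexType;  // encoded (index) type, e.g. "Int(32, true)"

    SketchDictionaryEncoding(long id, String indexType) {
        this.id = id;
        this.indexType = Objects.requireNonNull(indexType);
    }
}

// Hypothetical stand-in (NOT the Arrow class): models a field type that
// keeps the raw value type next to an optional dictionary encoding.
final class SketchFieldType {
    final String type;                          // raw (value) type, e.g. "Utf8"
    final SketchDictionaryEncoding dictionary;  // null if not dictionary-encoded

    SketchFieldType(String type, SketchDictionaryEncoding dictionary) {
        this.type = Objects.requireNonNull(type);
        this.dictionary = dictionary;
    }

    boolean isDictionaryEncoded() {
        return dictionary != null;
    }

    // Reports every type the descriptor holds, so both remain recoverable.
    String describe() {
        if (!isDictionaryEncoded()) {
            return "value=" + type;
        }
        return "value=" + type
            + ", index=" + dictionary.indexType
            + ", dictionaryId=" + dictionary.id;
    }
}

final class DictionaryFieldSketch {
    public static void main(String[] args) {
        SketchFieldType plain = new SketchFieldType("Utf8", null);
        SketchFieldType encoded = new SketchFieldType(
            "Utf8", new SketchDictionaryEncoding(1L, "Int(32, true)"));
        System.out.println(plain.describe());
        System.out.println(encoded.describe());
    }
}
```

If the descriptor produced by an import keeps both pieces like this, the
value type can be recovered from the schema alone. Whether the C data
interface import can populate both is exactly the open question in this
thread, so please treat the sketch as an illustration and not as a
description of the current importer.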