[ https://issues.apache.org/jira/browse/ARROW-10174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated ARROW-10174:
-----------------------------------
    Labels: pull-request-available  (was: )

> [Java] Reading of Dictionary encoded struct vector fails
> ---------------------------------------------------------
>
>                 Key: ARROW-10174
>                 URL: https://issues.apache.org/jira/browse/ARROW-10174
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Java
>    Affects Versions: 1.0.1
>            Reporter: Benjamin Wilhelm
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Write an index vector and a dictionary whose dictionary vector is of type
> {{Struct}} using an {{ArrowStreamWriter}}. Reading the stream back fails
> with an exception.
> Code to reproduce:
> {code:java}
> final RootAllocator allocator = new RootAllocator();
>
> // Create the dictionary
> final StructVector dict = StructVector.empty("Dict", allocator);
> final NullableStructWriter dictWriter = dict.getWriter();
> final IntWriter dictA = dictWriter.integer("a");
> final IntWriter dictB = dictWriter.integer("b");
> for (int i = 0; i < 3; i++) {
>     dictWriter.start();
>     dictA.writeInt(i);
>     dictB.writeInt(i);
>     dictWriter.end();
> }
> dict.setValueCount(3);
> final Dictionary dictionary = new Dictionary(dict, new DictionaryEncoding(1, false, null));
>
> // Create the vector
> final Random random = new Random();
> final StructVector vector = StructVector.empty("Dict", allocator);
> final NullableStructWriter vectorWriter = vector.getWriter();
> final IntWriter vectorA = vectorWriter.integer("a");
> final IntWriter vectorB = vectorWriter.integer("b");
> for (int i = 0; i < 10; i++) {
>     int v = random.nextInt(3);
>     vectorWriter.start();
>     vectorA.writeInt(v);
>     vectorB.writeInt(v);
>     vectorWriter.end();
> }
> vector.setValueCount(10);
>
> // Encode the vector using the dictionary
> final IntVector indexVector = (IntVector) DictionaryEncoder.encode(vector, dictionary);
>
> // Write the vector to out
> final ByteArrayOutputStream out = new ByteArrayOutputStream();
> final VectorSchemaRoot root = new VectorSchemaRoot(
>         Collections.singletonList(indexVector.getField()),
>         Collections.singletonList(indexVector));
> final ArrowStreamWriter writer = new ArrowStreamWriter(root,
>         new MapDictionaryProvider(dictionary), Channels.newChannel(out));
> writer.start();
> writer.writeBatch();
> writer.end();
>
> // Read the vector from out
> try (final ArrowStreamReader reader = new ArrowStreamReader(
>         new ByteArrayInputStream(out.toByteArray()), allocator)) {
>     reader.loadNextBatch();
>     final VectorSchemaRoot readRoot = reader.getVectorSchemaRoot();
>     final FieldVector readIndexVector = readRoot.getVector(0);
>
>     // Get the dictionary and decode
>     final Map<Long, Dictionary> readDictionaryMap = reader.getDictionaryVectors();
>     final Dictionary readDictionary =
>             readDictionaryMap.get(readIndexVector.getField().getDictionary().getId());
>     final ValueVector readVector = DictionaryEncoder.decode(readIndexVector, readDictionary);
> }
> {code}
> Exception:
> {code}
> java.lang.IllegalArgumentException: not all nodes and buffers were consumed. nodes: [ArrowFieldNode [length=3, nullCount=0], ArrowFieldNode [length=3, nullCount=0]] buffers: [ArrowBuf[21], address:140118352739688, length:1, ArrowBuf[22], address:140118352739696, length:12, ArrowBuf[23], address:140118352739712, length:1, ArrowBuf[24], address:140118352739720, length:12]
>     at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:63)
>     at org.apache.arrow.vector.ipc.ArrowReader.load(ArrowReader.java:241)
>     at org.apache.arrow.vector.ipc.ArrowReader.loadDictionary(ArrowReader.java:232)
>     at org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:129)
>     at com.knime.AppTest.testDictionaryStruct(AppTest.java:83)
> {code}
> If I see it correctly, the error happens in
> {{DictionaryUtilities#toMessageFormat}}. When a dictionary-encoded vector is
> encountered, the children of the memory format field are still used (there
> are none, because the index field is an Int). Instead, the children of the
> dictionary vector's field should be mapped to the message format and set as
> the children of the message format field.
> I can create a fix and open a pull request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
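
The mapping proposed in the last paragraph of the description can be sketched without Arrow on the classpath. This is only an illustration of the idea, not the actual patch: {{SimpleField}}, the {{dictionaryFields}} map and this {{toMessageFormat}} signature are hypothetical stand-ins and do not mirror Arrow's own classes. For a dictionary-encoded field, the sketch takes both the type and the children from the dictionary vector's field instead of from the (childless) index field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToMessageFormatSketch {

    /** Hypothetical stand-in for a field description; this is NOT Arrow's Field class. */
    static final class SimpleField {
        final String name;
        final String type;
        final Long dictionaryId; // null when the field is not dictionary encoded
        final List<SimpleField> children;

        SimpleField(String name, String type, Long dictionaryId, List<SimpleField> children) {
            this.name = name;
            this.type = type;
            this.dictionaryId = dictionaryId;
            this.children = children;
        }
    }

    /**
     * Maps a memory format field to its message format (assumed, simplified logic).
     * dictionaryFields is a hypothetical lookup from dictionary id to the field
     * of the corresponding dictionary vector.
     */
    static SimpleField toMessageFormat(SimpleField field, Map<Long, SimpleField> dictionaryFields) {
        String type = field.type;
        List<SimpleField> children = field.children;
        if (field.dictionaryId != null) {
            SimpleField dictField = dictionaryFields.get(field.dictionaryId);
            if (dictField == null) {
                throw new IllegalArgumentException("no dictionary with id " + field.dictionaryId);
            }
            // The message format carries the dictionary's value type, so the
            // children must come from the dictionary field, not the index field.
            type = dictField.type;
            children = dictField.children;
        }
        // Map the chosen children recursively; they may be dictionary encoded too.
        List<SimpleField> mapped = new ArrayList<>();
        for (SimpleField child : children) {
            mapped.add(toMessageFormat(child, dictionaryFields));
        }
        return new SimpleField(field.name, type, field.dictionaryId, mapped);
    }

    public static void main(String[] args) {
        // Dictionary field shaped like the one in the report: Struct with Int children a and b.
        SimpleField dict = new SimpleField("Dict", "Struct", null, Arrays.asList(
                new SimpleField("a", "Int", null, Collections.<SimpleField>emptyList()),
                new SimpleField("b", "Int", null, Collections.<SimpleField>emptyList())));
        Map<Long, SimpleField> dictionaries = new HashMap<>();
        dictionaries.put(1L, dict);

        // The index field is a plain Int with no children and dictionary id 1.
        SimpleField index = new SimpleField("Dict", "Int", 1L, Collections.<SimpleField>emptyList());

        SimpleField message = toMessageFormat(index, dictionaries);
        System.out.println(message.type + " with " + message.children.size() + " children");
        // prints "Struct with 2 children"
    }
}
```

If the children were taken from the index field instead, the message format field would be a Struct without children, which matches the symptom in the report: the two child field nodes and their four buffers are left unconsumed when the dictionary batch is loaded.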