vibhatha commented on code in PR #41930:
URL: https://github.com/apache/arrow/pull/41930#discussion_r1625183754
##########
java/c/src/main/java/org/apache/arrow/c/StructVectorLoader.java:
##########
@@ -54,7 +55,12 @@ public class StructVectorLoader {
/**
* Construct with a schema.
- *
+ * <p>
+ * The schema referred to here can be obtained from the struct vector.
+ * Consider a {@link StructVector} structVector, the schema can be obtained as follows:
+ * <code>
+ * Schema schema = new Schema(structVector.getField().getChildren());
+ * </code>
Review Comment:
Let's take the following route through the test case
`testVectorLoadUnloadOnStructVector` in `DictionaryTest` in the `C` module.
We have the following
```java
try (final StructVector structVector1 = StructVector.empty("struct", allocator)) {
  createStructVector(structVector1);
  Field field1 = structVector1.getField();
  Schema schema = new Schema(field1.getChildren());
  StructVectorUnloader vectorUnloader = new StructVectorUnloader(structVector1);
  try (ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch();
      BufferAllocator finalVectorsAllocator = allocator.newChildAllocator("struct", 0, Long.MAX_VALUE)) {
    // validating recordBatch contains an output for variadicBufferCounts
    assertFalse(recordBatch.getVariadicBufferCounts().isEmpty());
    assertEquals(1, recordBatch.getVariadicBufferCounts().size());
    assertEquals(1, recordBatch.getVariadicBufferCounts().get(0));
    StructVectorLoader vectorLoader = new StructVectorLoader(schema);
    try (StructVector structVector2 = vectorLoader.load(finalVectorsAllocator, recordBatch)) {
      // Improve this after fixing https://github.com/apache/arrow/issues/41933
      // assertTrue(VectorEqualsVisitor.vectorEquals(structVector1, structVector2), "vectors are not equivalent");
      assertTrue(VectorEqualsVisitor.vectorEquals(structVector1.getChild("f0"), structVector2.getChild("f0")),
          "vectors are not equivalent");
      assertTrue(VectorEqualsVisitor.vectorEquals(structVector1.getChild("f1"), structVector2.getChild("f1")),
          "vectors are not equivalent");
    }
  }
}
```
First we need to create the `StructVectorUnloader`. Let's take a look at
`ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch();`,
which extracts all the metadata into an `ArrowRecordBatch`:
```java
public ArrowRecordBatch getRecordBatch() {
  List<ArrowFieldNode> nodes = new ArrayList<>();
  List<ArrowBuf> buffers = new ArrayList<>();
  List<Long> variadicBufferCounts = new ArrayList<>();
  for (FieldVector vector : root.getChildrenFromFields()) {
    appendNodes(vector, nodes, buffers, variadicBufferCounts);
  }
  return new ArrowRecordBatch(root.getValueCount(), nodes, buffers,
      CompressionUtil.createBodyCompression(codec), variadicBufferCounts, alignBuffers);
}
```
`root.getChildrenFromFields()` returns only the children; in our case:
```
0 = {ViewVarCharVector@3494} "[01234567890, 012345678901234567]"
1 = {IntVector@3495} "[10, 11]"
result = {ArrayList@3511} size = 2
```
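The per-child flattening can be sketched with a minimal, self-contained model of the `appendNodes` recursion (plain Java, not Arrow code; the `Node` record and the string list are illustrative stand-ins, not Arrow's `ArrowFieldNode`):

```java
import java.util.ArrayList;
import java.util.List;

class AppendNodesSketch {
  // Illustrative stand-in for a vector: a name plus child vectors.
  record Node(String name, List<Node> children) {}

  // Mirrors appendNodes: emit one entry per vector, then recurse into children.
  static void appendNodes(Node vector, List<String> nodes) {
    nodes.add(vector.name());
    for (Node child : vector.children()) {
      appendNodes(child, nodes);
    }
  }

  public static void main(String[] args) {
    Node struct = new Node("struct", List.of(
        new Node("f0", List.of()),
        new Node("f1", List.of())));

    // getRecordBatch iterates root.getChildrenFromFields(), i.e. the children
    // only -- the struct itself never contributes a node.
    List<String> nodes = new ArrayList<>();
    for (Node child : struct.children()) {
      appendNodes(child, nodes);
    }
    System.out.println(nodes); // [f0, f1]
  }
}
```

With nested children (say, a list inside the struct) the recursion would add further entries, but for this flat struct we get exactly two.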
Thus we will have `2` `nodes`. Then we need `StructVectorLoader vectorLoader = new StructVectorLoader(schema);`,
a `StructVectorLoader`, to create the `StructVector` back from the `ArrowRecordBatch`.
For the moment, let's set aside what is required for the `schema`,
move forward with the logic, and look into `load`:
```java
StructVector structVector2 = vectorLoader.load(finalVectorsAllocator, recordBatch)
```
In there, by default, we do the following:
```java
StructVector result = StructVector.emptyWithDuplicates("", allocator);
result.initializeChildrenFromFields(this.schema.getFields());
```
`result.getField()` would give us `Struct<f0: Utf8View, f1: Int(32, true)> not null`,
which is basically the schema we want for the output (this happens because of our schema
selection, which we will discuss later). The name of the original struct will be missing,
though, since we do `StructVector.emptyWithDuplicates("", allocator)`;
this is reported [here](https://github.com/apache/arrow/issues/41933).
Looking into the `load` function:
```java
public StructVector load(BufferAllocator allocator, ArrowRecordBatch recordBatch) {
  StructVector result = StructVector.emptyWithDuplicates("", allocator);
  result.initializeChildrenFromFields(this.schema.getFields());
  Iterator<ArrowBuf> buffers = recordBatch.getBuffers().iterator();
  Iterator<ArrowFieldNode> nodes = recordBatch.getNodes().iterator();
  CompressionUtil.CodecType codecType = CompressionUtil.CodecType
      .fromCompressionType(recordBatch.getBodyCompression().getCodec());
  decompressionNeeded = codecType != CompressionUtil.CodecType.NO_COMPRESSION;
  CompressionCodec codec = decompressionNeeded ? factory.createCodec(codecType) : NoCompressionCodec.INSTANCE;
  Iterator<Long> variadicBufferCounts = Collections.emptyIterator();
  if (recordBatch.getVariadicBufferCounts() != null && !recordBatch.getVariadicBufferCounts().isEmpty()) {
    variadicBufferCounts = recordBatch.getVariadicBufferCounts().iterator();
  }
  for (FieldVector fieldVector : result.getChildrenFromFields()) {
    loadBuffers(fieldVector, fieldVector.getField(), buffers, nodes, codec, variadicBufferCounts);
  }
  result.loadFieldBuffers(new ArrowFieldNode(recordBatch.getLength(), 0), Collections.singletonList(null));
  if (nodes.hasNext() || buffers.hasNext()) {
    throw new IllegalArgumentException("not all nodes and buffers were consumed. nodes: " +
        Collections2.toList(nodes).toString() + " buffers: " + Collections2.toList(buffers).toString());
  }
  return result;
}
```
`result.getChildrenFromFields()` would return the following.
```
result = {ArrayList@3616} size = 2
0 = {ViewVarCharVector@3620} "[]"
1 = {IntVector@3621} "[]"
```
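Note how `load` consumes the batch: each child of the result vector pulls entries from the shared node/buffer iterators, and any leftovers trigger the "not all nodes and buffers were consumed" error. A minimal, self-contained sketch of that pattern (plain Java; `consume` and the string lists are illustrative, not Arrow APIs):

```java
import java.util.Iterator;
import java.util.List;

class ConsumeSketch {
  // Pull one node per child from a shared iterator; throw on leftovers,
  // mirroring the final check in load().
  static int consume(List<String> batchNodes, List<String> resultChildren) {
    Iterator<String> nodes = batchNodes.iterator();
    int loaded = 0;
    for (String child : resultChildren) {
      nodes.next(); // NoSuchElementException here ~ "no more field nodes for field ..."
      loaded++;
    }
    if (nodes.hasNext()) {
      throw new IllegalArgumentException("not all nodes were consumed");
    }
    return loaded;
  }

  public static void main(String[] args) {
    // Batch with 2 nodes, result StructVector with 2 children: counts match.
    System.out.println(consume(List.of("f0", "f1"), List.of("f0", "f1"))); // 2
  }
}
```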
Now let's go back to the schema. Re-evaluating the first two lines of the
function
```java
StructVector result = StructVector.emptyWithDuplicates("", allocator);
result.initializeChildrenFromFields(this.schema.getFields());
```
Here the schema for the resultant `StructVector` is obtained from the `Schema` we passed
to the `StructVectorLoader`. If we pass `new Schema(List.of(structVector1.getField()))` ->
`Schema<struct: Struct<f0: Utf8View, f1: Int(32, true)>>`
instead of `new Schema(structVector1.getField().getChildren())` ->
`Schema<f0: Utf8View, f1: Int(32, true)>`, the loader's `StructVector`
schema would have an additional node, the `StructVector` itself. But that causes a mismatch in the
logic here, since the loader uses the passed schema to create that `result` `StructVector`.
Thus it would fail at `checkArgument(nodes.hasNext(), "no more field nodes for field %s and vector %s", field, vector);`,
since we created just 2 nodes, but here we would look for 3, as the schema would have 3
layers to load buffers from: first `struct`, second the
`viewVarCharVector`, and third the `intVector`.
It took me a while to realize this. In a naive or obvious sense I would
assume I should just pass the schema of the `StructVector`
to the `StructVectorLoader`. But it seems that is not the case.
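To make the node-count mismatch concrete, here is a minimal, self-contained sketch (plain Java, not Arrow code; `Field` and `countNodes` are hypothetical stand-ins for the per-vector node accounting) comparing the two schema choices against the two nodes the unloader actually produced:

```java
import java.util.List;

class NodeCountSketch {
  // Illustrative field tree: a name plus child fields.
  record Field(String name, List<Field> children) {}

  // One node per field, recursing into children.
  static int countNodes(Field field) {
    int count = 1;
    for (Field child : field.children()) {
      count += countNodes(child);
    }
    return count;
  }

  public static void main(String[] args) {
    Field f0 = new Field("f0", List.of());
    Field f1 = new Field("f1", List.of());
    Field struct = new Field("struct", List.of(f0, f1));

    // The unloader walks the struct's children only -> 2 nodes in the batch.
    int produced = countNodes(f0) + countNodes(f1);

    // Schema built from getChildren(): the loader also expects 2 nodes -> match.
    int expectedChildrenSchema = countNodes(f0) + countNodes(f1);

    // Schema wrapping the struct field itself: the loader expects 3 -> mismatch.
    int expectedStructSchema = countNodes(struct);

    System.out.println(produced + " vs " + expectedChildrenSchema
        + " vs " + expectedStructSchema); // 2 vs 2 vs 3
  }
}
```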
Furthermore, I naively tried the following to mitigate that
```java
public ArrowRecordBatch getRecordBatch() {
  List<ArrowFieldNode> nodes = new ArrayList<>();
  List<ArrowBuf> buffers = new ArrayList<>();
  List<Long> variadicBufferCounts = new ArrayList<>();
  // commenting to show the diff
  // for (FieldVector vector : root.getChildrenFromFields()) {
  //   appendNodes(vector, nodes, buffers, variadicBufferCounts);
  // }
  appendNodes(root, nodes, buffers, variadicBufferCounts);
  return new ArrowRecordBatch(root.getValueCount(), nodes, buffers,
      CompressionUtil.createBodyCompression(codec), variadicBufferCounts, alignBuffers);
}
```
This fulfills my expectation (though I am not sure it is the right one), and it makes this
test case pass if I just pass the struct's schema. However, it breaks a series of other
tests in `RoundTrip`, `Dictionary`, and some other places.
So fixing it the way I preferred would be a breaking change, as far as I can tell.
Am I wrong here? Could we do something better?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]