vibhatha commented on code in PR #41930:
URL: https://github.com/apache/arrow/pull/41930#discussion_r1625183754
##########
java/c/src/main/java/org/apache/arrow/c/StructVectorLoader.java:
##########
@@ -54,7 +55,12 @@ public class StructVectorLoader {
/**
* Construct with a schema.
- *
+ * <p>
+ * The schema referred to here can be obtained from the struct vector.
+ * Consider a {@link StructVector} structVector, the schema can be obtained as follows:
+ * <code>
+ * Schema schema = new Schema(structVector.getField().getChildren());
+ * </code>
Review Comment:
Let's take the following route through the test case
`testVectorLoadUnloadOnStructVector` in `DictionaryTest` in the `C` module.
We have the following
```java
try (final StructVector structVector1 = StructVector.empty("struct", allocator)) {
  createStructVector(structVector1);
  Field field1 = structVector1.getField();
  Schema schema = new Schema(field1.getChildren());
  StructVectorUnloader vectorUnloader = new StructVectorUnloader(structVector1);
  try (ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch();
      BufferAllocator finalVectorsAllocator = allocator.newChildAllocator("struct", 0, Long.MAX_VALUE)) {
    // validating recordBatch contains an output for variadicBufferCounts
    assertFalse(recordBatch.getVariadicBufferCounts().isEmpty());
    assertEquals(1, recordBatch.getVariadicBufferCounts().size());
    assertEquals(1, recordBatch.getVariadicBufferCounts().get(0));
    StructVectorLoader vectorLoader = new StructVectorLoader(schema);
    try (StructVector structVector2 = vectorLoader.load(finalVectorsAllocator, recordBatch)) {
      // Improve this after fixing https://github.com/apache/arrow/issues/41933
      // assertTrue(VectorEqualsVisitor.vectorEquals(structVector1, structVector2), "vectors are not equivalent");
      assertTrue(VectorEqualsVisitor.vectorEquals(structVector1.getChild("f0"), structVector2.getChild("f0")),
          "vectors are not equivalent");
      assertTrue(VectorEqualsVisitor.vectorEquals(structVector1.getChild("f1"), structVector2.getChild("f1")),
          "vectors are not equivalent");
    }
  }
}
```
First we need to create the `StructVectorUnloader`. Let's take a look at
`ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch();`,
which extracts all the metadata into an `ArrowRecordBatch`:
```java
public ArrowRecordBatch getRecordBatch() {
  List<ArrowFieldNode> nodes = new ArrayList<>();
  List<ArrowBuf> buffers = new ArrayList<>();
  List<Long> variadicBufferCounts = new ArrayList<>();
  for (FieldVector vector : root.getChildrenFromFields()) {
    appendNodes(vector, nodes, buffers, variadicBufferCounts);
  }
  return new ArrowRecordBatch(root.getValueCount(), nodes, buffers,
      CompressionUtil.createBodyCompression(codec), variadicBufferCounts, alignBuffers);
}
```
`root.getChildrenFromFields()` returns only the children; in our case:
```
0 = {ViewVarCharVector@3494} "[01234567890, 012345678901234567]"
1 = {IntVector@3495} "[10, 11]"
result = {ArrayList@3511} size = 2
```
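The per-child flattening can be sketched with a minimal, self-contained model of the `appendNodes` recursion (plain Java, not Arrow code; the `Node` record and the string list are illustrative stand-ins, not Arrow's `ArrowFieldNode`):

```java
import java.util.ArrayList;
import java.util.List;

class AppendNodesSketch {
  // Illustrative stand-in for a vector: a name plus child vectors.
  record Node(String name, List<Node> children) {}

  // Mirrors appendNodes: emit one entry per vector, then recurse into children.
  static void appendNodes(Node vector, List<String> nodes) {
    nodes.add(vector.name());
    for (Node child : vector.children()) {
      appendNodes(child, nodes);
    }
  }

  public static void main(String[] args) {
    Node struct = new Node("struct", List.of(
        new Node("f0", List.of()),
        new Node("f1", List.of())));

    // getRecordBatch iterates root.getChildrenFromFields(), i.e. the children
    // only -- the struct itself never contributes a node.
    List<String> nodes = new ArrayList<>();
    for (Node child : struct.children()) {
      appendNodes(child, nodes);
    }
    System.out.println(nodes); // [f0, f1]
  }
}
```

With nested children (say, a list inside the struct) the recursion would add further entries, but for this flat struct we get exactly two.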
Thus we will have `2` `nodes`. Then we need `StructVectorLoader vectorLoader = new StructVectorLoader(schema);`,
a `StructVectorLoader`, to create the `StructVector` back from the `ArrowRecordBatch`.
For the moment, let's set aside what is required for the `schema`,
move forward with the logic, and look into `load`:
```java
StructVector structVector2 = vectorLoader.load(finalVectorsAllocator, recordBatch)
```
In there, by default, we do the following:
```java
StructVector result = StructVector.emptyWithDuplicates("", allocator);
result.initializeChildrenFromFields(this.schema.getFields());
```
`result.getField()` would give us `Struct<f0: Utf8View, f1: Int(32, true)> not null`,
which is basically the schema we want for the output (this happens because of our schema
selection, which we will discuss later). The name of the original struct will be missing,
though, since we do `StructVector.emptyWithDuplicates("", allocator)`;
this is reported [here](https://github.com/apache/arrow/issues/41933).
Looking into the `load` function:
```java
public StructVector load(BufferAllocator allocator, ArrowRecordBatch recordBatch) {
  StructVector result = StructVector.emptyWithDuplicates("", allocator);
  result.initializeChildrenFromFields(this.schema.getFields());
  Iterator<ArrowBuf> buffers = recordBatch.getBuffers().iterator();
  Iterator<ArrowFieldNode> nodes = recordBatch.getNodes().iterator();
  CompressionUtil.CodecType codecType = CompressionUtil.CodecType
      .fromCompressionType(recordBatch.getBodyCompression().getCodec());
  decompressionNeeded = codecType != CompressionUtil.CodecType.NO_COMPRESSION;
  CompressionCodec codec = decompressionNeeded ? factory.createCodec(codecType) : NoCompressionCodec.INSTANCE;
  Iterator<Long> variadicBufferCounts = Collections.emptyIterator();
  if (recordBatch.getVariadicBufferCounts() != null && !recordBatch.getVariadicBufferCounts().isEmpty()) {
    variadicBufferCounts = recordBatch.getVariadicBufferCounts().iterator();
  }
  for (FieldVector fieldVector : result.getChildrenFromFields()) {
    loadBuffers(fieldVector, fieldVector.getField(), buffers, nodes, codec, variadicBufferCounts);
  }
  result.loadFieldBuffers(new ArrowFieldNode(recordBatch.getLength(), 0), Collections.singletonList(null));
  if (nodes.hasNext() || buffers.hasNext()) {
    throw new IllegalArgumentException("not all nodes and buffers were consumed. nodes: " +
        Collections2.toList(nodes).toString() + " buffers: " + Collections2.toList(buffers).toString());
  }
  return result;
}
```
`result.getChildrenFromFields()` would return the following.
```
result = {ArrayList@3616} size = 2
0 = {ViewVarCharVector@3620} "[]"
1 = {IntVector@3621} "[]"
```
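Note how `load` consumes the batch: each child of the result vector pulls entries from the shared node/buffer iterators, and any leftovers trigger the "not all nodes and buffers were consumed" error. A minimal, self-contained sketch of that pattern (plain Java; `consume` and the string lists are illustrative, not Arrow APIs):

```java
import java.util.Iterator;
import java.util.List;

class ConsumeSketch {
  // Pull one node per child from a shared iterator; throw on leftovers,
  // mirroring the final check in load().
  static int consume(List<String> batchNodes, List<String> resultChildren) {
    Iterator<String> nodes = batchNodes.iterator();
    int loaded = 0;
    for (String child : resultChildren) {
      nodes.next(); // NoSuchElementException here ~ "no more field nodes for field ..."
      loaded++;
    }
    if (nodes.hasNext()) {
      throw new IllegalArgumentException("not all nodes were consumed");
    }
    return loaded;
  }

  public static void main(String[] args) {
    // Batch with 2 nodes, result StructVector with 2 children: counts match.
    System.out.println(consume(List.of("f0", "f1"), List.of("f0", "f1"))); // 2
  }
}
```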
Now let's go back to the schema. Re-evaluating the first two lines of the
function
```java
StructVector result = StructVector.emptyWithDuplicates("", allocator);
result.initializeChildrenFromFields(this.schema.getFields());
```
Here the schema for the resultant `StructVector` is obtained from the `Schema` we passed
to the `StructVectorLoader`. If we pass `new Schema(List.of(structVector1.getField()))` ->
`Schema<struct: Struct<f0: Utf8View, f1: Int(32, true)>>`
instead of `new Schema(structVector1.getField().getChildren())` ->
`Schema<f0: Utf8View, f1: Int(32, true)>`, the loader's `StructVector`
schema would have an additional node, the `StructVector` itself. But that causes a mismatch in the
logic here, since the loader uses the passed schema to create that `result` `StructVector`.
Thus it would fail at `checkArgument(nodes.hasNext(), "no more field nodes for field %s and vector %s", field, vector);`,
since we created just 2 nodes, but here we would look for 3, as the schema would have 3
layers to load buffers from: first `struct`, second the
`viewVarCharVector`, and third the `intVector`.
It took me a while to realize this. In a naive or obvious sense I would
assume I should just pass the schema of the `StructVector`
to the `StructVectorLoader`. But it seems that is not the case.
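To make the node-count mismatch concrete, here is a minimal, self-contained sketch (plain Java, not Arrow code; `Field` and `countNodes` are hypothetical stand-ins for the per-vector node accounting) comparing the two schema choices against the two nodes the unloader actually produced:

```java
import java.util.List;

class NodeCountSketch {
  // Illustrative field tree: a name plus child fields.
  record Field(String name, List<Field> children) {}

  // One node per field, recursing into children.
  static int countNodes(Field field) {
    int count = 1;
    for (Field child : field.children()) {
      count += countNodes(child);
    }
    return count;
  }

  public static void main(String[] args) {
    Field f0 = new Field("f0", List.of());
    Field f1 = new Field("f1", List.of());
    Field struct = new Field("struct", List.of(f0, f1));

    // The unloader walks the struct's children only -> 2 nodes in the batch.
    int produced = countNodes(f0) + countNodes(f1);

    // Schema built from getChildren(): the loader also expects 2 nodes -> match.
    int expectedChildrenSchema = countNodes(f0) + countNodes(f1);

    // Schema wrapping the struct field itself: the loader expects 3 -> mismatch.
    int expectedStructSchema = countNodes(struct);

    System.out.println(produced + " vs " + expectedChildrenSchema
        + " vs " + expectedStructSchema); // 2 vs 2 vs 3
  }
}
```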
Furthermore, I naively tried the following to mitigate that
```java
public ArrowRecordBatch getRecordBatch() {
  List<ArrowFieldNode> nodes = new ArrayList<>();
  List<ArrowBuf> buffers = new ArrayList<>();
  List<Long> variadicBufferCounts = new ArrayList<>();
  // commenting to show the diff
  // for (FieldVector vector : root.getChildrenFromFields()) {
  //   appendNodes(vector, nodes, buffers, variadicBufferCounts);
  // }
  appendNodes(root, nodes, buffers, variadicBufferCounts);
  return new ArrowRecordBatch(root.getValueCount(), nodes, buffers,
      CompressionUtil.createBodyCompression(codec), variadicBufferCounts, alignBuffers);
}
```
This fulfills my expectation (though I am not sure it is the right one), and it makes this
test case pass if I just pass the struct's schema. However, it breaks a series of other
tests in `RoundTrip`, `Dictionary`, and some other places.
So fixing it the way I preferred would be a breaking change, as far as I can tell.
Am I wrong here? Could we do something better?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]