pavibhai commented on code in PR #1482:
URL: https://github.com/apache/orc/pull/1482#discussion_r1181965794
##########
java/core/src/java/org/apache/orc/impl/ParserUtils.java:
##########
@@ -459,10 +459,31 @@ public static ColumnVector[]
findColumnVectors(TypeDescription schema,
boolean isCaseSensitive,
VectorizedRowBatch batch) {
List<String> names = ParserUtils.splitName(source);
- ColumnFinder result = new ColumnFinder(schema, batch, names.size());
+ ColumnFinder result = new ColumnFinder(schema,
removeAcidColumnsFromVectorizedRowBatch(schema, batch), names.size());
findColumn(removeAcid(schema), names, isCaseSensitive, result);
return result.result;
}
+
Review Comment:
I would think we should go a different route. My proposal would be something
like.
If the schema is acid schema then when you are looking up column names
prefix them with `row.` and just use the full schema. This will ensure that we
respect the nesting of the vectors under the struct column vector related to
`row`
##########
java/core/src/test/org/apache/orc/TestOrcFilterContext.java:
##########
@@ -225,4 +230,12 @@ public void testRepeatingVector() {
assertTrue(OrcFilterContext.isNull(vectorBranch, 1));
assertTrue(OrcFilterContext.isNull(vectorBranch, 2));
}
+
+ @Test
+ public void testACIDTable(){
Review Comment:
I would also add an actual filter test with data to ensure that the behavior
is as expected.
##########
java/core/src/java/org/apache/orc/impl/ParserUtils.java:
##########
@@ -459,10 +459,31 @@ public static ColumnVector[]
findColumnVectors(TypeDescription schema,
boolean isCaseSensitive,
VectorizedRowBatch batch) {
List<String> names = ParserUtils.splitName(source);
- ColumnFinder result = new ColumnFinder(schema, batch, names.size());
+ ColumnFinder result = new ColumnFinder(schema,
removeAcidColumnsFromVectorizedRowBatch(schema, batch), names.size());
findColumn(removeAcid(schema), names, isCaseSensitive, result);
return result.result;
}
+
+ private static VectorizedRowBatch
removeAcidColumnsFromVectorizedRowBatch(TypeDescription schema,
VectorizedRowBatch batch) {
+ final int acidColumnOffset = 5;
+ if(SchemaEvolution.checkAcidSchema(schema)) {
+ VectorizedRowBatch vectorizedRowBatch = new
VectorizedRowBatch(batch.cols.length - acidColumnOffset, batch.size);
+ System.arraycopy(batch.selected, 0, vectorizedRowBatch.selected, 0,
vectorizedRowBatch.selected.length);
+ System.arraycopy(batch.projectedColumns, acidColumnOffset,
vectorizedRowBatch.projectedColumns, 0, batch.projectedColumns.length -
acidColumnOffset);
+ vectorizedRowBatch.setPartitionInfo(batch.getDataColumnCount(),
batch.getPartitionColumnCount());
+ vectorizedRowBatch.selectedInUse = batch.selectedInUse;
+ vectorizedRowBatch.endOfFile = batch.endOfFile;
+
+ StructColumnVector structColumnVector =
(StructColumnVector)(batch.cols[acidColumnOffset]);
+ ColumnVector[] columnVectors = new
ColumnVector[structColumnVector.fields.length];
+ for (int i = 0; i < columnVectors.length; i++) {
+ columnVectors[i] = structColumnVector.fields[i];
+ }
+ vectorizedRowBatch.cols = columnVectors;
+ return vectorizedRowBatch;
+ }
+ return batch;
Review Comment:
I might be missing the logic of where we are updating back the changes that
are not referenced example batch.size back to the batch after the filter
processing is complete.
I go back to the earlier comment that we should do this differently.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]