sachouche commented on a change in pull request #1330: DRILL-6147: Adding
Columnar Parquet Batch Sizing functionality
URL: https://github.com/apache/drill/pull/1330#discussion_r198937814
##########
File path:
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLenBinaryReader.java
##########
@@ -90,22 +91,161 @@ public long readFields(long recordsToReadInThisPass) throws IOException {
recordsReadInCurrentPass = readRecordsInBulk((int) recordsToReadInThisPass);
}
+ // Publish this information
+ parentReader.readState.setValuesReadInCurrentPass((int) recordsReadInCurrentPass);
+
+ // Update the stats
parentReader.parquetReaderStats.timeVarColumnRead.addAndGet(timer.elapsed(TimeUnit.NANOSECONDS));
return recordsReadInCurrentPass;
}
private int readRecordsInBulk(int recordsToReadInThisPass) throws IOException {
- int recordsReadInCurrentPass = -1;
+ int batchNumRecords = recordsToReadInThisPass;
+ List<VarLenColumnBatchStats> columnStats = new ArrayList<VarLenColumnBatchStats>(columns.size());
+ int prevReadColumns = -1;
+ boolean overflowCondition = false;
+
+ for (VLColumnContainer columnContainer : orderedColumns) {
+ VarLengthColumn<?> columnReader = columnContainer.column;
+
+ // Read the column data
+ int readColumns = columnReader.readRecordsInBulk(batchNumRecords);
+ assert readColumns <= batchNumRecords : "Reader cannot return more values than requested..";
+
+ if (!overflowCondition) {
+ if (prevReadColumns >= 0 && prevReadColumns != readColumns) {
+ overflowCondition = true;
+ } else {
+ prevReadColumns = readColumns;
+ }
+ }
+
+ // Enqueue this column entry information to handle overflow conditions; we will not know
+ // whether an overflow happened till all variable length columns have been processed
+ columnStats.add(new VarLenColumnBatchStats(columnReader.valueVec, readColumns));
+ // Decrease the number of records to read when a column returns less records (minimize overflow)
+ if (batchNumRecords > readColumns) {
+ batchNumRecords = readColumns;
+ // it seems this column caused an overflow (higher layer will not ask for more values than remaining)
+ ++columnContainer.numCausedOverflows;
Review comment:
No, this is not the case:
- For the last entry, we will have the following state: overflowCondition = false, prevReadColumns = 200, readColumns = 100
- Since prevReadColumns >= 0 and prevReadColumns != readColumns, overflowCondition will be changed to true
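
The state described above can be traced with a minimal, standalone sketch of the overflow check from the diff. This is an illustration only, not the PR's code: the class name `OverflowSketch`, the method `detectsOverflow`, and the sample read counts are hypothetical, and the sketch takes the per-column read counts as given instead of calling a real column reader (so it does not model the shrinking `batchNumRecords`).

```java
// Hypothetical standalone sketch; only the inner if/else mirrors the diff above.
final class OverflowSketch {

    // Replays the overflow check over a sequence of per-column read counts.
    static boolean detectsOverflow(int[] readCounts) {
        int prevReadColumns = -1;
        boolean overflowCondition = false;

        for (int readColumns : readCounts) {
            if (!overflowCondition) {
                if (prevReadColumns >= 0 && prevReadColumns != readColumns) {
                    // A column returned a different count than the previous one
                    overflowCondition = true;
                } else {
                    prevReadColumns = readColumns;
                }
            }
        }
        return overflowCondition;
    }

    public static void main(String[] args) {
        // Assumed scenario: earlier columns read 200 values, the last one reads 100
        System.out.println(detectsOverflow(new int[] {200, 200, 100})); // prints "true"
        // All columns read the same number of values: no overflow detected
        System.out.println(detectsOverflow(new int[] {200, 200, 200})); // prints "false"
    }
}
```

Because the check runs before the column's stats are enqueued, a mismatch on the last column is still detected inside the loop.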