[GitHub] sachouche commented on a change in pull request #1330: DRILL-6147: Adding Columnar Parquet Batch Sizing functionality

GitBox Thu, 28 Jun 2018 19:17:04 -0700

sachouche commented on a change in pull request #1330: DRILL-6147: Adding 
Columnar Parquet Batch Sizing functionality
URL: https://github.com/apache/drill/pull/1330#discussion_r198935830


 ##########
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/BatchReader.java
 ##########
 @@ -132,16 +123,20 @@ public FixedWidthReader(ReadState readState) {
       super(readState);
     }
 
-    @Override
-    protected long getReadCount(ColumnReader<?> firstColumnStatus) {
-      return Math.min(readState.schema().getRecordsPerBatch(),
-                      firstColumnStatus.columnChunkMetaData.getValueCount() - 
firstColumnStatus.totalValuesRead);
-    }
-
     @Override
     protected int readRecords(ColumnReader<?> firstColumnStatus, long 
recordsToRead) throws Exception {
       readAllFixedFields(recordsToRead);
-      return firstColumnStatus.getRecordsReadInCurrentPass();
+
+      if (firstColumnStatus != null) {
+        
readState.setValuesReadInCurrentPass(firstColumnStatus.getRecordsReadInCurrentPass());
+      } else {
+        // No rows to return if there are no columns
+        readState.setValuesReadInCurrentPass(0);
+      }
+
+      readState.updateCounts((int) recordsToRead);
 
 Review comment:
   I was trying to handle the following case:
   - Assume query attempts to read columns 'a' and 'b'
   - The first few parquet files contain these columns
   - But one of the input file doesn't have 'a' and 'b'
   - The correct behavior is to honor the total number of rows albeit filling 
nulls for the missing columns
   
   After your comment, I did some digging and notice this situation cannot 
arise with a Fixed or Variable length reader:
   - The Parquet record reader catches this occurrence early on and instead 
uses the Mock reader which will implement what I tried to implement in the 
Fixed reader
   - I have removed the added logic (for handling a null firstColumnStatus) and 
added a check that there should be at least one column to process (that is, 
firstColumnStatus should never be null)
   - I am re-running the tests to ensure that my understanding is correct

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

[GitHub] sachouche commented on a change in pull request #1330: DRILL-6147: Adding Columnar Parquet Batch Sizing functionality

Reply via email to