Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/906#discussion_r135373441
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java ---
    @@ -152,97 +157,75 @@ public void kill(boolean sendUpstream) {
         }
       }
     
    -  private void releaseAssets() {
    -    container.zeroVectors();
    -  }
    -
    -  private void clearFieldVectorMap() {
    -    for (final ValueVector v : mutator.fieldVectorMap().values()) {
    -      v.clear();
    -    }
    -  }
    -
       @Override
       public IterOutcome next() {
         if (done) {
           return IterOutcome.NONE;
         }
         oContext.getStats().startProcessing();
         try {
    -      try {
    -        injector.injectChecked(context.getExecutionControls(), "next-allocate", OutOfMemoryException.class);
    -
    -        currentReader.allocate(mutator.fieldVectorMap());
    -      } catch (OutOfMemoryException e) {
    -        clearFieldVectorMap();
    -        throw UserException.memoryError(e).build(logger);
    -      }
    -      while ((recordCount = currentReader.next()) == 0) {
    +      while (true) {
             try {
    -          if (!readers.hasNext()) {
    -            // We're on the last reader, and it has no (more) rows.
    -            currentReader.close();
    -            releaseAssets();
    -            done = true;  // have any future call to next() return NONE
    -
    -            if (mutator.isNewSchema()) {
    -              // This last reader has a new schema (e.g., we have a zero-row
    -              // file or other source).  (Note that some sources have a non-
    -              // null/non-trivial schema even when there are no rows.)
    +          injector.injectChecked(context.getExecutionControls(), "next-allocate", OutOfMemoryException.class);
    +          currentReader.allocate(mutator.fieldVectorMap());
    +        } catch (OutOfMemoryException e) {
    +          clearFieldVectorMap();
    +          throw UserException.memoryError(e).build(logger);
    +        }
     
    -              container.buildSchema(SelectionVectorMode.NONE);
    -              schema = container.getSchema();
    +        recordCount = currentReader.next();
    +        Preconditions.checkArgument(recordCount >= 0,
    +            "recordCount from RecordReader.next() should not be negative");
     
    -              return IterOutcome.OK_NEW_SCHEMA;
    -            }
    -            return IterOutcome.NONE;
    -          }
    -          // At this point, the reader that hit its end is not the last reader.
    +        boolean isNewRegularSchema = mutator.isNewSchema();
    +        // We should skip the reader, when recordCount = 0 && ! isNewRegularSchema.
    +        // Add/set implicit column vectors, only when reader gets > 0 row, or
    +        // when reader gets 0 row but with a schema with new field added
    +        if (recordCount > 0 || isNewRegularSchema) {
    +          addImplicitVectors();
    +          populateImplicitVectors();
    +        }
     
    -          // If all the files we have read so far are just empty, the schema is not useful
    -          if (! hasReadNonEmptyFile) {
    -            container.clear();
    -            clearFieldVectorMap();
    -            mutator.clear();
    -          }
    +        boolean isNewImplicitSchema = mutator.isNewSchema();
    --- End diff --
    
    You are correct that implicit columns such as _filename_ are fixed, and only the _dir0_, _dir1_, ... columns can change.
    
    Previously, I thought along the same lines, in terms of how Drill handles _a/b/c.csv_ and _a/b/d/e.csv_: _dir2_ exists only for the second file, not for the first, and hence could lead to a schema change. It turns out there is code to align the dir list [1]. As a result, the schema of the implicit vectors should remain identical across the record readers. In other words, we do not have to consider the implicit vectors when detecting schema changes in the scan batch.
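
    To illustrate the alignment idea, here is a rough, self-contained sketch (the helper name and types below are made up for illustration; the real logic is in ColumnExplorer, see [1]): if every file's partition dirs are padded out to the maximum depth, all readers end up with the same _dir0_.._dirN_ columns.

    // Hypothetical sketch of the dir-list alignment idea; not the actual
    // ColumnExplorer code referenced in [1].
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class DirAlignmentSketch {

      // Returns one map of partition columns (dir0, dir1, ...) per file, all with
      // the identical key set, using null for levels a shallower file does not have.
      static List<Map<String, String>> alignDirColumns(List<String[]> dirsPerFile) {
        int maxDepth = 0;
        for (String[] dirs : dirsPerFile) {
          maxDepth = Math.max(maxDepth, dirs.length);
        }
        List<Map<String, String>> aligned = new ArrayList<>();
        for (String[] dirs : dirsPerFile) {
          Map<String, String> cols = new LinkedHashMap<>();
          for (int i = 0; i < maxDepth; i++) {
            cols.put("dir" + i, i < dirs.length ? dirs[i] : null);
          }
          aligned.add(cols);
        }
        return aligned;
      }

      public static void main(String[] args) {
        // a/b/c.csv has partition dirs [a, b]; a/b/d/e.csv has [a, b, d].
        List<Map<String, String>> aligned = alignDirColumns(Arrays.asList(
            new String[] {"a", "b"},
            new String[] {"a", "b", "d"}));
        // Both maps now contain dir0, dir1 and dir2 (dir2 is null for the first
        // file), so the implicit schema is identical across the two readers.
        System.out.println(aligned);
      }
    }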
    
    In the revised patch, I dropped the isNewImplicitSchema logic, since the implicit schema is constant. I also added a check in ScanBatch's constructor to make sure the implicit column list is either empty or remains constant across all record readers, and to throw a setup exception if that check fails.
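
    For the constructor-time check, the shape I have in mind is roughly the following (names and types are illustrative only, not the actual patch): the implicit-column maps must either all be empty or all share the same key set across record readers, and the constructor fails setup otherwise.

    // Illustrative sketch only; in the actual patch this check would live in
    // ScanBatch's constructor and raise a setup exception.
    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class ImplicitColumnCheckSketch {

      // Throws if the implicit column keys differ between readers.
      static void verifyImplicitColumns(List<Map<String, String>> implicitColsPerReader) {
        if (implicitColsPerReader == null || implicitColsPerReader.isEmpty()) {
          return;  // nothing to verify
        }
        Set<String> expectedKeys = implicitColsPerReader.get(0).keySet();
        for (Map<String, String> cols : implicitColsPerReader) {
          if (!cols.keySet().equals(expectedKeys)) {
            throw new IllegalStateException(
                "Implicit column schema must be identical for all readers: expected "
                    + expectedKeys + " but found " + cols.keySet());
          }
        }
      }

      public static void main(String[] args) {
        Map<String, String> r1 = new LinkedHashMap<>();
        r1.put("filename", "c.csv");
        r1.put("dir0", "a");
        Map<String, String> r2 = new LinkedHashMap<>();
        r2.put("filename", "e.csv");
        r2.put("dir0", "a");
        verifyImplicitColumns(Arrays.asList(r1, r2));  // passes: identical key sets
      }
    }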
    
    1. https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/ColumnExplorer.java#L145-L147
 

