Github user paul-rogers commented on a diff in the pull request:
https://github.com/apache/drill/pull/1090#discussion_r164282435
--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/xsort/managed/TestSortImpl.java ---
@@ -466,10 +469,10 @@ public void runLargeSortTest(OperatorFixture fixture, DataGenerator dataGen,
   public void runJumboBatchTest(OperatorFixture fixture, int rowCount) {
     timer.reset();
-    DataGenerator dataGen = new DataGenerator(fixture, rowCount, Character.MAX_VALUE);
-    DataValidator validator = new DataValidator(rowCount, Character.MAX_VALUE);
+    DataGenerator dataGen = new DataGenerator(fixture, rowCount, ValueVector.MAX_ROW_COUNT);
--- End diff ---
Thanks Vlad. I double-checked everything.
`testLargeBatch()` passes to `runJumboBatchTest()` the total number of
records for the test. It passes the maximum, which is `ValueVector.MAX_ROW_COUNT`:
```
runJumboBatchTest(fixture, ValueVector.MAX_ROW_COUNT);
```
Then, `runJumboBatchTest()` sets up a test of *n* rows with *m* rows per
batch. (This allows testing multiple batches with a small number of rows
each.) Here, we want one batch with the maximum number of rows, so we pass:
```
DataGenerator dataGen = new DataGenerator(fixture, rowCount,
    ValueVector.MAX_ROW_COUNT);
DataValidator validator = new DataValidator(rowCount,
    ValueVector.MAX_ROW_COUNT);
```
That is, a row count of 64K (passed in) in batches of 64K. The result is a
single batch of the maximum possible row count.
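In other words (hypothetical row counts, just to show how the two numbers
pair up):
```
// Many-batch case: 10,000 total rows in batches of 1,000 gives 10 batches.
DataGenerator manyBatches = new DataGenerator(fixture, 10_000, 1_000);
// Jumbo case: 64K total rows in batches of 64K gives exactly one batch.
DataGenerator oneJumbo = new DataGenerator(fixture,
    ValueVector.MAX_ROW_COUNT, ValueVector.MAX_ROW_COUNT);
```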
Next, `DataGenerator` records these two numbers. Previously it was a bit
forgiving: it silently clamped the batch size to the allowed maximum:
```
this.batchSize = Math.min(batchSize, ValueVector.MAX_ROW_COUNT);
```
I see that this causes confusion, and is probably a bit too lenient for a
test. So, I've enforced the limit instead:
```
Preconditions.checkArgument(batchSize > 0 && batchSize <=
    ValueVector.MAX_ROW_COUNT);
this.batchSize = batchSize;
```
I made the same change to the `DataValidator` class.
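To make the new contract concrete, here is a hedged sketch of what a caller
now sees (assuming a JUnit 4 test with the usual `fixture` in scope, as
elsewhere in `TestSortImpl`; the test name is made up for illustration):
```
@Test
public void testDataGeneratorRejectsOversizedBatch() {
  try {
    // One past the limit: previously clamped silently, now rejected.
    new DataGenerator(fixture, 1, ValueVector.MAX_ROW_COUNT + 1);
    fail("Expected IllegalArgumentException for an oversized batch size");
  } catch (IllegalArgumentException e) {
    // Expected: batchSize must be in (0, ValueVector.MAX_ROW_COUNT].
  }
}
```
(Guava's `Preconditions.checkArgument()` throws `IllegalArgumentException`
on failure, so a bad batch size now fails fast instead of being masked.)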
We also discussed the SV4 class. What may not be entirely obvious there is
that the length of an SV4 can often exceed the batch size limit. We have two
kinds of indirection vectors (AKA "selection vectors"). The SV2 is an
indirection on top of a single batch, and so its length is limited to
`ValueVector.MAX_ROW_COUNT`. But the SV4 "glues together" multiple batches,
and so its length is limited only by the maximum vector length.
(Hard-coded at 2 GB, though ideally it would be limited to 16 MB.)
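To make the distinction concrete, here is a small illustrative sketch (not
the real classes; Drill's actual implementations are `SelectionVector2` and
`SelectionVector4`). An SV4 entry packs a batch number in the high 16 bits
and a row offset in the low 16 bits:
```
// Illustrative only: decode an SV4-style entry into (batch, row).
public class Sv4Sketch {
  // High 16 bits select the batch within the hyper-batch.
  static int batchIndex(int sv4Entry) { return sv4Entry >>> 16; }
  // Low 16 bits select the row within that batch.
  static int rowOffset(int sv4Entry) { return sv4Entry & 0xFFFF; }

  public static void main(String[] args) {
    // An entry addressing row 17 of batch 3:
    int entry = (3 << 16) | 17;
    System.out.println(batchIndex(entry)); // prints 3
    System.out.println(rowOffset(entry));  // prints 17
  }
}
```
Each *entry* is 16-bit limited per field, but the SV4 itself holds one entry
per row across many batches, which is why its length is bounded by vector
size rather than by the per-batch row limit.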
Further, as your work showed, we probably don't want a huge SV4 for other
reasons. A primary use of the SV4 is to merge a collection of batches to
produce a set of merged output batches. You showed that merge performance
suffers above a certain input width, and that width is far below the
theoretical maximum.
So, I *think* I've got everything covered. If I'm overlooking anything,
please do point it out so I can fix it.
---