Github user paul-rogers commented on a diff in the pull request:
https://github.com/apache/drill/pull/1090#discussion_r164282435
--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/xsort/managed/TestSortImpl.java ---
@@ -466,10 +469,10 @@ public void runLargeSortTest(OperatorFixture fixture, DataGenerator dataGen,
   public void runJumboBatchTest(OperatorFixture fixture, int rowCount) {
     timer.reset();
-    DataGenerator dataGen = new DataGenerator(fixture, rowCount, Character.MAX_VALUE);
-    DataValidator validator = new DataValidator(rowCount, Character.MAX_VALUE);
+    DataGenerator dataGen = new DataGenerator(fixture, rowCount, ValueVector.MAX_ROW_COUNT);
--- End diff ---
Thanks Vlad. I double-checked everything.
`testLargeBatch()` passes to `runJumboBatchTest()` the total number of
records for the test. It passes the maximum, which is `ValueVector.MAX_ROW_COUNT`:
```
runJumboBatchTest(fixture, ValueVector.MAX_ROW_COUNT);
```
Then, `runJumboBatchTest()` sets up a test of *n* rows with *m* rows per
batch. (This allows testing multiple batches with a small number of rows
each.) Here, we want one batch with the maximum number of rows, so we pass:
```
DataGenerator dataGen = new DataGenerator(fixture, rowCount,
    ValueVector.MAX_ROW_COUNT);
DataValidator validator = new DataValidator(rowCount,
    ValueVector.MAX_ROW_COUNT);
```
That is, a row count of 64K (passed in) in batches of 64K. The result is a
single batch of the maximum possible row count.
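In other words (hypothetical row counts, just to show how the two numbers
pair up):
```
// Many-batch case: 10,000 total rows in batches of 1,000 gives 10 batches.
DataGenerator manyBatches = new DataGenerator(fixture, 10_000, 1_000);
// Jumbo case: 64K total rows in batches of 64K gives exactly one batch.
DataGenerator oneJumbo = new DataGenerator(fixture,
    ValueVector.MAX_ROW_COUNT, ValueVector.MAX_ROW_COUNT);
```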
Next, `DataGenerator` records these two numbers. Previously it was a bit
forgiving: it silently clamped the batch size to the allowed maximum:
```
this.batchSize = Math.min(batchSize, ValueVector.MAX_ROW_COUNT);
```
I see that this causes confusion, and is probably a bit too lenient for a
test. So, I've enforced the limit instead:
```
Preconditions.checkArgument(batchSize > 0 && batchSize <=
    ValueVector.MAX_ROW_COUNT);
this.batchSize = batchSize;
```
I made the same change to the `DataValidator` class.
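To make the new contract concrete, here is a hedged sketch of what a caller
now sees (assuming a JUnit 4 test with the usual `fixture` in scope, as
elsewhere in `TestSortImpl`; the test name is made up for illustration):
```
@Test
public void testDataGeneratorRejectsOversizedBatch() {
  try {
    // One past the limit: previously clamped silently, now rejected.
    new DataGenerator(fixture, 1, ValueVector.MAX_ROW_COUNT + 1);
    fail("Expected IllegalArgumentException for an oversized batch size");
  } catch (IllegalArgumentException e) {
    // Expected: batchSize must be in (0, ValueVector.MAX_ROW_COUNT].
  }
}
```
(Guava's `Preconditions.checkArgument()` throws `IllegalArgumentException`
on failure, so a bad batch size now fails fast instead of being masked.)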
We also discussed the SV4 class. What may not be entirely obvious there is
that the length of an SV4 can often exceed the batch size limit. We have two
kinds of indirection vectors (AKA "selection vectors"). The SV2 is an
indirection on top of a single batch, and so its length is limited to
`ValueVector.MAX_ROW_COUNT`. But the SV4 "glues together" multiple batches,
and so its length is limited only by the maximum vector length.
(Hard-coded at 2 GB, though ideally it would be limited to 16 MB.)
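To make the distinction concrete, here is a small illustrative sketch (not
the real classes; Drill's actual implementations are `SelectionVector2` and
`SelectionVector4`). An SV4 entry packs a batch number in the high 16 bits
and a row offset in the low 16 bits:
```
// Illustrative only: decode an SV4-style entry into (batch, row).
public class Sv4Sketch {
  // High 16 bits select the batch within the hyper-batch.
  static int batchIndex(int sv4Entry) { return sv4Entry >>> 16; }
  // Low 16 bits select the row within that batch.
  static int rowOffset(int sv4Entry) { return sv4Entry & 0xFFFF; }

  public static void main(String[] args) {
    // An entry addressing row 17 of batch 3:
    int entry = (3 << 16) | 17;
    System.out.println(batchIndex(entry)); // prints 3
    System.out.println(rowOffset(entry));  // prints 17
  }
}
```
Each *entry* is 16-bit limited per field, but the SV4 itself holds one entry
per row across many batches, which is why its length is bounded by vector
size rather than by the per-batch row limit.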
Further, as your work showed, we probably don't want a huge SV4 for other
reasons. A primary use of the SV4 is to merge a collection of batches to
produce a set of merged output batches. You showed that merge performance
suffers above a certain input width, and that width is far below the
theoretical maximum.
So, I *think* I've got everything covered. If I'm overlooking anything,
please do point it out so I can fix it.
---