[
https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336951#comment-16336951
]
ASF GitHub Bot commented on DRILL-6080:
---------------------------------------
Github user paul-rogers commented on a diff in the pull request:
https://github.com/apache/drill/pull/1090#discussion_r163456045
--- Diff:
exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/xsort/managed/TestSortImpl.java
---
@@ -466,10 +469,10 @@ public void runLargeSortTest(OperatorFixture fixture,
DataGenerator dataGen,
public void runJumboBatchTest(OperatorFixture fixture, int rowCount) {
timer.reset();
- DataGenerator dataGen = new DataGenerator(fixture, rowCount,
Character.MAX_VALUE);
- DataValidator validator = new DataValidator(rowCount,
Character.MAX_VALUE);
+ DataGenerator dataGen = new DataGenerator(fixture, rowCount,
ValueVector.MAX_ROW_COUNT);
--- End diff --
Well... As it turns out, `ValueVector.MAX_ROW_COUNT` is 64K, which is the
maximum size an SV2 can address. (An SV2 is 16 bits wide.) `Integer.MAX_VALUE`
is 2^31 - 1, which would require a 32-bit SV2, which we don't have. So, using
`Integer.MAX_VALUE` would cause the test to fail as the sorter could not sort
batches larger than 64K...
Previously we used `Character.MAX_VALUE`, but it is not intuitively
obvious that our batch size should be correlated to the size of Java's UTF-16
character encoding... And, in fact, the original bug is that they are not
correlated: `Character.MAX_VALUE` is 65535, while `ValueVector.MAX_ROW_COUNT` is
65536. As a result, we were not testing the full-batch corner case.
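The off-by-one can be checked with plain Java. This is only a sketch: `MAX_ROW_COUNT` below is a local stand-in that assumes the 65536 value quoted above for `ValueVector.MAX_ROW_COUNT`, and `RowLimitDemo` is a hypothetical class name, not part of Drill.

```java
// Sketch only: compares the constants discussed in the comment above.
final class RowLimitDemo {
    // Assumed stand-in for ValueVector.MAX_ROW_COUNT (64K rows).
    static final int MAX_ROW_COUNT = 65536;

    // Number of distinct row indexes a 16-bit selection vector can hold.
    static int sv2Addressable() {
        return 1 << 16;
    }

    public static void main(String[] args) {
        // Cast to int so the char prints as a number, not as a character.
        System.out.println("Character.MAX_VALUE = " + (int) Character.MAX_VALUE);
        System.out.println("MAX_ROW_COUNT       = " + MAX_ROW_COUNT);
        System.out.println("SV2 addressable     = " + sv2Addressable());
        System.out.println("Integer.MAX_VALUE   = " + Integer.MAX_VALUE);
    }
}
```

The largest valid row index (65535) equals `Character.MAX_VALUE`, but the largest valid row count is one more than that, which is why the two constants are easy to confuse.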
> Sort incorrectly limits batch size to 65535 records rather than 65536
> ---------------------------------------------------------------------
>
> Key: DRILL-6080
> URL: https://issues.apache.org/jira/browse/DRILL-6080
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.12.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Minor
> Fix For: 1.13.0
>
>
> Drill places an upper limit on the number of rows in a batch of 64K. That is
> 65,536 decimal. When we index records, the indexes run from 0 to 64K-1 or 0
> to 65,535.
> The sort code incorrectly uses {{Character.MAX_VALUE}} as the maximum row
> count. So, if an incoming batch uses the full 64K size, sort ends up
> splitting batches unnecessarily.
> The fix is to instead use the correct constant {{ValueVector.MAX_ROW_COUNT}}.
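To illustrate the unnecessary split described in the issue, here is a minimal sketch. It assumes the sort simply caps each output batch at a fixed row limit. `batchesNeeded` and `BatchSplitDemo` are hypothetical names for illustration, not Drill's actual sort code.

```java
// Sketch only: how an off-by-one row limit turns one full batch into two.
final class BatchSplitDemo {
    // Ceiling division: batches required to hold rowCount rows
    // when each batch holds at most maxRowsPerBatch rows.
    static int batchesNeeded(int rowCount, int maxRowsPerBatch) {
        return (rowCount + maxRowsPerBatch - 1) / maxRowsPerBatch;
    }

    public static void main(String[] args) {
        int fullBatch = 65536; // a full 64K-row incoming batch

        // Limit of Character.MAX_VALUE (65535): one row spills into a second batch.
        System.out.println(batchesNeeded(fullBatch, Character.MAX_VALUE));

        // Limit of 65536: the full batch fits in a single batch.
        System.out.println(batchesNeeded(fullBatch, 65536));
    }
}
```

With the 65535 limit, the second batch would carry a single row, which is the unnecessary split the issue describes.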
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)