[
https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149825#comment-16149825
]
Paul Rogers commented on DRILL-5758:
------------------------------------
It turns out that {{RecordBatchSizer}} contained a bug in how it sized repeated (array) columns.
Consider the original output:
{code}
rms.mapvalue.col2(type: REPEATED BIGINT, count: 1, total entries: 1,
per-array: 1, std size: 8, actual size: 52, data size: 52)
...
Records: 4096, Total size: 1441792, Data size: 376615, Gross row width: 352,
Net row width: 92, Density: 27}
{code}
In the above, {{col2}} is a repeated column, but the sizer reports a count of 1 and only 1 entry per array, even though the batch holds 4096 records.
Output after the fix:
{code}
rms.mapvalue.col2(type: REPEATED BIGINT, count: 4096, elements: 12288,
per-array: 3, std size: 8, actual size: 28, data size: 114688)
...
Records: 4096, Total size: 1441792, Data size: 1136848, Gross row width: 352,
Net row width: 278, Density: 79}
{code}
Note that the (average) number of elements per array is now 3, and the estimated "net" row
width has grown from 92 to 278 (with density rising from 27 to 79).
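As a cross-check, the figures in the two outputs can be reproduced with simple arithmetic. The sketch below is not the {{RecordBatchSizer}} code: the class and method names are hypothetical, and both the round-up division and the 4-byte offset entry per row are assumptions, chosen because they match the logged numbers.

```java
public class BatchSizeArithmetic {
    // Integer division rounded up. Assumption: the logged widths and densities
    // are consistent with rounding up rather than rounding to nearest.
    static long ceilDiv(long num, long denom) {
        return (num + denom - 1) / denom;
    }

    // Assumed per-row width of a repeated fixed-width column:
    // the array values plus one 4-byte offset entry per row.
    static long repeatedRowWidth(long elementsPerArray, long valueWidth) {
        final long offsetWidth = 4; // assumption, matches "actual size: 28"
        return elementsPerArray * valueWidth + offsetWidth;
    }

    // Average bytes of actual data per record.
    static long netRowWidth(long dataSize, long records) {
        return ceilDiv(dataSize, records);
    }

    // Share of the allocated batch memory that holds actual data, in percent.
    static long densityPercent(long dataSize, long totalSize) {
        return ceilDiv(dataSize * 100, totalSize);
    }

    public static void main(String[] args) {
        long records = 4096;
        long totalSize = 1441792;

        // col2 after the fix: 12288 elements spread over 4096 rows.
        long perArray = 12288 / records;
        long rowWidth = repeatedRowWidth(perArray, 8);
        System.out.println("col2 per-array: " + perArray
            + ", actual size: " + rowWidth
            + ", data size: " + rowWidth * records);

        // Batch-level figures before and after the fix.
        System.out.println("gross row width: " + ceilDiv(totalSize, records));
        System.out.println("before fix: net row width "
            + netRowWidth(376615, records)
            + ", density " + densityPercent(376615, totalSize));
        System.out.println("after fix: net row width "
            + netRowWidth(1136848, records)
            + ", density " + densityPercent(1136848, totalSize));
    }
}
```

Under these assumptions, the per-column data size (28 bytes x 4096 rows = 114688) and the batch-level net row width and density agree with the logged values both before and after the fix.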
The result is much better vector size estimates and no vector reallocations:
{code}
Initial output batch allocation: 811008 bytes, 3771 records
<Note no vector resizes here.>
Took 4438 us to merge 3771 records, consuming 811008 bytes of memory
{code}
And now the sort completes:
{code}
Results: 4,000,000 records, 63 batches
{code}
> Rollup of external sort fixes to issues found by QA
> ---------------------------------------------------
>
> Key: DRILL-5758
> URL: https://issues.apache.org/jira/browse/DRILL-5758
> Project: Apache Drill
> Issue Type: Task
> Affects Versions: 1.12.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Fix For: 1.12.0
>
>
> Tracking JIRA to be used for the PR that combines fixes for various JIRA
> entries. Bugs fixed in this task are given by the linked issues.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)