Paul Rogers created DRILL-5635:
----------------------------------
Summary: External sort need not buffer two incoming batches; just
two spill files' worth
Key: DRILL-5635
URL: https://issues.apache.org/jira/browse/DRILL-5635
Project: Apache Drill
Issue Type: Improvement
Affects Versions: 1.10.0
Reporter: Paul Rogers
Assignee: Paul Rogers
Priority: Minor
The external sort has long had the rule that it must have space in memory to
buffer two incoming batches. This rule stems from the merge/sort algorithm that
requires a minimum of two *records* per file to make progress.
But, Drill is batch-based, not record-based. Drill must spill at least two
records (preferably much more) to make progress.
The original code confused these issues and required room for two incoming
batches. This restriction can be softened. In the worst case, Drill only needs
two records: either one batch of (at least) two records, or two batches of one
record each.
The difference in rule will better handle the case of very large incoming
batches relative to memory available for sort. Suppose an incoming batch is 500
MB in size and the sort has 750 MB available. The code today fails the sort
because two batches cannot fit in memory. But, the sort, by default, creates
spill files of 250 MB. Since a single input batch can fill two spill files,
there is no reason to wait for a second batch before spilling.
Special care is needed to handle this case. Divide the "jumbo" batch into spill
files of near-equal size, but each must contain at least two records.
The only limit occurs if individual records (assuming one record per batch)
exceed 25% of memory. (Because, today, the sort copies records prior to
spilling. Another JIRA will address that fault.)
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)