[
https://issues.apache.org/jira/browse/IMPALA-5004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659179#comment-16659179
]
Tim Armstrong commented on IMPALA-5004:
---------------------------------------
I agree that IMPALA-3471 in some form is the ideal solution but is a much
larger chunk of work. Maybe there's a clever way to modify the sorter to trim
sorted runs to limit+offset that would not be too invasive.
The TopN operator still has to process the whole input. If there are n rows in
the input and k = min(n, offset + limit), TopN does O(n log k) work and
requires O(k) memory. Sort does O(n log n) work and requires O(1) memory but
O(n) combined disk+memory. I think if k is high enough the trade-off in favour
of SORT is very favourable. Certainly if k == n it's very unfavourable.
WRT to the SORT being faster, I wonder if it's simply because the constant
factor for the hybrid quick/insertation sort implementation is lower than the
constant factor for the heap-based Top-N implementations. Maybe that suggests
that there's room for optimising the TOP-N implementation.
DISABLE_OUTERMOST_TOPN doesn't quite solve the problem since it only applies to
the outermost select block, not any subqueries (it's also a hidden developer
option).
If we're concerned about regressions we could consider setting the threshold to
a conservative value, e.g. 100s of MB.
> Switch to sorting node for large TopN queries
> ---------------------------------------------
>
> Key: IMPALA-5004
> URL: https://issues.apache.org/jira/browse/IMPALA-5004
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: Impala 2.9.0
> Reporter: Lars Volker
> Assignee: Sahil Takiar
> Priority: Major
>
> As explained by [~tarmstrong] in IMPALA-4995:
> bq. We should also consider switching to the sort operator for large limits.
> This allows it to spill. The memory requirements for TopN also are
> problematic for large limits, since it would allocate large vectors that are
> untracked and also require a large amount of contiguous memory.
> There's already logic to select TopN vs. Sort:
> [planner/SingleNodePlanner.java#L289|https://github.com/apache/incubator-impala/blob/master/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L289]
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]