[ 
https://issues.apache.org/jira/browse/IMPALA-5004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659179#comment-16659179
 ] 

Tim Armstrong commented on IMPALA-5004:
---------------------------------------

I agree that IMPALA-3471 in some form is the ideal solution but is a much 
larger chunk of work. Maybe there's a clever way to modify the sorter to trim 
sorted runs to limit+offset that would not be too invasive.

The TopN operator still has to process the whole input. If there are n rows in 
the input and k = min(n, offset + limit), TopN does O(n log k) work and 
requires O(k) memory. Sort does O(n log n) work and requires O(1) memory but 
O(n) combined disk+memory. I think if k is high enough the trade-off in favour 
of SORT is very favourable. Certainly if k == n it's very unfavourable.

WRT to the SORT being faster, I wonder if it's simply because the constant 
factor for the hybrid quick/insertation sort implementation is lower than the 
constant factor for the heap-based Top-N implementations. Maybe that suggests 
that there's room for optimising the TOP-N implementation.

DISABLE_OUTERMOST_TOPN doesn't quite solve the problem since it only applies to 
the outermost select block, not any subqueries (it's also a hidden developer 
option).

If we're concerned about regressions we could consider setting the threshold to 
a conservative value, e.g. 100s of MB. 

> Switch to sorting node for large TopN queries
> ---------------------------------------------
>
>                 Key: IMPALA-5004
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5004
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 2.9.0
>            Reporter: Lars Volker
>            Assignee: Sahil Takiar
>            Priority: Major
>
> As explained by [~tarmstrong] in IMPALA-4995:
> bq. We should also consider switching to the sort operator for large limits. 
> This allows it to spill. The memory requirements for TopN also are 
> problematic for large limits, since it would allocate large vectors that are 
> untracked and also require a large amount of contiguous memory.
> There's already logic to select TopN vs. Sort: 
> [planner/SingleNodePlanner.java#L289|https://github.com/apache/incubator-impala/blob/master/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L289]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to