[jira] [Commented] (IMPALA-5004) Switch to sorting node for large TopN queries

Sahil Takiar (JIRA) Thu, 18 Oct 2018 15:11:15 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-5004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655956#comment-16655956
 ]


Sahil Takiar commented on IMPALA-5004:
--------------------------------------

Thinking about this some more, are we confident this is the approach we want to 
take? After discussing this patch with Lars a bit, he pointed out that the 
reason the Sort latency remains constant is that Impala is sorting the entire 
input dataset. I assumed there was some type of limit-pushdown optimization, 
but it seems that only applies for the TopN operator.

Some other concerns:

(1) Given that the whole dataset gets sorted during a Sort operator, if the 
threshold gets exceed users will experience a massive performance spike for 
large datasets; Impala will go from running the TopN operator (which only 
processes a subset of the data defined by the limit) to the Sort operator which 
sorts the entire input (if the input is huge, this can cause performance to 
tank)

(2) While running queries with large limits probably isn't common, there is 
nothing stopping a user from defining a limit in the millions of rows. Users 
could define a small limit, but big offset. Users could order by a large number 
of columns, or they could order by a few columns but select a large number. All 
of these could easily exceed the threshold value, which in this case would be 
very workload dependent.

(3) If users see issues with TopN, they can always set DISABLE_OUTERMOST_TOPN 
to true - this might be a better option than trying to dynamically make the 
decision, which could be error prone and cause perf regressions

We could try to be smarter about the decision to use TopN vs. Sort, but it 
sounds tricky to get right. We would have to take into account the following 
characteristics: sort can spill but TopN cannot, TopN is faster than sort for 
certain limits because TopN only has to process a small amount of the data, for 
larger limits sort is faster than TopN (we don't know why yet).

Other potential fixes include:

(1) Admission control could only schedule the query if there is sufficient 
memory for the TopN operator to run (this suggestion is based on my 
understanding of admission control, someone correct me if I am off the mark 
here)

(2) Make TopN spillable (IMPALA-3471)

> Switch to sorting node for large TopN queries
> ---------------------------------------------
>
>                 Key: IMPALA-5004
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5004
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 2.9.0
>            Reporter: Lars Volker
>            Assignee: Sahil Takiar
>            Priority: Major
>
> As explained by [~tarmstrong] in IMPALA-4995:
> bq. We should also consider switching to the sort operator for large limits. 
> This allows it to spill. The memory requirements for TopN also are 
> problematic for large limits, since it would allocate large vectors that are 
> untracked and also require a large amount of contiguous memory.
> There's already logic to select TopN vs. Sort: 
> [planner/SingleNodePlanner.java#L289|https://github.com/apache/incubator-impala/blob/master/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L289]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (IMPALA-5004) Switch to sorting node for large TopN queries

Reply via email to