[
https://issues.apache.org/jira/browse/IMPALA-5004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655956#comment-16655956
]
Sahil Takiar commented on IMPALA-5004:
--------------------------------------
Thinking about this some more, are we confident this is the approach we want to
take? After discussing this patch with Lars a bit, he pointed out that the
reason the Sort latency remains constant is that Impala is sorting the entire
input dataset. I assumed there was some type of limit-pushdown optimization,
but it seems that only applies for the TopN operator.
Some other concerns:
(1) Given that the whole dataset gets sorted during a Sort operator, if the
threshold gets exceed users will experience a massive performance spike for
large datasets; Impala will go from running the TopN operator (which only
processes a subset of the data defined by the limit) to the Sort operator which
sorts the entire input (if the input is huge, this can cause performance to
tank)
(2) While running queries with large limits probably isn't common, there is
nothing stopping a user from defining a limit in the millions of rows. Users
could define a small limit, but big offset. Users could order by a large number
of columns, or they could order by a few columns but select a large number. All
of these could easily exceed the threshold value, which in this case would be
very workload dependent.
(3) If users see issues with TopN, they can always set DISABLE_OUTERMOST_TOPN
to true - this might be a better option than trying to dynamically make the
decision, which could be error prone and cause perf regressions
We could try to be smarter about the decision to use TopN vs. Sort, but it
sounds tricky to get right. We would have to take into account the following
characteristics: sort can spill but TopN cannot, TopN is faster than sort for
certain limits because TopN only has to process a small amount of the data, for
larger limits sort is faster than TopN (we don't know why yet).
Other potential fixes include:
(1) Admission control could only schedule the query if there is sufficient
memory for the TopN operator to run (this suggestion is based on my
understanding of admission control, someone correct me if I am off the mark
here)
(2) Make TopN spillable (IMPALA-3471)
> Switch to sorting node for large TopN queries
> ---------------------------------------------
>
> Key: IMPALA-5004
> URL: https://issues.apache.org/jira/browse/IMPALA-5004
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: Impala 2.9.0
> Reporter: Lars Volker
> Assignee: Sahil Takiar
> Priority: Major
>
> As explained by [~tarmstrong] in IMPALA-4995:
> bq. We should also consider switching to the sort operator for large limits.
> This allows it to spill. The memory requirements for TopN also are
> problematic for large limits, since it would allocate large vectors that are
> untracked and also require a large amount of contiguous memory.
> There's already logic to select TopN vs. Sort:
> [planner/SingleNodePlanner.java#L289|https://github.com/apache/incubator-impala/blob/master/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L289]
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]