[ 
https://issues.apache.org/jira/browse/IMPALA-5004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654154#comment-16654154
 ] 

Sahil Takiar commented on IMPALA-5004:
--------------------------------------

Did a bit of benchmarking to find a good default threshold and it looks like 
96MB is a good value. Here is how I came to that value:

First, I generated some dummy data, essentially a table with 50,000,000 rows 
with a bunch of random integers.
{code:java}
use tpcds_parquet;
create table tmp stored as parquet as select cast(rand() * 1000000000 as 
bigint) as col1, cast(rand() * 1000000000 as bigint) as col2
from store_returns, store_sales where store_returns.sr_item_sk = 
store_sales.ss_item_sk limit 50000000;
{code}
Then I ran the following query with various values for the offset:
{code:java}
 select * from tmp order by col1, col2 limit 100 offset ? {code}
I measured the latency for the query with varying offsets when either a topn 
operator was used or a sort operator.
||Offset||TopN Latency(s)||Sort Latency (s)||
|1000000|8|34|
|2000000|14|34|
|3000000|20|35|
|4000000|25|35|
|5000000|30|35|
|6000000|36|35|
|7000000|39|36|
|8000000|43|36|
|9000000|46|37|
|10000000|50|36|

I confirmed that there was no memory starvation for any of the queries, and 
that the Sort operator never spilled to disk.

So it seems that after some point, the sort operator is faster than the TopN 
operator. I'm not entirely sure why, perhaps Tim's comment above is true. It's 
also a bit weird that the sort latency almost entirely stays the same, whereas 
TopN latency scales linearly with the offset value.

Regardless, setting {{topn_bytes_limit}} to a value of 96 mb causes the planner 
to switch from TopN to Sort somewhere in the 5,000,000 to 6,000,000 range, 
which looks like the sweet spot where sort starts to perform faster than topn.

> Switch to sorting node for large TopN queries
> ---------------------------------------------
>
>                 Key: IMPALA-5004
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5004
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 2.9.0
>            Reporter: Lars Volker
>            Assignee: Sahil Takiar
>            Priority: Major
>
> As explained by [~tarmstrong] in IMPALA-4995:
> bq. We should also consider switching to the sort operator for large limits. 
> This allows it to spill. The memory requirements for TopN also are 
> problematic for large limits, since it would allocate large vectors that are 
> untracked and also require a large amount of contiguous memory.
> There's already logic to select TopN vs. Sort: 
> [planner/SingleNodePlanner.java#L289|https://github.com/apache/incubator-impala/blob/master/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L289]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to