Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11698 )

Change subject: IMPALA-5004: Switch to sorting node for large TopN queries
......................................................................


Patch Set 3:

(4 comments)

Made the following changes:
* Changed default value for topn_bytes_limit from 96MB to 512MB; this makes the 
conversion much more conservative, which should help avoid any performance 
regressions
* Added a new trigger for TopN --> Sort; if the input to TopN is >= (offset + 
limit) switch the Sort; the idea is that if TopN is going to sort the entire 
input dataset, then we should just use Sort instead
* Re-factored some of the code in SingleNodePlanner so that creation of the 
SortNode is done in a dedicated method
* Addressed Gerrit comments

http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@299
PS3, Line 299: // Add a Sort Node instead of a TopN Node if the estimated size 
of the materialized
             :       // rows for TopN exceeds the value of TOPN_BYTES_LIMIT
             :       long estimatedTopNMaterializedSize = 0;
             :       f
> I'd prefer to put this block into SortInfo, called say, estimateMaterialize
Done


http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@306
PS3, Line 306:           estimatedTopNMaterializedSize += 
sortSlotDesc.getByteSize();
> how's this compare with the row size estimate in SortNode.computeNodeResour
They look similar, so I decided to combine them. In fact, the code in 
SortNode.computeNodeResourceProfile is probably more accurate because it is 
based on column stats. So I moved the logic into SortInfo so it can be shared 
between the SingleNodePlanner and SortNode.


http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@310
PS3, Line 310: *=
> I doubt we'll run into it in practice, but prefer to use PlanNode.checkedMu
Now that we are using the code in SortNode.computeNoteResourceProfile, there is 
no long * long multiplication, instead its a float * long multiplication. So 
there shouldn't an overflow issue.


http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@313
PS3, Line 313:           estimatedTopNMaterializedSize < 
ctx_.getQueryOptions().topn_bytes_limit;
> @Paul, thanks for taking a look! I agree with a lot of your concerns. I've
After discussing this a bit more with Tim, it seems that we generally avoid 
taking into account memory budget in the planner. So instead we decided to 
increase the default value to a much more conservative value to avoid any query 
regressions.

We discussed the option of making the parameter a row limit rather than a 
memory limit. A row limit may be easier to tune for a user, but would make the 
memory usage of TopN more variable depending on the query. I'm open to either 
option, but think a memory limit would end up working better.



--
To view, visit http://gerrit.cloudera.org:8080/11698
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Gerrit-Change-Number: 11698
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Paul Rogers <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Vuk Ercegovac <[email protected]>
Gerrit-Comment-Date: Tue, 23 Oct 2018 17:57:12 +0000
Gerrit-HasComments: Yes

Reply via email to