Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/11698 )
Change subject: IMPALA-5004: Switch to sorting node for large TopN queries ...................................................................... Patch Set 3: (4 comments) Made the following changes: * Changed default value for topn_bytes_limit from 96MB to 512MB; this makes the conversion much more conservative, which should help avoid any performance regressions * Added a new trigger for TopN --> Sort; if the input to TopN is >= (offset + limit) switch the Sort; the idea is that if TopN is going to sort the entire input dataset, then we should just use Sort instead * Re-factored some of the code in SingleNodePlanner so that creation of the SortNode is done in a dedicated method * Addressed Gerrit comments http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@299 PS3, Line 299: // Add a Sort Node instead of a TopN Node if the estimated size of the materialized : // rows for TopN exceeds the value of TOPN_BYTES_LIMIT : long estimatedTopNMaterializedSize = 0; : f > I'd prefer to put this block into SortInfo, called say, estimateMaterialize Done http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@306 PS3, Line 306: estimatedTopNMaterializedSize += sortSlotDesc.getByteSize(); > how's this compare with the row size estimate in SortNode.computeNodeResour They look similar, so I decided to combine them. In fact, the code in SortNode.computeNodeResourceProfile is probably more accurate because it is based on column stats. So I moved the logic into SortInfo so it can be shared between the SingleNodePlanner and SortNode. http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@310 PS3, Line 310: *= > I doubt we'll run into it in practice, but prefer to use PlanNode.checkedMu Now that we are using the code in SortNode.computeNoteResourceProfile, there is no long * long multiplication, instead its a float * long multiplication. So there shouldn't an overflow issue. http://gerrit.cloudera.org:8080/#/c/11698/3/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@313 PS3, Line 313: estimatedTopNMaterializedSize < ctx_.getQueryOptions().topn_bytes_limit; > @Paul, thanks for taking a look! I agree with a lot of your concerns. I've After discussing this a bit more with Tim, it seems that we generally avoid taking into account memory budget in the planner. So instead we decided to increase the default value to a much more conservative value to avoid any query regressions. We discussed the option of making the parameter a row limit rather than a memory limit. A row limit may be easier to tune for a user, but would make the memory usage of TopN more variable depending on the query. I'm open to either option, but think a memory limit would end up working better. -- To view, visit http://gerrit.cloudera.org:8080/11698 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9 Gerrit-Change-Number: 11698 Gerrit-PatchSet: 3 Gerrit-Owner: Sahil Takiar <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Paul Rogers <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Reviewer: Vuk Ercegovac <[email protected]> Gerrit-Comment-Date: Tue, 23 Oct 2018 17:57:12 +0000 Gerrit-HasComments: Yes
