Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15963 )

Change subject: IMPALA-6692: Trigger sort node run before hitting memory limit.

Patch Set 12:


Sorry for the many grammar comments, I was also the victim of this in the past 

My only real concern is about the case when the cardinality is unknown. My 
preference would be to try to allow spilling in that case.

File be/src/exec/sort-node.h:

PS12, Line 77: going
nit: "will go"?

File be/src/exec/sort-node.cc:

PS12, Line 89: cardinality
What is the default value of this? Can it be -1 (unknown)? The result seems 
pretty wrong in that case.

PS12, Line 90: GetRowSize
I think that this doesn't contain varlen data, so it can greatly underestimate 
the input size if there are strings.

PS12, Line 92: 2)
I think that VLOG(3) is enough.

File be/src/runtime/sorter.h:

PS12, Line 101: a
"the" would be better

PS12, Line 101:   /// 'estimated_input_size' is a total rows in bytes that 
estimated to get added into
nit: missing "are"

PS12, Line 102:   /// this sorter. This is used to decide if sorter need to 
proactively spilling for
nit: needs

PS12, Line 102: spilling
nit: spill

PS12, Line 223: run
nit: "do an"?

File be/src/runtime/sorter.cc:

PS12, Line 816:       VLOG(2) << Substitute(
I think that VLOG(3) is enough here - this should happen if the cardinality 
estimation was wrong, which may make WARNING logical, but this seems 
unavoidable for many queries, so I wouldn't spam the warning log.

PS12, Line 907:         VLOG(2) << Substitute(
Same as line 816.

File common/thrift/ImpalaInternalService.thrift:

PS12, Line 645: backeds
typo: backends

File tests/query_test/test_sort.py:

PS12, Line 74:     """The first sort run is given a privilege to ignore 
sort_run_bytes_limit, except
             :        when estimate hints that spill is inevitable. The lower 
sort_run_bytes_limit of
             :        a query is, the more sort runs are likely to be produced.
             :        Case 1 : 1 run produced, because all rows fit within the 
maximum reservation.
             :                 sort_run_bytes_limit is not enforced.
             :        Case 2 : 3 run produced, because the first run hit 
reservation limit, and the
             :                 next 2 runs are capped to 150m.
             :        Case 3 : 4 run produced, because sort node estimate that 
spill is inevitable.
             :                 So all runs are capped to 130m, including the 
first one."""
Isn't there something in query_result.runtime_profile that could be used to 
check some of these statements? E.g. I think we can check that no spilling 
occurred for case 1, but it did occur for case 2 and 3

To view, visit http://gerrit.cloudera.org:8080/15963
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2a0ba7c4bae4f1d300d4d9d7f594f63ced06a240
Gerrit-Change-Number: 15963
Gerrit-PatchSet: 12
Gerrit-Owner: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: David Rorke <dro...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Thu, 25 Jun 2020 15:00:47 +0000
Gerrit-HasComments: Yes

Reply via email to