[Impala-ASF-CR] IMPALA-5706: Parallelise read I/O in sorter

Tim Armstrong (Code Review) Wed, 02 May 2018 18:15:07 -0700

Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9943 )


Change subject: IMPALA-5706: Parallelise read I/O in sorter
......................................................................


Patch Set 7:

(16 comments)

Looking good

http://gerrit.cloudera.org:8080/#/c/9943/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9943/7//COMMIT_MSG@39
PS7, Line 39: double-buffering the number of runs in a single merge decreases 
and I
I think we still need data to decide whether the double-buffering is worth it. 
Even without that the patch is a big improvement to the logic.


http://gerrit.cloudera.org:8080/#/c/9943/7/be/src/runtime/sorter.cc
File be/src/runtime/sorter.cc:

http://gerrit.cloudera.org:8080/#/c/9943/7/be/src/runtime/sorter.cc@a1052
PS7, Line 1052:
Can we preserve this DCHECK but make it about the next page being pinned, or is 
it too awkward? We could move it into the if (is_pinned_) and make it something 
like:

  DCHECK(page_index + 1 == pages->size() || pages[page_index + 1]->is_pinned());


http://gerrit.cloudera.org:8080/#/c/9943/7/be/src/runtime/sorter.cc@880
PS7, Line 880:   // Attempt to pin the first fixed and var-length pages.
Comment needs updating.


http://gerrit.cloudera.org:8080/#/c/9943/7/be/src/runtime/sorter.cc@881
PS7, Line 881:   if (fixed_len_pages_.size() > 0) {
Might be cleaner to rewrite these as loops, e.g.

  for (int i = 0; i < min(2, fixed_len_pages_.size()); i++)
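
A minimal standalone sketch of such a loop, assuming the intent is to pin at
most the first two pages of each kind, so the bound is the smaller of 2 and the
page count. FakePage and PinLeadingPages are hypothetical names used only for
illustration; they are not the types or functions in sorter.cc.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a sorter page; only the pinned flag is modelled.
struct FakePage {
  bool pinned = false;
};

// Pins at most 'limit' pages from the front of 'pages' and returns how many
// were pinned. The min() bound keeps the loop inside the vector when there are
// fewer than 'limit' pages.
std::size_t PinLeadingPages(std::vector<FakePage>& pages, std::size_t limit = 2) {
  const std::size_t n = std::min(limit, pages.size());
  assert(n <= pages.size());
  for (std::size_t i = 0; i < n; ++i) {
    pages[i].pinned = true;
  }
  return n;
}
```

The same helper could then be called once for the fixed-length pages and once
for the var-length pages instead of repeating the if blocks.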


http://gerrit.cloudera.org:8080/#/c/9943/7/be/src/runtime/sorter.cc@1049
PS7, Line 1049: page_index
nit: idx/index inconsistency. I'm fine with either but the inconsistency in the 
function is a little distracting.


http://gerrit.cloudera.org:8080/#/c/9943/7/be/src/runtime/sorter.cc@1052
PS7, Line 1052:     DCHECK(next_unpinned_page != nullptr);
This DCHECK doesn't seem useful - if we messed up the vector addressing 
arithmetic we would almost certainly get an invalid but non-NULL pointer.


http://gerrit.cloudera.org:8080/#/c/9943/7/be/src/runtime/sorter.cc@1551
PS7, Line 1551: int64_t Sorter::ComputeMinReservation() {
Not a big deal but we could reduce the vertical whitespace in this function. I 
don't mind it personally but some people prefer denser code.


http://gerrit.cloudera.org:8080/#/c/9943/7/be/src/runtime/sorter.cc@1682
PS7, Line 1682:   for (int i = 0; i < sorted_runs_.size(); ++i) {
Maybe use range-based for loop? I find it more readable personally.


http://gerrit.cloudera.org:8080/#/c/9943/7/be/src/runtime/sorter.cc@1706
PS7, Line 1706:   int pages_needed_for_full_merge = sorted_runs_.size() * 
PinnedPagesPerRunForMerge();
Couldn't we compute this more accurately by doing a pass over sorted_runs_? E.g.

  pages_needed = 0
  for (run : sorted_runs_) {
    pages_needed += min(2, run->num_fixed_len_pages());
    pages_needed += min(2, run->num_var_len_pages());
  }
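
A self-contained version of that pass, as a sketch only. RunPageCounts and
PagesNeededForFullMerge are hypothetical names, and the page-count fields stand
in for whatever accessors the real Run class would expose.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a sorted run; only the page counts matter here.
struct RunPageCounts {
  int64_t num_fixed_len_pages;
  int64_t num_var_len_pages;
};

// Sums, over all runs, at most two pages of each kind per run. A run with
// fewer than two pages of a kind contributes only the pages it actually has.
int64_t PagesNeededForFullMerge(const std::vector<RunPageCounts>& runs) {
  int64_t pages_needed = 0;
  for (const RunPageCounts& run : runs) {
    assert(run.num_fixed_len_pages >= 0 && run.num_var_len_pages >= 0);
    pages_needed += std::min<int64_t>(2, run.num_fixed_len_pages);
    pages_needed += std::min<int64_t>(2, run.num_var_len_pages);
  }
  return pages_needed;
}
```

For three runs with (fixed, var) page counts of (5, 0), (1, 1) and (3, 4), this
gives 2 + 2 + 4 = 8 pages, whereas a flat four pages per run would give 12.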


http://gerrit.cloudera.org:8080/#/c/9943/7/be/src/runtime/sorter.cc@1721
PS7, Line 1721:   if (max_runs_per_intermediate_merge + 1 >= 
sorted_runs_.size()) {
I think my above comment about accurately computing the requirement for the 
final merge applies here too. I guess the maximum runs per intermediate merge 
and the requirement for the final merge are different since the first is an 
upper bound only.


http://gerrit.cloudera.org:8080/#/c/9943/7/be/src/runtime/sorter.cc@1727
PS7, Line 1727:   if (max_runs_per_intermediate_merge > sorted_runs_.size() / 
2) {
I had a bit of trouble reasoning through whether this calculation was correct 
for edge cases.

E.g. if sorted_runs_.size() == 17 and max_runs_per_intermediate_merge == 8, 
this is false, which seems correct. But if sorted_runs_.size() == 16 and 
max_runs_per_intermediate_merge == 8, then it is false, but it seems like it 
should be true, because we can:

1. Merge 8 runs into 1.
2. Merge 9 runs into the final output.

For me it might be a bit easier to understand if it was instead:

  max_runs_per_intermediate_merge * 2 >= sorted_runs_.size()

Since that avoids the rounding.
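
To make the edge cases concrete, here is a small sketch comparing the two forms
of the condition on the values above. The function names are hypothetical and
exist only for this comparison; the two forms differ only when the number of
runs is even and exactly twice the per-merge maximum.

```cpp
#include <cassert>
#include <cstddef>

// Condition as written in PS7: integer division rounds num_runs / 2 down.
bool FitsWithDivision(std::size_t max_runs_per_merge, std::size_t num_runs) {
  return max_runs_per_merge > num_runs / 2;
}

// Suggested form: multiply instead of divide, so nothing is rounded.
bool FitsWithMultiplication(std::size_t max_runs_per_merge, std::size_t num_runs) {
  return max_runs_per_merge * 2 >= num_runs;
}

// Checks the cases from the comment above, plus one odd count below the limit.
void CheckEdgeCases() {
  // 17 runs, max 8: both forms are false.
  assert(!FitsWithDivision(8, 17));
  assert(!FitsWithMultiplication(8, 17));
  // 16 runs, max 8: 8 > 16 / 2 is false, but 8 * 2 >= 16 is true.
  assert(!FitsWithDivision(8, 16));
  assert(FitsWithMultiplication(8, 16));
  // 15 runs, max 8: both forms are true (15 / 2 rounds down to 7).
  assert(FitsWithDivision(8, 15));
  assert(FitsWithMultiplication(8, 15));
}
```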


http://gerrit.cloudera.org:8080/#/c/9943/7/testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test:

PS7:
IMPALA-4835 probably caused a bunch of conflicts in the planner tests. 
Hopefully it isn't too much of a pain to regenerate the output - you can copy 
over the actual output from logs/fe_Tests/PlannerTests (maybe you're already 
doing that).


http://gerrit.cloudera.org:8080/#/c/9943/7/tests/query_test/test_sort.py
File tests/query_test/test_sort.py:

http://gerrit.cloudera.org:8080/#/c/9943/7/tests/query_test/test_sort.py@45
PS7, Line 45:   def test_multiple_mem_limits(self, vector):
Are these tests now tuned for a 3-node minicluster? Maybe we should skip 
validating TotalMergesPerformed if we're not running against such a minicluster 
- I'm not sure if it will still pass on a one-node local build.


http://gerrit.cloudera.org:8080/#/c/9943/7/tests/query_test/test_sort.py@65
PS7, Line 65:       assert "TotalMergesPerformed: " + test_input['merges'] in 
query_result.runtime_profile
Long line.


http://gerrit.cloudera.org:8080/#/c/9943/7/tests/query_test/test_sort.py@121
PS7, Line 121:     """Runs a query having multiple Sort nodes on top of each 
other both with intermediate
Could we get this to run quicker by using a smaller table or selecting fewer 
columns? It would be ideal to make this test faster.


http://gerrit.cloudera.org:8080/#/c/9943/7/tests/query_test/test_sort.py@143
PS7, Line 143:     sort2_profile = 
self.get_sort_node_profile(3,result.runtime_profile)
nit: space after ,



--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 7
Gerrit-Owner: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Thu, 03 May 2018 01:14:48 +0000
Gerrit-HasComments: Yes
