Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Parallelise read I/O in sorter
......................................................................


Patch Set 5:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/9943/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9943/3//COMMIT_MSG@14
PS3, Line 14:   - Remove the hard-coded maximum limit of buffers that can be 
used
            :     for merging the sorted runs. Instead this number is calculated
            :     based on the available memory through buffer pool.
> Does this also apply to the final merge? If yes, does this mean that the pa
I gave this a thought after we discussed this offline and I believe that the 
parent node will will be fine in terms of it's own allocations because of the 
following reasons:
 - Digging in query profiles as I see in case we have multiple Sort nodes than 
all of them is assigned some memory that is kept by the node until it finishes 
it's work. As a result one Sort can't make another Sort starving.
 - The optimisation mentioned in this section is only made for the merging 
process but has no effect on the initial runs, so no change in terms of memory 
allocation in Open().
 - The hard-coded memory limit was set to a quite high number: 
MAX_BUFFERS_PER_MERGE=128, that means that even with var len data it allowed to 
merge ~64 runs with one merge round. In theory this was barely reached as the 
actual number of buffers per merge came after the Sorter started allocating 
memory run-by-run until it ran out of it's own memory limit.

Long story short, nothing was drastically changed here just some fine tuning so 
I believe this should be fine.

Another thing: I added a test that has 2 Sort nodes with minimal memory limit 
to verify that the lower sort doesn't consume all the memory from the upper 
sort.


http://gerrit.cloudera.org:8080/#/c/9943/3//COMMIT_MSG@26
PS3, Line 26:   - In case doing a sort on top of a join when working with a
            :     restricted amount of memory then the Sort node successfully
            :     allocates additional memory right before the merging phase.
> I found this sentence hard to understand - can you rephrase it in a more de
I added a bit deeper explanation. However, I think that writing such a timeline 
that you suggested is an overkill as sorter.h has a nice explanation about it's 
phases and the required order/purpose of the interface calls.


http://gerrit.cloudera.org:8080/#/c/9943/3//COMMIT_MSG@32
PS3, Line 32: a merge)
> typo: slightly
Done


http://gerrit.cloudera.org:8080/#/c/9943/3//COMMIT_MSG@32
PS3, Line 32:     grabbed for a merge).
> type: missing s
Done


http://gerrit.cloudera.org:8080/#/c/9943/3/be/src/runtime/sorter.cc
File be/src/runtime/sorter.cc:

http://gerrit.cloudera.org:8080/#/c/9943/3/be/src/runtime/sorter.cc@1037
PS3, Line 1037:   Page* next_page = &(*pages)[page_index + 1];
> I think that the original DCHECK is still useful, but not for the next page
Well, I didn't want to add an if here as well to check if that page exists. 
Instead I do so below right before I pin that page.


http://gerrit.cloudera.org:8080/#/c/9943/3/be/src/runtime/sorter.cc@1039
PS3, Line 1039:
> I would add to this comment that all pages must be pinned.
The description of is_pinned_ says that "True if all pages in the run are 
pinned", do you think it brings in additional value to mention this here as 
well?



--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 5
Gerrit-Owner: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Sat, 21 Apr 2018 15:22:09 +0000
Gerrit-HasComments: Yes

Reply via email to