Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/9943 )
Change subject: IMPALA-5706: Parallelise read I/O in sorter ...................................................................... Patch Set 5: (6 comments) http://gerrit.cloudera.org:8080/#/c/9943/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9943/3//COMMIT_MSG@14 PS3, Line 14: - Remove the hard-coded maximum limit of buffers that can be used : for merging the sorted runs. Instead this number is calculated : based on the available memory through buffer pool. > Does this also apply to the final merge? If yes, does this mean that the pa I gave this a thought after we discussed this offline and I believe that the parent node will will be fine in terms of it's own allocations because of the following reasons: - Digging in query profiles as I see in case we have multiple Sort nodes than all of them is assigned some memory that is kept by the node until it finishes it's work. As a result one Sort can't make another Sort starving. - The optimisation mentioned in this section is only made for the merging process but has no effect on the initial runs, so no change in terms of memory allocation in Open(). - The hard-coded memory limit was set to a quite high number: MAX_BUFFERS_PER_MERGE=128, that means that even with var len data it allowed to merge ~64 runs with one merge round. In theory this was barely reached as the actual number of buffers per merge came after the Sorter started allocating memory run-by-run until it ran out of it's own memory limit. Long story short, nothing was drastically changed here just some fine tuning so I believe this should be fine. Another thing: I added a test that has 2 Sort nodes with minimal memory limit to verify that the lower sort doesn't consume all the memory from the upper sort. http://gerrit.cloudera.org:8080/#/c/9943/3//COMMIT_MSG@26 PS3, Line 26: - In case doing a sort on top of a join when working with a : restricted amount of memory then the Sort node successfully : allocates additional memory right before the merging phase. > I found this sentence hard to understand - can you rephrase it in a more de I added a bit deeper explanation. However, I think that writing such a timeline that you suggested is an overkill as sorter.h has a nice explanation about it's phases and the required order/purpose of the interface calls. http://gerrit.cloudera.org:8080/#/c/9943/3//COMMIT_MSG@32 PS3, Line 32: a merge) > typo: slightly Done http://gerrit.cloudera.org:8080/#/c/9943/3//COMMIT_MSG@32 PS3, Line 32: grabbed for a merge). > type: missing s Done http://gerrit.cloudera.org:8080/#/c/9943/3/be/src/runtime/sorter.cc File be/src/runtime/sorter.cc: http://gerrit.cloudera.org:8080/#/c/9943/3/be/src/runtime/sorter.cc@1037 PS3, Line 1037: Page* next_page = &(*pages)[page_index + 1]; > I think that the original DCHECK is still useful, but not for the next page Well, I didn't want to add an if here as well to check if that page exists. Instead I do so below right before I pin that page. http://gerrit.cloudera.org:8080/#/c/9943/3/be/src/runtime/sorter.cc@1039 PS3, Line 1039: > I would add to this comment that all pages must be pinned. The description of is_pinned_ says that "True if all pages in the run are pinned", do you think it brings in additional value to mention this here as well? -- To view, visit http://gerrit.cloudera.org:8080/9943 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9 Gerrit-Change-Number: 9943 Gerrit-PatchSet: 5 Gerrit-Owner: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Comment-Date: Sat, 21 Apr 2018 15:22:09 +0000 Gerrit-HasComments: Yes
