Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
......................................................................
Patch Set 9: (17 comments)

http://gerrit.cloudera.org:8080/#/c/3873/9/be/src/exec/analytic-eval-node.cc
File be/src/exec/analytic-eval-node.cc:

Line 206: Status status = Status::MemLimitExceeded();
> let's make the error reporting for PrepareForWrite() and PrepareForRead() m
Good point, this discrepancy crept in during a rebase.

http://gerrit.cloudera.org:8080/#/c/3873/9/be/src/exec/data-sink.cc
File be/src/exec/data-sink.cc:

Line 165: if (expr_mem_tracker_ != NULL) {
> the function comment makes it sound like we should also have:
Done. Makes it more obvious.

http://gerrit.cloudera.org:8080/#/c/3873/9/be/src/exec/nested-loop-join-builder.cc
File be/src/exec/nested-loop-join-builder.cc:

Line 35
> revert?
I just fixed the formatting while I was looking at it; I think the new formatting is more conventional.

http://gerrit.cloudera.org:8080/#/c/3873/9/be/src/exec/partitioned-hash-join-builder.cc
File be/src/exec/partitioned-hash-join-builder.cc:

Line 310: for (unique_ptr<Partition>& partition: all_partitions_) partition->Close(NULL);
> nit: extra space before 'partition'
Done

Line 464: // SwitchToIoBuffers() call below will succeed.
> Does this comment make sense? Can we use small buffers and have a stream in
Yes (aside from SwitchToIoBuffers() being a weird API). The build stream started off with small buffers, so it might not have switched to I/O buffers at this point. I think it's unlikely that we'll spill a partition that hadn't switched to I/O buffers, but it's possible. This will go away with the new buffer pool.

http://gerrit.cloudera.org:8080/#/c/3873/9/be/src/exec/partitioned-hash-join-builder.h
File be/src/exec/partitioned-hash-join-builder.h:

Line 46: class PhjBuilder : public DataSink {
> PartitionedHashJoinBuilder for consistency?
It's for consistency with NljBuilder (Marcel suggested that name on the previous review).

Line 69: /// This function walks the current hash partitions and picks one to spill.
> ...
and spills the largest one
Done, although I made it clear that this is a policy choice.

Line 174: /// transferring ownership to them.
> this line doesn't seem to add anything
Done

Line 189: /// all blocks aside from the current read or write block are left pinned.
This comment was wrong - fixed it.

http://gerrit.cloudera.org:8080/#/c/3873/9/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

Line 865: Status PartitionedHashJoinNode::BuildHashTablesAndPrepareProbePartitions() {
> It would be good to document what exactly is handed off from the PHJ builde
I agree that should be the end goal. For now, I think it's hard to do better in the spilling case than this approach of allocating probe buffers, then hash tables, then the remaining probe buffers, until we have reservations. When we need to implement the in-memory broadcast join, we could add a temporary in-memory codepath that does the hash table build purely in the builder but bails out if any partition spills. One stopgap option (feasible but ugly) that lets us implement the current algorithm in the builder is to allocate the initial probe Blocks in the builder and then hand them over to the join. PrepareForWrite() could then optionally take a Block as an argument. What do you think of doing that?

Line 891: for (int i = 0; i < PARTITION_FANOUT; ++i) {
I moved the hash table build calls into the builder as an incremental step towards separating this more cleanly.

Line 959: RETURN_IF_ERROR(builder_->SpillPartition(&spilled_partition_idx));
> It feels like there must be a simpler and more efficient way to handle this
Yeah, if we know how much memory we have to work with, we can precompute the sizes of the hash tables and do some kind of bin-packing. I don't think that's feasible until we have reservations.
http://gerrit.cloudera.org:8080/#/c/3873/9/be/src/exec/partitioned-hash-join-node.h
File be/src/exec/partitioned-hash-join-node.h:

Line 79: /// Logically, the algorithm has the following modes:.
> would be good to document somewhere (not necessarily here) what phases are
I added a comment to the builder class to better explain the responsibilities of the builder.

Line 150: bool AppendProbeRow(BufferedTupleStream* stream, TupleRow* row, Status* status);
I changed this to just return a Status, since there's no situation where it can return false but have an OK status.

Line 445: /// Close the partition and attach resources to 'batch' if non-NULL or free the
> can you make the comment on BuildPartition::Close() similar to this one? th
Done

http://gerrit.cloudera.org:8080/#/c/3873/9/be/src/runtime/buffered-tuple-stream.h
File be/src/runtime/buffered-tuple-stream.h:

Line 231: /// Prepares the stream for reading. Called after Init() and before the first AddRow()
> maybe expand to:
Done

http://gerrit.cloudera.org:8080/#/c/3873/9/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

Line 400: AddChild(child, indent);
> shouldn't the addition and the swap be atomic under children_lock_?
Good point. I added an internal AddChildLocked() method and made sure to hold children_lock_ for the duration of this method.

--
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes
