Alex Behm has posted comments on this change. Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder ......................................................................
Patch Set 9: (15 comments) I'm not quite done with reviewing everything, but getting close. http://gerrit.cloudera.org:8080/#/c/3873/9//COMMIT_MSG Commit Message: Line 22: spill additional partitions during probing if the probe partitions needed your new solution is much saner, thank you! http://gerrit.cloudera.org:8080/#/c/3873/9/be/src/exec/analytic-eval-node.cc File be/src/exec/analytic-eval-node.cc: Line 206: Status status = Status::MemLimitExceeded(); let's make the error reporting for PrepareForWrite() and PrepareForRead() more similar, e.g., by adding a PREPARE_FOR_WRITE_FAILED_ERROR_MSG and doing the same as for the read case http://gerrit.cloudera.org:8080/#/c/3873/9/be/src/exec/data-sink.cc File be/src/exec/data-sink.cc: Line 165: if (expr_mem_tracker_ != NULL) { the function comment makes it sound like we should also have: if (closed_) return; http://gerrit.cloudera.org:8080/#/c/3873/9/be/src/exec/nested-loop-join-builder.cc File be/src/exec/nested-loop-join-builder.cc: Line 35 revert? http://gerrit.cloudera.org:8080/#/c/3873/9/be/src/exec/partitioned-hash-join-builder.cc File be/src/exec/partitioned-hash-join-builder.cc: Line 310: for (unique_ptr<Partition>& partition: all_partitions_) partition->Close(NULL); nit: extra space before 'partition' Line 464: // SwitchToIoBuffers() call below will succeed. Does this comment make sense? Can we use small buffers and have a stream in the same partition? http://gerrit.cloudera.org:8080/#/c/3873/9/be/src/exec/partitioned-hash-join-builder.h File be/src/exec/partitioned-hash-join-builder.h: Line 46: class PhjBuilder : public DataSink { PartitionedHashJoinBuilder for consistency? Line 69: /// This function walks the current hash partitions and picks one to spill. ... and spills the largest one Line 174: /// transferring ownership to them. this line doesn't seem to add anything http://gerrit.cloudera.org:8080/#/c/3873/9/be/src/exec/partitioned-hash-join-node.cc File be/src/exec/partitioned-hash-join-node.cc: Line 865: Status PartitionedHashJoinNode::BuildHashTablesAndPrepareProbePartitions() { It would be good to document what exactly is handed off from the PHJ builder to the PHJ node (e.g., in the PHJ builder class comment) To me it would make sense to move the actual HT construction into the builder as well to accommodate the in-memory broadcast join scenario (the probe threads expect the HT to be built). I think we should strive to not spill build partitions in the hash join node to maintain a clean separation, but I realize this would be a drastic departure from the existing code since we re-partition spilled partitions lazily during the probe. If we had reservations, it seems like we could determine the re-partitioning requirements before the probe. Line 959: RETURN_IF_ERROR(builder_->SpillPartition(&spilled_partition_idx)); It feels like there must be a simpler and more efficient way to handle this case - we are going to spill the largest partition here, dropping the constructed HT. At the same time it's really an edge case, so probably better to defer. It's certainly better than before :) http://gerrit.cloudera.org:8080/#/c/3873/9/be/src/exec/partitioned-hash-join-node.h File be/src/exec/partitioned-hash-join-node.h: Line 79: /// Logically, the algorithm has the following modes:. would be good to document somewhere (not necessarily here) what phases are executed where, i.e., in the builder or the join node. seems like we'll end up with the build phase finishing the level0 partitioning and HT construction, and then the join may need to repartition probe and build partitions Line 445: /// Close the partition and attach resources to 'batch' if non-NULL or free the can you make the comment on BuildPartition::Close() similar to this one? this new one is much better http://gerrit.cloudera.org:8080/#/c/3873/9/be/src/runtime/buffered-tuple-stream.h File be/src/runtime/buffered-tuple-stream.h: Line 231: /// Prepares the stream for reading. Called after Init() and before the first AddRow() maybe expand to: Prepares the stream for writing by attempting to pin a block to buffer the writes. http://gerrit.cloudera.org:8080/#/c/3873/9/be/src/util/runtime-profile.cc File be/src/util/runtime-profile.cc: Line 400: AddChild(child, indent); shouldn't the addition and the swap be atomic under children_lock_? the interfaces existing interfaces seem to make atomicity difficult (AddChild() takes and releases lock) -- To view, visit http://gerrit.cloudera.org:8080/3873 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
