Dan Hecht has posted comments on this change.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

Patch Set 14:


Flushing out some more comments, and will move onto the later patch set.

One thing that's not clear to me yet is what the right separation between node 
and sink is -- they are currently very tightly coupled which means they don't 
really benefit from being separated into separate classes.  Is the goal of this 
change solely to be able to expose the DataSink interface, or is it to provide 
some separation as well? Do we see them becoming less coupled over time?

File be/src/exec/partitioned-hash-join-builder.h:

PS14, Line 45: build row partitions
is this the same as the "hash partitions" referred to in the next paragraph. 
let's be consistent in terminology.

Line 86: 
is it expected that the renaming public functions would only be used by PHJN?  
If so, would be good to state that.

PS14, Line 90: Acquire ownership
Doesn't the ownership pass from this class to the exec node?  In which case 
this method wouldn't acquire ownership.  So let's come up with better 
terminology and naming.

PS14, Line 95: Called after probing of the partitions
             :   /// is done. The partitions are not closed or destroyed, since 
they may be spilled or
             :   /// may contain unmatched build rows for certain join modes 
(e.g. right outer join).
The fact that we need some much explanation kinda makes this seem like it might 
not be the right abstraction.  Could we instead just clear the vector when 
creating the next set of partitions (it doesn't look like we delete partitions 
until either Close() or Reset() anyway), so why do we need to proactively 
clearly the hash_partitions_ vector before starting he next iteration?

Line 106:   }
nit: missing blank line

PS14, Line 109: When this call returns
Is this stuff by this method or the caller?  i.e. if this method closes 
input_partition, then say that.  If it doesn't then either say "after" or don't 
say it at all (why is it relevant to this method documentation?).

PS14, Line 112: so that the probe phase has enough buffers preallocated
              :   /// execute successfully.
slightly garbled. seems like a word is missing or something.

PS14, Line 117: /// Iterates over all the partitions in 'hash_partitions_' and 
returns the largest
              :   /// number of build rows.
rather than say how and referring to private member, and leaving it to the 
reader to assume that "largest number of build rows" is talking about a 
per-partition thing, let's say something like:

Returns the maximum number of build rows of a hash partition 

(or whatever terminology you choose in the class comment).

Line 123:   bool HashTableStoresNulls() const;
why is this here rather than the join node? especially given that the join node 
owns the HashTableCtx.  on a related note, I'm not really sure how the 
HashTable context is going to work out once this is driven like a real sync.  
Will we still have a parent_ link to the exec node living in a different 

Line 128:   inline const std::vector<bool>& is_not_distinct_from() const {
and then should this be here? is it to be closer to being able to break the 
parent_ link at some point and make the builder able to build the hash table 
without any info from parent_?

PS14, Line 182: partition side of this partition
not sure what this is saying. maybe just "for this build partition"?

Line 188:     /// false.
maybe clarify that an error is not returned in that case.  or clarify what 
cases are errors. etc.

PS14, Line 195: unpin_all_build
why "build"?  shouldn't it be unpin_all_blocks?  or even better would be to 
name it according to what the algorithm cares about. which I think is whether 
or not we might have more build rows to append, right?

PS14, Line 234: build or probe
does this care about the probe side partitioning?

Line 237:   /// 'null_aware_probe_partition_' and 'null_probe_rows_'.
Also explain what this is computing, rather than just what the computation is.

PS14, Line 244: all hash partitions for partitioning level
this is kinda misleading. maybe qualify with ... for the current input, or the 
"current set of hash partitions for partition level...". i.e. this isn't all 
the hash partitions that are created for a particular level.

PS14, Line 252: in

PS14, Line 256: in memory
what does this mean? is it trying to say it fits in the current (pinned) block?

File be/src/exec/partitioned-hash-join-node.h:

PS14, Line 410:  we then iterate over all the probe rows
> Clarified that it's the partition's. Most of the text was just copied verba
Thanks. From the perspective of the reviewer, the comment is already part of 
the patch (the code moved in non-trivial ways), so the reviewer has to read the 
comment. Since the reviewers need to read these comments, we might as well 
improve the comments now when we can make meaningful improvements since it's a 
relatively cheap way to improve readability (doesn't lead to additional 
debugging, testing, etc).

To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 14
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-HasComments: Yes

Reply via email to