[Impala-CR](cdh5-trunk) IMPALA-3567: Part 1: groundwork to make Join build sides DataSinks

Alex Behm (Code Review) Mon, 18 Jul 2016 18:07:49 -0700

Alex Behm has posted comments on this change.

Change subject: IMPALA-3567: Part 1: groundwork to make Join build sides 
DataSinks
......................................................................



Patch Set 11:

(35 comments)

http://gerrit.cloudera.org:8080/#/c/3282/8/be/src/exec/blocking-join-node.h
File be/src/exec/blocking-join-node.h:

Line 23: #include "exec/data-sink.h"
forward declare DataSink


Line 98:   /// Prepare() work, e.g. codegen.
Nit, but I don't think we construct the "build side". The build side is an exec 
node child. We consume the build side and construct a join build.

Imo, the original ConstructBuildSide() seems to be somewhat of a misnomer also. 
What do you think?

Suggest:
Constructs the join build from the right child's output...


Line 104:   /// Subclasses should close any other structures and then call
... the Open() and build ...


Line 105:   /// BlockingJoinNode::Close().
permit (plural)


Line 106:   virtual void Close(RuntimeState* state);
Suggest: If the build_sink is non-NULL, uses it to construct the join build. 
Otherwise, ...


Line 112:   TJoinOp::type join_op_;
... to construct the join build via a data sink.

(the "build side" does not implement the DataSink interface)


Line 118: 
left child (probe side).


Line 123:   TupleRow* current_probe_row_;  // The row currently being probed
typo: extra '.'


Line 196: 
nit: I don't think we use the DataSink interface to construct the join build. 
We use the data sink.

Suggest: Constructs the join build via 'build_sink", if it is non-NULL. 
Otherwise, ...


http://gerrit.cloudera.org:8080/#/c/3282/11/be/src/exec/blocking-join-node.h
File be/src/exec/blocking-join-node.h:

Line 42: /// BlockingJoinNode. Reset() is added to support subplans and more 
efficient allocation
remove "is added"


Line 43: /// of RowBatch objects is enabled with the GetNextBatchForBuild() 
interface.
remove "is enabled"


Line 51:   /// Frees resources associated with cache row batches.
cached row batches


Line 52:   /// Subclasses must call JoinBuildSink::Close().
is the order of Close() calls important? If so, mention which Close() should be 
called first.


Line 55:   /// Returns the next build batch that should be filled and passed to 
AddBuildBatch()
Returns an empty build batch that should be filled and passed to Send().

(I don't see an AddBuildBatch() anywhere)


Line 56:   /// or Send(). For join types that accumulates batches, this helps 
avoid additional
accumulate


Line 60:   inline RowBatch* GetNextBatchForBuild() { return 
build_batch_cache_.GetNextBatch(); }
GetEmptyBatch() or GetNextEmptyBatch()? The important part seems to be that the 
batch is empty.

Probably something to be discussed f2f:
I'm thinking that it actually makes sense to make this part of the DataSink 
interface. The join build can implement this via a cache for the join types 
that need it.


Line 70:   const bool using_builder_interface_;
Isn't the accumulates_batches_ flag sufficient? The comment on 
accumulates_batches_ even states that GetNextBatchForBuild() should either be 
called just once, or once for every Send().


Line 76:   const bool accumulates_batches_;
Couldn't this be done by making GetNextBatchForBuild() virtual? My thinking is 
that we could hide those details completely in the sink itself. We could also 
move the row batch cache to only the relevant build sinks.


Line 150:   virtual Status ConstructBuildSide(RuntimeState* state) = 0;
Terminology nit that is not your fault: To me "Build Side" means "the exec node 
child used for constructing the join build". So imo this function should be 
called ConstructJoinBuild(), and similar changes elsewhere.
What do you think?


Line 154:   /// If build_sink is non-NULL, the sink interface will be used to 
construct the build
If build_sink is non-NULL, uses it to construct the build side.

(the sink "interface" can't really construct anything, we use the build_sink 
instance)


http://gerrit.cloudera.org:8080/#/c/3282/8/be/src/exec/data-sink.cc
File be/src/exec/data-sink.cc:

Line 166
single line


http://gerrit.cloudera.org:8080/#/c/3282/8/be/src/exec/data-sink.h
File be/src/exec/data-sink.h:

Line 44: /// Clients of the DataSink interface drive the data sink by 
repeatedly calling Send()
These changes make a lot of sense to me


Line 71:   virtual Status FlushFinal(RuntimeState* state) = 0;
The important piece seems to be that the returned batch is empty and is 
expected to be populated, so how about:

CreateEmptyBatch() or GetEmptyBatch() or NextEmptyBatch()

Seems to make more sense when explaining the flow:
b = sink.CreateEmptyBatch();
plan.GetNext(b);
sink.Send(b);


Line 104:   /// Owned by the RuntimeState.
const methods


http://gerrit.cloudera.org:8080/#/c/3282/11/be/src/exec/hash-join-node.cc
File be/src/exec/hash-join-node.cc:

Line 207:   RETURN_IF_ERROR(BlockingJoinNode::GetFirstProbeRow(state));
Seems a little weird that we prepare for probe after getting the first probe 
row.


http://gerrit.cloudera.org:8080/#/c/3282/8/be/src/exec/hash-join-node.h
File be/src/exec/hash-join-node.h:

Line 70:   /// holds everything referenced from build side
Can you spell out 'everything' a little more?


Line 120:   void ResetForProbe();
PrepareForProbe()?


http://gerrit.cloudera.org:8080/#/c/3282/11/be/src/exec/hash-join-node.h
File be/src/exec/hash-join-node.h:

Line 70:   /// holds everything referenced from build side
can you spell out 'everything' a little more? is it memory referenced from 
build batches?


http://gerrit.cloudera.org:8080/#/c/3282/11/be/src/exec/hdfs-table-writer.cc
File be/src/exec/hdfs-table-writer.cc:

Line 20: #include "runtime/row-batch.h"
remove?


http://gerrit.cloudera.org:8080/#/c/3282/11/be/src/exec/nested-loop-join-builder.h
File be/src/exec/nested-loop-join-builder.h:

Line 24: /// Builder for the NestedLoopJoinNode operator that accumulates the 
build-side rows
remove "operator"


http://gerrit.cloudera.org:8080/#/c/3282/11/be/src/exec/nested-loop-join-node.cc
File be/src/exec/nested-loop-join-node.cc:

Line 59:   SCOPED_TIMER(runtime_profile_->total_time_counter());
won't this double count the time spent in BlockingJoinNode::Open()? I think 
should be moved down a line


http://gerrit.cloudera.org:8080/#/c/3282/11/be/src/exec/nested-loop-join-node.h
File be/src/exec/nested-loop-join-node.h:

Line 59:   boost::scoped_ptr<NljBuilder> build_side_;
build_sink_?


http://gerrit.cloudera.org:8080/#/c/3282/11/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

Line 242:   SCOPED_TIMER(runtime_profile_->total_time_counter());
I think the counter should start after BlockingJoinNode::Open()


http://gerrit.cloudera.org:8080/#/c/3282/11/be/src/exec/row-batch-cache.h
File be/src/exec/row-batch-cache.h:

Line 61:     row_batches_.clear(); // unique_ptr automatically calls all 
destructors.
Add a DCHECK in the RowBatchCache d'tor to make sure row_batches_ is empty?


http://gerrit.cloudera.org:8080/#/c/3282/11/be/src/util/stopwatch.h
File be/src/util/stopwatch.h:

Line 241:   ScopedStopWatch(T* sw, bool enabled = true) :
The enabled flag seems awkward because it's not clear why anyone would use it 
(outside of a macro). Any alternative solutions?


-- 
To view, visit http://gerrit.cloudera.org:8080/3282
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9d7608181eeacfe706a09c1e153d0a3e1ee9b475
Gerrit-PatchSet: 11
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) IMPALA-3567: Part 1: groundwork to make Join build sides DataSinks

Reply via email to