Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3905: Add HdfsScanner::GetNext() interface and 
implementation for Parquet.
......................................................................


Patch Set 2:

(19 comments)

http://gerrit.cloudera.org:8080/#/c/3732/2/be/src/exec/base-sequence-scanner.cc
File be/src/exec/base-sequence-scanner.cc:

Line 139:     scan_node_->ThreadTokenAvailableCb(state_->resource_pool());
Has this just been moved from somewhere else? I don't understand this part of 
the change.


http://gerrit.cloudera.org:8080/#/c/3732/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 139:       skip_row_group_(true),
It looks like it's set to false before it's used anyway, but  it's 
counterintuitive to initialise this to true.


Line 170:   DCHECK(parse_status_.ok()) << "Invalid parse_status_" << 
parse_status_.GetDetail();
It seems like a lot of the below work corresponds more to the Open() phase of 
exec nodes instead of Prepare(). Maybe we should rename HdfsScanner::Prepare() 
to HdfsScanner::Open() that that the lifecycle lines up closer to the ExecNode 
lifecycle?


PS2, Line 236: NULL
Passing NULL argument to reset() is redundant.


Line 310:     assemble_rows_timer_.Start();
Seems easier to use a scoped timer here and avoid splitting the function call 
and RETURN_IF_ERROR below.


Line 312:     int max_tuples = min<int64_t>(row_batch->capacity(), 
rows_remaining);
Why the int64_t here? Everything else involved is an int.


Line 325:     assemble_rows_timer_.Start();
SCOPED_TIMER?


Line 354:   if (!scan_node_->PartitionPassesFilterPredicates(
Wrapping seems weird here - ok if it's just a unfortunate consequence of the 
line lengths.


Line 402:       // This row group will be handled by a different scanner.
Maybe explain briefly why? It is because it's included in the previous or next 
split?


PS2, Line 409: seeding_succeeded
seeded_ok? It's bit of a mouthful.


Line 439: void HdfsParquetScanner::TransferResources(RowBatch* row_batch) {
It's not immediately clear what resources this is transferring and when it's 
safe to call it.


Line 443:   for (ParquetColumnReader* r: column_readers_) 
r->TransferResources(row_batch);
I don't feel strongly about this, but we don't generally do one-line for 
statements like this.


PS2, Line 514:  scan node
Update comment - there won't necessarily be a scan node.


http://gerrit.cloudera.org:8080/#/c/3732/2/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

Line 270:     ThreadTokenAvailableCb(runtime_state_->resource_pool());
I'm not sure that I understand the intended effect of these 
ThreadTokenAvailableCb() changes. Was it just refactoring?


Line 876: 
Extraneous new line.


http://gerrit.cloudera.org:8080/#/c/3732/2/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

Line 114:   /// The memory referenced by tuples in the returned batch is valid 
until Close().
This isn't the guarantee that ExecNode::GetNext() provides (actually I'm not 
sure if these scanners guarantee it) and I don't think it's the one we want. 
I'm reading this as that you can assume that any memory referenced from the 
first batch returned is valid up until you call Close(), regardless of what you 
do with the batch.

The guarantee is more like "The memory referenced by the tuples is valid until 
either Close() is called, the batch is reset or destroyed, or a batch returned 
from a subsequent GetNext() call is reset or destroyed."


PS2, Line 115:  bool* eos
One idea I've thought about for ExecNode is that it might be simpler to make 
eos_ a member, rather than returning it from GetNext(). This might be a good 
place to test the idea if you think it is worthwhile.

When it's an output parameter it gets a little finicky trying to make sure that 
it's set correctly on all code paths and also making sure that the caller 
doesn't lose track of whether it got an 'eos' flag or not.


Line 124:   virtual void TransferResources(RowBatch* row_batch) {
I'm not sure that it makes sense to make this as part of the HdfsScanner 
interface since the resource management of each scanner is somewhat particular 
to the file format. I can't think of a scenario when it would make sense for an 
external class to call this, since there's no general rule about when it's safe 
to transfer resources.

E.g. in Parquet it would probably be clearer to have a 
TransferRowGroupResources() function.


http://gerrit.cloudera.org:8080/#/c/3732/2/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

Line 228:   virtual void TransferResources(RowBatch* row_batch) = 0;
When is it safe to call this?


-- 
To view, visit http://gerrit.cloudera.org:8080/3732
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iab50770bac05afcda4d3404fb4f53a0104931eb0
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

Reply via email to