Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3905: HdfsScanner::GetNext() for Avro, RC, and Seq scans.
......................................................................


Patch Set 2:

(26 comments)

No major concerns, I think there's a lot of opportunity to make the code easier 
to follow though.

http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/base-sequence-scanner.cc
File be/src/exec/base-sequence-scanner.cc:

Line 72:     header_(nullptr),
Initialise these in the class itself to avoid the duplication.


PS2, Line 130: nullptr
Don't need this argument.


Line 137:       
static_cast<HdfsScanNode*>(scan_node_)->AddMaterializedRowBatch(row_batch);
The transfer of ownership of this row batch is confusing and inconsistent 
between the two modes. 

I think it would be more consistent if, when there's a queue, Close() creates 
the row batch itself rather than having the caller do it.


Line 211:   // scan range
The logic for updating 'eos_'is a little confusing here, since we set 'eos_' 
then keep reading. Maybe we could instead have a local variable called 
something like 'last_sync' here and set 'eos_ = last_sync || stream_->eof()' 
below.


http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/base-sequence-scanner.h
File be/src/exec/base-sequence-scanner.h:

Line 40: /// through the first complete sync in the next range.
The positioning of line breaks is a bit weird in this paragraphs. Also maybe it 
would be good to have a heading like "Handling of sync markers" so it's easier 
to figure out what the comment is about when scanning through this file.


Line 86:   virtual Status GetNextInternal(RowBatch* row_batch);
Can you add WARN_UNUSED_RESULT to the functions that return Status?


PS2, Line 115:  Read and validate sync marker against header_->syn
Maybe "Read sync marker from 'stream_' and validate against header_->sync"


http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-avro-scanner.cc
File be/src/exec/hdfs-avro-scanner.cc:

Line 60:     avro_header_(nullptr),
Initialise these in the class itself to avoid having to update in multiple 
places.


Line 535:       VLOG_FILE << "Decompressed " << compressed_size << " to " << 
data_len_;
Remove this log? It doesn't seem very useful.


Line 557:       if (codegend_decode_avro_data_ != nullptr) {
Flatten out this else if {} to reduce nesting?


http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-avro-scanner.h
File be/src/exec/hdfs-avro-scanner.h:

PS2, Line 135: data_
maybe data_block_* to be more explicit? Or group these into a struct for the 
data block?


PS2, Line 140: num_records_
num_records_in_block_ or similar? Otherwise it's hard to know what it's 
counting.


http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 410:     Status status = GetNextInternal(batch);
It's not obvious why the batch needs to be added to the queue (I assume it's 
because previously-appended batches may reference data that was attached to 
'batch'). Should add a comment so people don't accidentally break this.


http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-rcfile-scanner.cc
File be/src/exec/hdfs-rcfile-scanner.cc:

Line 58:       num_rows_(0),
Init in header to be consistent with the changes I suggested to other scanners ?


Line 242: Status HdfsRCFileScanner::ResetRowGroup() {
This doesn't need to return a status now.


PS2, Line 464: ReadRowGroup
The name is a little confusing. ReadRowGroupHeader() or StartRowGroup()?


Line 526:         ErrorMsg msg(TErrorCode::GENERAL, Substitute("file: $0", 
stream_->filename()));
The handling of errors here is really wonky. We return a different error from 
what we logged in ReportColumnParseError().

I don't think we need to fix it in this patch but maybe file a JIRA?


http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-rcfile-scanner.h
File be/src/exec/hdfs-rcfile-scanner.h:

Line 382:   /// number of rows in this rowgroup object
Can we group all of the row-group-specific state into a struct? This would make 
it more explicit what needs to be reset between row groups. Then 
ResetRowGroup() could just be a method  on the struct.


Line 404:   uint8_t* row_group_buffer_;
Use a StringBuffer instead of row_group_buffer_ + row_group_length_ + 
row_group_buffer_size_? You'd need to return the row group length explicitly 
from ReadKeyBuffers() to make this work, but I think that would make it even 
clearer.


Line 408:   int row_group_length_;
I think this and row_group_buffer_size_ should be int64_ts. There's a 
possibility of overflow.


http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-scan-node.h
File be/src/exec/hdfs-scan-node.h:

Line 113:   boost::scoped_ptr<RowBatchQueue> materialized_row_batches_;
One of the harder things I found with this code is reasoning about the transfer 
of the row batches, especially with error paths. It would be clearer if we used 
unique_ptr in RowBatchQueue to document/enforce the transfer of ownership.

This probably isn't worth doing if it's difficult but if  it's an easy change 
it could help flush out any latent bugs.


http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

Line 80:     : scan_node_(NULL),
Consider moving the common initialisation to the class body.


Line 139:     scan_node->AddMaterializedRowBatch(batch);
Can you comment why we need to add the batch even on error?


http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-sequence-scanner.cc
File be/src/exec/hdfs-sequence-scanner.cc:

Line 47:       current_block_length_(-1),
Init in header for consistency with suggestion elsewhere.


http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-sequence-scanner.h
File be/src/exec/hdfs-sequence-scanner.h:

PS2, Line 191: int
int64_t for consistency with byte counts elsewhere


Line 191:   static const int MAX_BLOCK_SIZE = (1024 * 1024 * 1024);
extraneous parentheses


-- 
To view, visit http://gerrit.cloudera.org:8080/6527
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie18f57b0d3fe0052a8ccd361b6a5fcdf979d0669
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

Reply via email to