Tim Armstrong has posted comments on this change. Change subject: IMPALA-3905: HdfsScanner::GetNext() for Avro, RC, and Seq scans. ......................................................................
Patch Set 2: (26 comments) No major concerns, I think there's a lot of opportunity to make the code easier to follow though. http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/base-sequence-scanner.cc File be/src/exec/base-sequence-scanner.cc: Line 72: header_(nullptr), Initialise these in the class itself to avoid the duplication. PS2, Line 130: nullptr Don't need this argument. Line 137: static_cast<HdfsScanNode*>(scan_node_)->AddMaterializedRowBatch(row_batch); The transfer of ownership of this row batch is confusing and inconsistent between the two modes. I think it would be more consistent if, when there's a queue, Close() creates the row batch itself rather than having the caller do it. Line 211: // scan range The logic for updating 'eos_'is a little confusing here, since we set 'eos_' then keep reading. Maybe we could instead have a local variable called something like 'last_sync' here and set 'eos_ = last_sync || stream_->eof()' below. http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/base-sequence-scanner.h File be/src/exec/base-sequence-scanner.h: Line 40: /// through the first complete sync in the next range. The positioning of line breaks is a bit weird in this paragraphs. Also maybe it would be good to have a heading like "Handling of sync markers" so it's easier to figure out what the comment is about when scanning through this file. Line 86: virtual Status GetNextInternal(RowBatch* row_batch); Can you add WARN_UNUSED_RESULT to the functions that return Status? PS2, Line 115: Read and validate sync marker against header_->syn Maybe "Read sync marker from 'stream_' and validate against header_->sync" http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-avro-scanner.cc File be/src/exec/hdfs-avro-scanner.cc: Line 60: avro_header_(nullptr), Initialise these in the class itself to avoid having to update in multiple places. Line 535: VLOG_FILE << "Decompressed " << compressed_size << " to " << data_len_; Remove this log? It doesn't seem very useful. Line 557: if (codegend_decode_avro_data_ != nullptr) { Flatten out this else if {} to reduce nesting? http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-avro-scanner.h File be/src/exec/hdfs-avro-scanner.h: PS2, Line 135: data_ maybe data_block_* to be more explicit? Or group these into a struct for the data block? PS2, Line 140: num_records_ num_records_in_block_ or similar? Otherwise it's hard to know what it's counting. http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 410: Status status = GetNextInternal(batch); It's not obvious why the batch needs to be added to the queue (I assume it's because previously-appended batches may reference data that was attached to 'batch'). Should add a comment so people don't accidentally break this. http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: Line 58: num_rows_(0), Init in header to be consistent with the changes I suggested to other scanners ? Line 242: Status HdfsRCFileScanner::ResetRowGroup() { This doesn't need to return a status now. PS2, Line 464: ReadRowGroup The name is a little confusing. ReadRowGroupHeader() or StartRowGroup()? Line 526: ErrorMsg msg(TErrorCode::GENERAL, Substitute("file: $0", stream_->filename())); The handling of errors here is really wonky. We return a different error from what we logged in ReportColumnParseError(). I don't think we need to fix it in this patch but maybe file a JIRA? http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-rcfile-scanner.h File be/src/exec/hdfs-rcfile-scanner.h: Line 382: /// number of rows in this rowgroup object Can we group all of the row-group-specific state into a struct? This would make it more explicit what needs to be reset between row groups. Then ResetRowGroup() could just be a method on the struct. Line 404: uint8_t* row_group_buffer_; Use a StringBuffer instead of row_group_buffer_ + row_group_length_ + row_group_buffer_size_? You'd need to return the row group length explicitly from ReadKeyBuffers() to make this work, but I think that would make it even clearer. Line 408: int row_group_length_; I think this and row_group_buffer_size_ should be int64_ts. There's a possibility of overflow. http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-scan-node.h File be/src/exec/hdfs-scan-node.h: Line 113: boost::scoped_ptr<RowBatchQueue> materialized_row_batches_; One of the harder things I found with this code is reasoning about the transfer of the row batches, especially with error paths. It would be clearer if we used unique_ptr in RowBatchQueue to document/enforce the transfer of ownership. This probably isn't worth doing if it's difficult but if it's an easy change it could help flush out any latent bugs. http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-scanner.cc File be/src/exec/hdfs-scanner.cc: Line 80: : scan_node_(NULL), Consider moving the common initialisation to the class body. Line 139: scan_node->AddMaterializedRowBatch(batch); Can you comment why we need to add the batch even on error? http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-sequence-scanner.cc File be/src/exec/hdfs-sequence-scanner.cc: Line 47: current_block_length_(-1), Init in header for consistency with suggestion elsewhere. http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-sequence-scanner.h File be/src/exec/hdfs-sequence-scanner.h: PS2, Line 191: int int64_t for consistency with byte counts elsewhere Line 191: static const int MAX_BLOCK_SIZE = (1024 * 1024 * 1024); extraneous parentheses -- To view, visit http://gerrit.cloudera.org:8080/6527 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie18f57b0d3fe0052a8ccd361b6a5fcdf979d0669 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Alex Behm <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
