Alex Behm has posted comments on this change. Change subject: IMPALA-3905: HdfsScanner::GetNext() for Avro, RC, and Seq scans. ......................................................................
Patch Set 2: (27 comments) http://gerrit.cloudera.org:8080/#/c/6527/2//COMMIT_MSG Commit Message: PS2, Line 12: ProcessSpit > nit: typo Done http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/base-sequence-scanner.cc File be/src/exec/base-sequence-scanner.cc: Line 72: header_(nullptr), > Initialise these in the class itself to avoid the duplication. Done PS2, Line 130: nullptr > Don't need this argument. Done Line 137: static_cast<HdfsScanNode*>(scan_node_)->AddMaterializedRowBatch(row_batch); > The transfer of ownership of this row batch is confusing and inconsistent b Not opposed to the change but want to understand the reasoning a little more. I don't see how the modes are inconsistent. The caller passes the last batch into Close(), that's the rule for both MT and non-MT cases. I'd argue that having Close() create the last batch in the non-MT case is actually inconsistent. It would also add duplicate code for creating the batch in all Close() function. It's odd that we are in Close() which gets passed a batch by the caller, but then create a new batch inside this function anyway. Adjusting the comment on Close() to explain this behavior seems odd. Line 211: // scan range > The logic for updating 'eos_'is a little confusing here, since we set 'eos_ Done http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/base-sequence-scanner.h File be/src/exec/base-sequence-scanner.h: Line 40: /// through the first complete sync in the next range. > The positioning of line breaks is a bit weird in this paragraphs. Also mayb Done Line 86: virtual Status GetNextInternal(RowBatch* row_batch); > Can you add WARN_UNUSED_RESULT to the functions that return Status? Good idea. Do you mind if I do this in a follow-on change for all scanners? I'd like to keep this patch focused on the these specific scanners, if possible. PS2, Line 115: Read and validate sync marker against header_->syn > Maybe "Read sync marker from 'stream_' and validate against header_->sync" Better, thanks. Done. http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-avro-scanner.cc File be/src/exec/hdfs-avro-scanner.cc: Line 60: avro_header_(nullptr), > Initialise these in the class itself to avoid having to update in multiple Done Line 535: VLOG_FILE << "Decompressed " << compressed_size << " to " << data_len_; > Remove this log? It doesn't seem very useful. Agree. Done. Line 557: if (codegend_decode_avro_data_ != nullptr) { > Flatten out this else if {} to reduce nesting? Done http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-avro-scanner.h File be/src/exec/hdfs-avro-scanner.h: PS2, Line 135: data_ > maybe data_block_* to be more explicit? Or group these into a struct for th Renamed to data_block_. Struct does not seem necessary to me for 3 fields, but will change if you feel strongly. PS2, Line 140: num_records_ > num_records_in_block_ or similar? Otherwise it's hard to know what it's cou Done http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 410: Status status = GetNextInternal(batch); > It's not obvious why the batch needs to be added to the queue (I assume it' Good point. Done. http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: Line 58: num_rows_(0), > Init in header to be consistent with the changes I suggested to other scann Done Line 242: Status HdfsRCFileScanner::ResetRowGroup() { > This doesn't need to return a status now. I coalesced this function into StartRowGroup(). PS2, Line 464: ReadRowGroup > The name is a little confusing. ReadRowGroupHeader() or StartRowGroup()? Renamed to StartRowGroup() http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-rcfile-scanner.h File be/src/exec/hdfs-rcfile-scanner.h: Line 382: /// number of rows in this rowgroup object > Can we group all of the row-group-specific state into a struct? This would I coalesced ResetRowGroup() into StartRowGroup(). Still think we need a struct? Line 404: uint8_t* row_group_buffer_; > Use a StringBuffer instead of row_group_buffer_ + row_group_length_ + row_g StringBuffer seems misleading to me because the data is binary. StringBuffer also provides a lot of functionality we do not need here. Is this just about grouping the variables into a struct? Line 408: int row_group_length_; > I think this and row_group_buffer_size_ should be int64_ts. There's a possi Good catch. Overflow was definitely possible. Done. http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-scan-node.h File be/src/exec/hdfs-scan-node.h: Line 113: boost::scoped_ptr<RowBatchQueue> materialized_row_batches_; > One of the harder things I found with this code is reasoning about the tran As a follow-on? Seems like an independent but fundamental change, want to avoid scope creep. I have a feeling the change is going to lead to more debugging. http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-scanner.cc File be/src/exec/hdfs-scanner.cc: Line 80: : scan_node_(NULL), > Consider moving the common initialisation to the class body. Done Line 139: scan_node->AddMaterializedRowBatch(batch); > Can you comment why we need to add the batch even on error? Done http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-sequence-scanner.cc File be/src/exec/hdfs-sequence-scanner.cc: Line 47: current_block_length_(-1), > Init in header for consistency with suggestion elsewhere. Done http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-sequence-scanner.h File be/src/exec/hdfs-sequence-scanner.h: PS2, Line 191: int > int64_t for consistency with byte counts elsewhere Done Line 191: static const int MAX_BLOCK_SIZE = (1024 * 1024 * 1024); > extraneous parentheses Done http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/util/runtime-profile.cc File be/src/util/runtime-profile.cc: Line 34: #include "util/pretty-printer.h" > unused. Done -- To view, visit http://gerrit.cloudera.org:8080/6527 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie18f57b0d3fe0052a8ccd361b6a5fcdf979d0669 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Alex Behm <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: anujphadke <[email protected]> Gerrit-HasComments: Yes
