Tim Armstrong has posted comments on this change. Change subject: IMPALA-3905: Add single-threaded scan node. ......................................................................
Patch Set 1: (22 comments) I did an initial pass. I still need to think through some of the details but I thought I'd flush out these comments. So far everything seems sane :) http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-avro-scanner.h File be/src/exec/hdfs-avro-scanner.h: PS1, Line 93: bool add_batches_to_queue This seems somewhat redundant with the 'scan_node' parameter. Can we add a virtual method to HdfsScanNodeBase to determine if it's the single-threaded or multi-threaded version? http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-scan-node-base.cc File be/src/exec/hdfs-scan-node-base.cc: Line 678: DCHECK(value_ctx->is_clone()); This DCHECK needs a little explanation. http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-scan-node-base.h File be/src/exec/hdfs-scan-node-base.h: Line 69: HdfsFileDesc(const std::string& filename) Space before function? I think we should also put the constructors up the top while we're here. This order is non-standard. http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-scan-node-mt.cc File be/src/exec/hdfs-scan-node-mt.cc: Line 72: scanner_->Close(row_batch); We can reset 'scanner_' at this point, right? Line 75: RETURN_IF_ERROR(runtime_state_->io_mgr()->GetNextRange(reader_context_, &scan_range_)); Long line. Line 93: num_rows_returned_ += row_batch->num_rows(); I think we should rework the control flow to avoid executing this big block of code when status is non-ok. E.g. one concern is that it would be easy to add something in there that overwrote the non-ok status with an ok status. http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-scan-node-mt.h File be/src/exec/hdfs-scan-node-mt.h: Line 52: bool scanner_eos_; This is the pattern we use elsewhere in the code, but IMO this would be simpler if the scanner has an 'eos_' field and 'eos()' method instead of relying on the caller to pass in an argument and save it. That doesn't really add any more state to the scanner, because it already had an implicit 'eos' state. http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-scan-node.h File be/src/exec/hdfs-scan-node.h: Line 108: friend class ScannerContext; // for accessing done_ Any reason we can't just add an accessor? http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-scanner.cc File be/src/exec/hdfs-scanner.cc: Line 80: // Check valid combinations of add_batches_to_queue_ and scan node type. I made a related comment earlier, but would it work to determine 'add_batches_to_queue_' based on a virtual method of the scan node? PS1, Line 119: iter It's a map entry not an iterator PS1, Line 143: iter This is an map entry not an iterator. 'entry'? http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-scanner.h File be/src/exec/hdfs-scanner.h: Line 208: /// Holds memory for template tuples. It would be helpful to document the lifetime of memory in the pool. One thought I had is whether we should name this based on its lifetime, so it's obvious that it can be used for other allocations with the same lifetime. Line 209: boost::scoped_ptr<MemPool> template_tuple_pool_; It's also a little unfortunate we have to use a whole MemPool for a single tuple - I take it there's no easy solution there? Line 219: /// file. They are owned by the HDFS scan node, although each scanner has its own This ownership has changed, right? Line 228: int tuple_byte_size_; const? Line 231: Tuple* tuple_; This is only used in a couple of subclasses. Maybe move duplicate the member into them to reduce the state in the base class. Line 238: RowBatch* batch_; This is only used for non-MT scanners right? Line 247: int32_t num_null_bytes_; const? And move next to tuple_byte_size_? Line 256: boost::scoped_ptr<Codec> decompressor_; It looks like this decompressor plus the data_buffer_pool_ is only used by some subclasses inheriting from this. It would be nice to get rid of that. I don't think this is essential though. Line 270: typedef int (*WriteTuplesFn)(HdfsScanner*, MemPool*, TupleRow*, int, FieldLocation*, Similarly this WriteTuple stuff is only used by some subclasses, would be nice to simplify the base class so it's easier to understand what state each scanner actually has. Probably not this patch since it's big enough. http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/scanner-context.cc File be/src/exec/scanner-context.cc: Line 316: if (hdfs_scan_node == NULL) return false; To check my understanding - this is no longer relevant because in the single-threaded world we only produce rows on demand in the GetNext() call? And the scanner is closed explicitly when done? http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/scanner-context.h File be/src/exec/scanner-context.h: Line 298: /// If true, the ScanNode has been cancelled and the scanner thread should finish up Update this comment to explain single-threaded behaviour? -- To view, visit http://gerrit.cloudera.org:8080/4113 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I98cc7f970e1575dd83875609985e1877ada3d5e0 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Alex Behm <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
