[Impala-ASF-CR] IMPALA-3905: Add single-threaded scan node.

Tim Armstrong (Code Review) Thu, 25 Aug 2016 09:57:59 -0700

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3905: Add single-threaded scan node.
......................................................................



Patch Set 1:

(22 comments)

I did an initial pass. I still need to think through some of the details but I 
thought I'd flush out these comments.

So far everything seems sane :)

http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-avro-scanner.h
File be/src/exec/hdfs-avro-scanner.h:

PS1, Line 93: bool add_batches_to_queue
This seems somewhat redundant with the 'scan_node' parameter. Can we add a 
virtual method to HdfsScanNodeBase to determine if it's the single-threaded or 
multi-threaded version?


http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

Line 678:     DCHECK(value_ctx->is_clone());
This DCHECK needs a little explanation.


http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

Line 69:   HdfsFileDesc(const std::string& filename)
Space before function?

I think we should also put the constructors up the top while we're here. This 
order is non-standard.


http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-scan-node-mt.cc
File be/src/exec/hdfs-scan-node-mt.cc:

Line 72:       scanner_->Close(row_batch);
We can reset 'scanner_' at this point, right?


Line 75:     
RETURN_IF_ERROR(runtime_state_->io_mgr()->GetNextRange(reader_context_, 
&scan_range_));
Long line.


Line 93:   num_rows_returned_ += row_batch->num_rows();
I think we should rework the control flow to avoid executing this big block of 
code when status is non-ok.

E.g. one concern is that it would be easy to add something in there that 
overwrote the non-ok status with an ok status.


http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-scan-node-mt.h
File be/src/exec/hdfs-scan-node-mt.h:

Line 52:   bool scanner_eos_;
This is the pattern we use elsewhere in the code, but IMO this would be simpler 
if the scanner has an 'eos_' field and 'eos()' method instead of relying on the 
caller to pass in an argument and save it.

That doesn't really add any more state to the scanner, because it already had 
an implicit 'eos' state.


http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-scan-node.h
File be/src/exec/hdfs-scan-node.h:

Line 108:   friend class ScannerContext; // for accessing done_
Any reason we can't just add an accessor?


http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

Line 80:   // Check valid combinations of add_batches_to_queue_ and scan node 
type.
I made a related comment earlier, but would it work to determine 
'add_batches_to_queue_' based on a virtual method of the scan node?


PS1, Line 119: iter
It's a map entry not an iterator


PS1, Line 143: iter
This is an map entry not an iterator. 'entry'?


http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

Line 208:   /// Holds memory for template tuples.
It would be helpful to document the lifetime of memory in the pool.

One thought I had is whether we should name this based on its lifetime, so it's 
obvious that it can be used for other allocations with the same lifetime.


Line 209:   boost::scoped_ptr<MemPool> template_tuple_pool_;
It's also a little unfortunate we have to use a whole MemPool for a single 
tuple - I take it there's no easy solution there?


Line 219:   /// file.  They are owned by the HDFS scan node, although each 
scanner has its own
This ownership has changed, right?


Line 228:   int tuple_byte_size_;
const?


Line 231:   Tuple* tuple_;
This is only used in a couple of subclasses. Maybe move duplicate the member 
into them to reduce the state in the base class.


Line 238:   RowBatch* batch_;
This is only used for non-MT scanners right?


Line 247:   int32_t num_null_bytes_;
const? And move next to tuple_byte_size_?


Line 256:   boost::scoped_ptr<Codec> decompressor_;
It looks like this decompressor plus the data_buffer_pool_ is only used by some 
subclasses inheriting from this. It would be nice to get rid  of that. I don't 
think this is essential though.


Line 270:   typedef int (*WriteTuplesFn)(HdfsScanner*, MemPool*, TupleRow*, 
int, FieldLocation*,
Similarly this WriteTuple stuff is only used by some subclasses, would be nice 
to simplify the base class so it's easier to understand what state each scanner 
actually has.

Probably not this patch since it's big enough.


http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/scanner-context.cc
File be/src/exec/scanner-context.cc:

Line 316:   if (hdfs_scan_node == NULL) return false;
To check my understanding - this is no longer relevant because in the 
single-threaded world we only produce rows on  demand in the GetNext() call? 
And the scanner is closed explicitly when done?


http://gerrit.cloudera.org:8080/#/c/4113/1/be/src/exec/scanner-context.h
File be/src/exec/scanner-context.h:

Line 298:   /// If true, the ScanNode has been cancelled and the scanner thread 
should finish up
Update this comment to explain single-threaded behaviour?


-- 
To view, visit http://gerrit.cloudera.org:8080/4113
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I98cc7f970e1575dd83875609985e1877ada3d5e0
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3905: Add single-threaded scan node.

Reply via email to