Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3905: HdfsScanner::GetNext() for Avro, RC, and Seq scans.
......................................................................


Patch Set 2:

(5 comments)

Responded to comments that needed a response but need to do another pass over 
the patch to better convince myself there aren't any subtle bugs.

http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/base-sequence-scanner.cc
File be/src/exec/base-sequence-scanner.cc:

Line 137:       
static_cast<HdfsScanNode*>(scan_node_)->AddMaterializedRowBatch(row_batch);
> Not opposed to the change but want to understand the reasoning a little mor
Right, but the transfer of ownership is different between the two cases - I'd 
consider that part of the interface of the method - the current code is 
combining two subtly different interfaces into one method.

I think that the responsibility for creating the batches does not clearly 
belong to either the scanner or the scan node is a symptom of that.

IMO there should actually be two different Close() interfaces - one for the 
non-MT scanners with no arguments that creates the batch internally (there's no 
reason the caller needs to create a batch except to satisfy the combined 
interface), and one for the MT scanners that attaches things to the 
caller-owned batch. I think the first can be a wrapper around the second.j


http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/base-sequence-scanner.h
File be/src/exec/base-sequence-scanner.h:

Line 86:   virtual Status GetNextInternal(RowBatch* row_batch);
> Good idea. Do you mind if I do this in a follow-on change for all scanners?
Sounds fine - it would be good though to do the follow-on patch before 
committing this just to confirm that this patch didn't miss checking the status 
in any places (it's tricky to check manually as a reviewer).


http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-avro-scanner.h
File be/src/exec/hdfs-avro-scanner.h:

PS2, Line 135: data_
> Renamed to data_block_. Struct does not seem necessary to me for 3 fields, 
I'm ok with it now that the names are more specific.


http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-rcfile-scanner.h
File be/src/exec/hdfs-rcfile-scanner.h:

Line 404:   uint8_t* row_group_buffer_;
> StringBuffer seems misleading to me because the data is binary. StringBuffe
I think maybe StringBuffer should be called MemoryBuffer or something like that 
- it doesn't really do any string-specific operations, it's just an abstraction 
over a ptr/len/capacity. I don't feel that strongly and I guess this way we 
have a more specific error message if we fail to resize the buffer.


http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-scan-node.h
File be/src/exec/hdfs-scan-node.h:

Line 113:   boost::scoped_ptr<RowBatchQueue> materialized_row_batches_;
> As a follow-on? Seems like an independent but fundamental change, want to a
If I understood correctly transfer-of-ownership currently is based on unique 
ownership so if it's implemented correctly it should be a fairly mechanical 
change. Don't want to add to the scope creep but my motivation is convincing 
myself that we're not adding any resource transfer bugs (or laying traps for 
future people to add them).

I can do another pass and see if I can convince myself that it's all correct. 
My concern is that I'm already somewhat familiar with this code and find it 
difficult to understand and convince myself it's correct.


-- 
To view, visit http://gerrit.cloudera.org:8080/6527
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie18f57b0d3fe0052a8ccd361b6a5fcdf979d0669
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: anujphadke <[email protected]>
Gerrit-HasComments: Yes

Reply via email to