Tim Armstrong has posted comments on this change. Change subject: IMPALA-3905: HdfsScanner::GetNext() for Avro, RC, and Seq scans. ......................................................................
Patch Set 2: (5 comments) Responded to comments that needed a response but need to do another pass over the patch to better convince myself there aren't any subtle bugs. http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/base-sequence-scanner.cc File be/src/exec/base-sequence-scanner.cc: Line 137: static_cast<HdfsScanNode*>(scan_node_)->AddMaterializedRowBatch(row_batch); > Not opposed to the change but want to understand the reasoning a little mor Right, but the transfer of ownership is different between the two cases - I'd consider that part of the interface of the method - the current code is combining two subtly different interfaces into one method. I think that the responsibility for creating the batches does not clearly belong to either the scanner or the scan node is a symptom of that. IMO there should actually be two different Close() interfaces - one for the non-MT scanners with no arguments that creates the batch internally (there's no reason the caller needs to create a batch except to satisfy the combined interface), and one for the MT scanners that attaches things to the caller-owned batch. I think the first can be a wrapper around the second.j http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/base-sequence-scanner.h File be/src/exec/base-sequence-scanner.h: Line 86: virtual Status GetNextInternal(RowBatch* row_batch); > Good idea. Do you mind if I do this in a follow-on change for all scanners? Sounds fine - it would be good though to do the follow-on patch before committing this just to confirm that this patch didn't miss checking the status in any places (it's tricky to check manually as a reviewer). http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-avro-scanner.h File be/src/exec/hdfs-avro-scanner.h: PS2, Line 135: data_ > Renamed to data_block_. Struct does not seem necessary to me for 3 fields, I'm ok with it now that the names are more specific. http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-rcfile-scanner.h File be/src/exec/hdfs-rcfile-scanner.h: Line 404: uint8_t* row_group_buffer_; > StringBuffer seems misleading to me because the data is binary. StringBuffe I think maybe StringBuffer should be called MemoryBuffer or something like that - it doesn't really do any string-specific operations, it's just an abstraction over a ptr/len/capacity. I don't feel that strongly and I guess this way we have a more specific error message if we fail to resize the buffer. http://gerrit.cloudera.org:8080/#/c/6527/2/be/src/exec/hdfs-scan-node.h File be/src/exec/hdfs-scan-node.h: Line 113: boost::scoped_ptr<RowBatchQueue> materialized_row_batches_; > As a follow-on? Seems like an independent but fundamental change, want to a If I understood correctly transfer-of-ownership currently is based on unique ownership so if it's implemented correctly it should be a fairly mechanical change. Don't want to add to the scope creep but my motivation is convincing myself that we're not adding any resource transfer bugs (or laying traps for future people to add them). I can do another pass and see if I can convince myself that it's all correct. My concern is that I'm already somewhat familiar with this code and find it difficult to understand and convince myself it's correct. -- To view, visit http://gerrit.cloudera.org:8080/6527 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie18f57b0d3fe0052a8ccd361b6a5fcdf979d0669 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Alex Behm <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: anujphadke <[email protected]> Gerrit-HasComments: Yes
