Tim Armstrong has posted comments on this change. Change subject: IMPALA-3905: Add HdfsScanner::GetNext() interface and implementation for Parquet. ......................................................................
Patch Set 2: (19 comments) http://gerrit.cloudera.org:8080/#/c/3732/2/be/src/exec/base-sequence-scanner.cc File be/src/exec/base-sequence-scanner.cc: Line 139: scan_node_->ThreadTokenAvailableCb(state_->resource_pool()); Has this just been moved from somewhere else? I don't understand this part of the change. http://gerrit.cloudera.org:8080/#/c/3732/2/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 139: skip_row_group_(true), It looks like it's set to false before it's used anyway, but it's counterintuitive to initialise this to true. Line 170: DCHECK(parse_status_.ok()) << "Invalid parse_status_" << parse_status_.GetDetail(); It seems like a lot of the below work corresponds more to the Open() phase of exec nodes instead of Prepare(). Maybe we should rename HdfsScanner::Prepare() to HdfsScanner::Open() that that the lifecycle lines up closer to the ExecNode lifecycle? PS2, Line 236: NULL Passing NULL argument to reset() is redundant. Line 310: assemble_rows_timer_.Start(); Seems easier to use a scoped timer here and avoid splitting the function call and RETURN_IF_ERROR below. Line 312: int max_tuples = min<int64_t>(row_batch->capacity(), rows_remaining); Why the int64_t here? Everything else involved is an int. Line 325: assemble_rows_timer_.Start(); SCOPED_TIMER? Line 354: if (!scan_node_->PartitionPassesFilterPredicates( Wrapping seems weird here - ok if it's just a unfortunate consequence of the line lengths. Line 402: // This row group will be handled by a different scanner. Maybe explain briefly why? It is because it's included in the previous or next split? PS2, Line 409: seeding_succeeded seeded_ok? It's bit of a mouthful. Line 439: void HdfsParquetScanner::TransferResources(RowBatch* row_batch) { It's not immediately clear what resources this is transferring and when it's safe to call it. Line 443: for (ParquetColumnReader* r: column_readers_) r->TransferResources(row_batch); I don't feel strongly about this, but we don't generally do one-line for statements like this. PS2, Line 514: scan node Update comment - there won't necessarily be a scan node. http://gerrit.cloudera.org:8080/#/c/3732/2/be/src/exec/hdfs-scan-node.cc File be/src/exec/hdfs-scan-node.cc: Line 270: ThreadTokenAvailableCb(runtime_state_->resource_pool()); I'm not sure that I understand the intended effect of these ThreadTokenAvailableCb() changes. Was it just refactoring? Line 876: Extraneous new line. http://gerrit.cloudera.org:8080/#/c/3732/2/be/src/exec/hdfs-scanner.h File be/src/exec/hdfs-scanner.h: Line 114: /// The memory referenced by tuples in the returned batch is valid until Close(). This isn't the guarantee that ExecNode::GetNext() provides (actually I'm not sure if these scanners guarantee it) and I don't think it's the one we want. I'm reading this as that you can assume that any memory referenced from the first batch returned is valid up until you call Close(), regardless of what you do with the batch. The guarantee is more like "The memory referenced by the tuples is valid until either Close() is called, the batch is reset or destroyed, or a batch returned from a subsequent GetNext() call is reset or destroyed." PS2, Line 115: bool* eos One idea I've thought about for ExecNode is that it might be simpler to make eos_ a member, rather than returning it from GetNext(). This might be a good place to test the idea if you think it is worthwhile. When it's an output parameter it gets a little finicky trying to make sure that it's set correctly on all code paths and also making sure that the caller doesn't lose track of whether it got an 'eos' flag or not. Line 124: virtual void TransferResources(RowBatch* row_batch) { I'm not sure that it makes sense to make this as part of the HdfsScanner interface since the resource management of each scanner is somewhat particular to the file format. I can't think of a scenario when it would make sense for an external class to call this, since there's no general rule about when it's safe to transfer resources. E.g. in Parquet it would probably be clearer to have a TransferRowGroupResources() function. http://gerrit.cloudera.org:8080/#/c/3732/2/be/src/exec/parquet-column-readers.h File be/src/exec/parquet-column-readers.h: Line 228: virtual void TransferResources(RowBatch* row_batch) = 0; When is it safe to call this? -- To view, visit http://gerrit.cloudera.org:8080/3732 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iab50770bac05afcda4d3404fb4f53a0104931eb0 Gerrit-PatchSet: 2 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Alex Behm <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
