Dimitris Tsirogiannis has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 )
Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls ...................................................................... Patch Set 4: (9 comments) http://gerrit.cloudera.org:8080/#/c/8523/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8523/4//COMMIT_MSG@24 PS4, Line 24: datastructures nit: data structures http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/query-schedule.h File be/src/scheduling/query-schedule.h: http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/query-schedule.h@260 PS4, Line 260: // Map of plan node to scan ranges that are computed from request_. If a plan node is : // not present, then it has no associated additional scan ranges. : // Populated by the ComputeScanRanges. : std::map<PlanNodeId, vector<TScanRangeLocationList>> computed_ranges_; How is this related to PerNodeScanRanges (L41)? http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/scheduler.cc File be/src/scheduling/scheduler.cc: http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/scheduler.cc@258 PS4, Line 258: // Combine actual and computed ranges from the schedule. : const vector<TScanRangeLocationList>* scan_ranges = &entry.second; : const vector<TScanRangeLocationList>* computed_ranges = : schedule->GetComputedRanges(entry.first); : vector<TScanRangeLocationList> combined_ranges; : if (computed_ranges != nullptr) { : for (const auto& range : entry.second) { : if (!range.scan_range.__isset.hdfs_file_split_spec) { : combined_ranges.push_back(range); : } : } : for (const auto& range : (*schedule->GetComputedRanges(entry.first))) { : combined_ranges.push_back(range); : } : scan_ranges = &combined_ranges; : } I think my previous comment made things worse, sorry about that. I think it's fine to move the scan range generation here. In that way, we will avoid the double iteration over all plan nodes and the additional state (per node computed scan ranges). Just make sure you wrap this into a separate function. http://gerrit.cloudera.org:8080/#/c/8523/4/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/8523/4/common/thrift/PlanNodes.thrift@199 PS4, Line 199: 3: required i64 partition_id Comment what this is referencing. http://gerrit.cloudera.org:8080/#/c/8523/4/common/thrift/PlanNodes.thrift@209 PS4, Line 209: 4: optional THdfsFileSplitSpec hdfs_file_split_spec Add a comment either here or at the top to describe when THdfsFileSplit vs THdfsFileSplitSpec is set. Actually, THdfsFileSplitSpec may not be the best name. How about THdfsFileSplitGenerationSpec or something like that? http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java: http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@126 PS4, Line 126: fileFormat' nit: missing quote http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@712 PS4, Line 712: fileDesc.getNumFileBlocks() What will happen if this is a zero-byte file (no blocks). Can't we prune this early on or it never reaches that point? http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@743 PS4, Line 743: partitionFs.getDefaultBlockSize(new Path(fileDesc.getFileName()) Do we need to do this for every file? Can't we do this once per partition? http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@761 PS4, Line 761: disk ids are expected in blocks, : * checkMissingDiskIds should be set to true. If disk ids are checked, "checkMissingDiskIds is true, " -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 4 Gerrit-Owner: Vuk Ercegovac <[email protected]> Gerrit-Reviewer: Dimitris Tsirogiannis <[email protected]> Gerrit-Reviewer: Vuk Ercegovac <[email protected]> Gerrit-Comment-Date: Wed, 17 Jan 2018 19:30:45 +0000 Gerrit-HasComments: Yes
