Dimitris Tsirogiannis has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
......................................................................


Patch Set 4:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/8523/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8523/4//COMMIT_MSG@24
PS4, Line 24: datastructures
nit: data structures


http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/query-schedule.h
File be/src/scheduling/query-schedule.h:

http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/query-schedule.h@260
PS4, Line 260: // Map of plan node to scan ranges that are computed from 
request_. If a plan node is
             :   // not present, then it has no associated additional scan 
ranges.
             :   // Populated by the ComputeScanRanges.
             :   std::map<PlanNodeId, vector<TScanRangeLocationList>> 
computed_ranges_;
How is this related to PerNodeScanRanges (L41)?


http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/scheduler.cc
File be/src/scheduling/scheduler.cc:

http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/scheduler.cc@258
PS4, Line 258: // Combine actual and computed ranges from the schedule.
             :       const vector<TScanRangeLocationList>* scan_ranges = 
&entry.second;
             :       const vector<TScanRangeLocationList>* computed_ranges =
             :           schedule->GetComputedRanges(entry.first);
             :       vector<TScanRangeLocationList> combined_ranges;
             :       if (computed_ranges != nullptr) {
             :         for (const auto& range : entry.second) {
             :           if (!range.scan_range.__isset.hdfs_file_split_spec) {
             :             combined_ranges.push_back(range);
             :           }
             :         }
             :         for (const auto& range : 
(*schedule->GetComputedRanges(entry.first))) {
             :           combined_ranges.push_back(range);
             :         }
             :         scan_ranges = &combined_ranges;
             :       }
I think my previous comment made things worse, sorry about that. I think it's 
fine to move the scan range generation here. In that way, we will avoid the 
double iteration over all plan nodes and the additional state (per node 
computed scan ranges). Just make sure you wrap this into a separate function.


http://gerrit.cloudera.org:8080/#/c/8523/4/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/8523/4/common/thrift/PlanNodes.thrift@199
PS4, Line 199: 3: required i64 partition_id
Comment what this is referencing.


http://gerrit.cloudera.org:8080/#/c/8523/4/common/thrift/PlanNodes.thrift@209
PS4, Line 209: 4: optional THdfsFileSplitSpec hdfs_file_split_spec
Add a comment either here or at the top to describe when THdfsFileSplit vs 
THdfsFileSplitSpec is set. Actually, THdfsFileSplitSpec may not be the best 
name. How about THdfsFileSplitGenerationSpec or something like that?


http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java:

http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@126
PS4, Line 126: fileFormat'
nit: missing quote


http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@712
PS4, Line 712: fileDesc.getNumFileBlocks()
What will happen if this is a zero-byte file (no blocks). Can't we prune this 
early on or it never reaches that point?


http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@743
PS4, Line 743: partitionFs.getDefaultBlockSize(new Path(fileDesc.getFileName())
Do we need to do this for every file? Can't we do this once per partition?


http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@761
PS4, Line 761: disk ids are expected in blocks,
             :    * checkMissingDiskIds should be set to true. If disk ids are 
checked,
"checkMissingDiskIds is true, "



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 4
Gerrit-Owner: Vuk Ercegovac <[email protected]>
Gerrit-Reviewer: Dimitris Tsirogiannis <[email protected]>
Gerrit-Reviewer: Vuk Ercegovac <[email protected]>
Gerrit-Comment-Date: Wed, 17 Jan 2018 19:30:45 +0000
Gerrit-HasComments: Yes

Reply via email to