Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9757 )

Change subject: IMPALA-6679,IMPALA-6678: reduce scan reservation
......................................................................


Patch Set 10:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/9757/10//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9757/10//COMMIT_MSG@42
PS10, Line 42: Observability:
> Is there something simple to be said about having stats vs. not having stat
I extended the commit message to address this. The heuristics are pretty 
conservative so it's unlikely that they'll grossly underestimate things unless 
the stats are actually wrong.

If the query is memory-constrained (the only case where a runtime memory 
increase beyond the planner estimate doesn't solve the problem), then the best 
way to improve query performance is by giving it more memory.

Also, if running without stats and the query is memory-constrained, it's 
reasonably likely the memory for a complex query will be wasted elsewhere 
because of over-reservation for joins, aggs, scans, etc, so generally we'd 
expect computing stats to increase the amount of memory available to operators 
that need it.


http://gerrit.cloudera.org:8080/#/c/9757/10/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9757/10/be/src/exec/hdfs-parquet-scanner.cc@1519
PS10, Line 1519: min_buffer_size, max_buffer_size
> nit: if this is always going to be the min/max buffer_size taken from io_mg
It makes the test in HdfsParquetScannerTest a bit easier to follow and also 
avoids us getting the buffer sizes from DiskIoMgr in this function and the 
caller, so removing the input parameters didn't seem to be a net simplification.


http://gerrit.cloudera.org:8080/#/c/9757/10/be/src/exec/hdfs-parquet-scanner.cc@1526
PS10, Line 1526: int64_t HdfsParquetScanner::ComputeIdealReservation(
               :     const vector<int64_t>& col_range_lengths) {
               :   DiskIoMgr* io_mgr = ExecEnv::GetInstance()->disk_io_mgr();
               :   int64_t ideal_reservation = 0;
               :   for (int64_t len : col_range_lengths) {
               :     ideal_reservation += 
io_mgr->ComputeIdealBufferReservation(len);
               :   }
               :   return ideal_reservation;
               : }
> nit: now that we have factored out the logic to io_mgr and made this method
I factored it in this way so it could be tested in isolation by 
HdfsParquetScannerTest.


http://gerrit.cloudera.org:8080/#/c/9757/10/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/9757/10/be/src/exec/hdfs-scan-node-base.cc@538
PS10, Line 538:     }
> I think you forgot to refactor this loop? or decided it wasn't worth it?
Ah I got mixed up with what I had done and not done - sorry about that.


http://gerrit.cloudera.org:8080/#/c/9757/10/be/src/exec/hdfs-scan-node.h
File be/src/exec/hdfs-scan-node.h:

http://gerrit.cloudera.org:8080/#/c/9757/10/be/src/exec/hdfs-scan-node.h@67
PS10, Line 67: IncreaseReservationForExtraThread()
> outdated comment
Done - good catch


http://gerrit.cloudera.org:8080/#/c/9757/10/be/src/exec/hdfs-scan-node.h@67
PS10, Line 67: scanner thread is responsible
             : /// for staying within the reservation handed off to it.
> mention that the scanner thread can scale up reservation in case of columna
Done


http://gerrit.cloudera.org:8080/#/c/9757/10/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

http://gerrit.cloudera.org:8080/#/c/9757/10/be/src/exec/hdfs-scan-node.cc@228
PS10, Line 228: buffer_pool_client_.GetReservation()
> curr_reservation
Done


http://gerrit.cloudera.org:8080/#/c/9757/10/be/src/exec/hdfs-scan-node.cc@276
PS10, Line 276:       // The node's min reservation is for the first thread so 
we don't need to check
> nit: maybe move this comment down a line
Done


http://gerrit.cloudera.org:8080/#/c/9757/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/9757/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@781
PS10, Line 781: partitionBytes += fileDesc.getFileLength();
> nit: can move this to the same loop at L786 if you feel it dosent impact co
I remembered why I did this. We use partitionBytes in the call to 
updateMaxScanRangeNumRows() below, so we need to compute it before the first 
iteration of the loop. I factored out the function and made the variable final 
to document the intent a bit better.



--
To view, visit http://gerrit.cloudera.org:8080/9757
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifc80e05118a9eef72cac8e2308418122e3ee0842
Gerrit-Change-Number: 9757
Gerrit-PatchSet: 10
Gerrit-Owner: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Bikramjeet Vig <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Thu, 26 Apr 2018 00:49:51 +0000
Gerrit-HasComments: Yes

Reply via email to