Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9757 )
Change subject: IMPALA-6679,IMPALA-6678: reduce scan reservation
......................................................................

Patch Set 6:

(35 comments)

I desperately need to rebase to be able to run tests (running into pypi issues). I haven't addressed Bikram's requests about added tests yet, but I've addressed everything else, so I'll push out my current changes, rebase, and then do another pass over the comments to make sure that I've addressed everything.

http://gerrit.cloudera.org:8080/#/c/9757/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9757/6//COMMIT_MSG@37
PS6, Line 37: ideal reservation
> above you say we no longer need a concept of ideal reservation -- so what i
We still have the concept of an ideal reservation; it's just computed once we know the file layout rather than estimated in the planner. I didn't intend the above comment to mean that - reworded to reduce ambiguity.

http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-parquet-scanner-test.cc
File be/src/exec/hdfs-parquet-scanner-test.cc:

http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-parquet-scanner-test.cc@45
PS6, Line 45: // Should round up to nearest power-of-two buffer size if < max scan range buffer
> maybe add a case that is somewhere in between min and max buffer
Added an extra test. I think MIN_BUFFER_SIZE + 2 sort-of covered this, but the extra coverage doesn't hurt.

http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-parquet-scanner.cc@1699
PS6, Line 1699: ideal_reservation += min(BitUtil::RoundUpToPowerOf2(len, max_buffer_size),
             :     DiskIoMgr::IDEAL_MAX_SIZED_BUFFERS_PER_SCAN_RANGE * max_buffer_size);
> maybe mention here briefly why we increment in steps of max_buffer_size if
I realised that it is duplicating a calculation from HdfsScanNodeBase.
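To spell out the quoted calculation: each scan range contributes enough whole max-sized I/O buffers to cover its length, capped at a few max-sized buffers per range. A rough standalone sketch; the constants and helper names here are illustrative stand-ins, not the actual DiskIoMgr/BitUtil definitions:

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative values only; the real constants live in DiskIoMgr.
constexpr int64_t MAX_BUFFER_SIZE = 8 * 1024 * 1024;  // assume an 8MB max I/O buffer
constexpr int64_t IDEAL_MAX_SIZED_BUFFERS_PER_SCAN_RANGE = 3;  // assumed value

// Round 'value' up to a multiple of 'factor', where 'factor' is a power of two.
int64_t RoundUpToPowerOf2(int64_t value, int64_t factor) {
  return (value + factor - 1) & ~(factor - 1);
}

// Ideal reservation contribution of one scan range of 'len' bytes: whole
// max-sized buffers covering the range, capped so a single range never
// reserves more than IDEAL_MAX_SIZED_BUFFERS_PER_SCAN_RANGE buffers.
int64_t IdealRangeReservation(int64_t len) {
  return std::min(RoundUpToPowerOf2(len, MAX_BUFFER_SIZE),
                  IDEAL_MAX_SIZED_BUFFERS_PER_SCAN_RANGE * MAX_BUFFER_SIZE);
}
```

The cap matters because extra buffers beyond the first few can't usefully overlap I/O with processing for a single range.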
I factored that into DiskIoMgr since it's closely tied to IDEAL_MAX_SIZED_BUFFERS_PER_RANGE and other details of the IoMgr's buffer management.

http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-scan-node-base.cc@527
PS6, Line 527: int64_t target = BitUtil::RoundUpToPowerOf2(*reservation, max_buffer_size);
            : if (*reservation == target) {
            :   // If it's already an I/O buffer multiple, increase by an I/O buffer or to ideal.
            :   target = min(*reservation + max_buffer_size, ideal_scan_range_reservation);
            : }
> how about
That's a nicer way to express it.

http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-scan-node.h
File be/src/exec/hdfs-scan-node.h:

http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-scan-node.h@178
PS6, Line 178: /// this thread with DeductReservationForScannerThread().
> maybe mention that it releases Reservation and only holds on to minimum res
Good point. I referenced ReturnReservationFromScannerThread() instead of describing its behaviour inline.

http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-scan-node.h@204
PS6, Line 204: or
> nit: return all
Done

http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-scan-node.cc@226
PS6, Line 226: spare_reservation_
> what if between line 225 and 226, some scanner thread releases its reservat
We're actually holding the lock everywhere that we touch spare_reservation_, so the race isn't possible. Now that we're doing that, it's actually pointless to use an atomic for spare_reservation_, so I'll fix that.

http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-scan-node.cc@392
PS6, Line 392: // Take a snapshot of num_unqueued_files_ before calling GetNextUnstartedRange().
            : // We don't want num_unqueued_files_ to go to zero between the return from
            : // GetNextUnstartedRange() and the check for when all ranges are complete.
> update comment: replace GetNextUnstartedRange with StartNextScanRange.
Done. Below, it seemed better to describe the logical condition rather than the NULL-ness of the pointer.

http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-scan-node.cc@481
PS6, Line 481: *scanner_thread_reservation, partition, filter_ctxs, expr_results_pool);
            : context.AddStream(scan_range, *scanner_thread_reservation);
> reservation passed twice. can we remove redundancy by using context->total_
There are multiple streams per reservation, so it's not generally true that the stream's reservation is equal to the total reservation.

http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-scan-node.cc@518
PS6, Line 518: // Reservation may have been increased by the scanner.
> maybe mention, that this happens only in case of column formats like parque
I don't think this is the right place to document global assumptions about what scanner subclasses can/can't do, but it is helpful to add an example to make it clearer why it may change.

http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/scanner-context.h
File be/src/exec/scanner-context.h:

http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/scanner-context.h@379
PS6, Line 379: o
> nit: to allow
Done

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@150
PS6, Line 150: // Default reservation for a backend scan range for a column in columnar formats like
> should we mention the units just to ward off any ambiguity?
Seems reasonable. I think we've been consistent about expressing reservations in bytes, though, so it doesn't seem strictly necessary.
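For reference, the increase step quoted earlier from hdfs-scan-node-base.cc (round up to an I/O buffer multiple, or step by one buffer toward the ideal) behaves roughly like this standalone sketch; the function shape and names are illustrative, not the actual code:

```cpp
#include <algorithm>
#include <cstdint>

// Round 'value' up to a multiple of 'factor', where 'factor' is a power of two.
int64_t RoundUpToFactor(int64_t value, int64_t factor) {
  return (value + factor - 1) & ~(factor - 1);
}

// Grow a scanner thread's reservation toward the ideal in steps of the max
// I/O buffer size. Returns the new reservation target.
int64_t IncreaseReservation(int64_t reservation, int64_t max_buffer_size,
                            int64_t ideal_reservation) {
  int64_t target = RoundUpToFactor(reservation, max_buffer_size);
  if (reservation == target) {
    // Already a whole number of max-sized I/O buffers: add one more buffer,
    // but never exceed the ideal reservation.
    target = std::min(reservation + max_buffer_size, ideal_reservation);
  }
  return target;
}
```

Stepping by whole buffers means each increase is actually usable by the IoMgr rather than stranding a partial buffer's worth of reservation.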
http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@150
PS6, Line 150: backend
> IOMgr is the terminology we use below (and I think that's clearer).
Done

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@154
PS6, Line 154: // TODO: is it worth making this a tunable query option?
> not a huge fan of adding more knobs that the user doesn't know how to tune.
I think it would just be a safety valve if the planner made a bad decision.

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@157
PS6, Line 157: FOOTER_SIZE
> maybe name both the same so that a grep will also find both.
Done

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@158
PS6, Line 158: FOOTER_READ_SIZE
> nit: how about PARQUET_FOOTER_READ_SIZE, just to make it explicit
I'm going to have to update this anyway when I rebase, since the ORC patch uses the same number for the ORC footer, so I'll wait until the rebase to adjust this.

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@752
PS6, Line 752: maxScanRangeBytes_
> confusing at first when you look at this and maxScanRangeLength,
Renamed to scanRangeBytesLimit and largestScanRangeBytes.

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1264
PS6, Line 1264: fileFormats_.contains(HdfsFileFormat.PARQUET)
> Do we have enough testing for the multiple format case when it comes to res
Probably not - we only have one table, which is very basic. There's a single planner test. I moved the end-to-end test to test_scanners.py and run it with some more test dimensions. It seems like we need a larger table that includes some Parquet files, though. I filed IMPALA-6874 to keep track of this.
It would be good to add some more coverage to exercise this code but might be best to not roll it into this change. I'll start working on that in a separate patchset.

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1264
PS6, Line 1264: fileFormats_.contains(HdfsFileFormat.PARQUET)
> will estimates be biased towards parquet if there are multiple formats?
The Parquet estimates will generally be higher, so I think this is conservative. We could make it more accurate in some cases by factoring in the numbers of files of different formats and their sizes, but I don't think the mixed-format case is common enough to justify it.

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1320
PS6, Line 1320: Bound the reservation
             : * based on:
> is the following list a lower bound, upper bound, or both?
Restructured to separate upper and lower bounds.

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1326
PS6, Line 1326: .
> and is available. or .. try to increase ..
Done

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1337
PS6, Line 1337: columnReservations != null
> since != and == have same precedence, let's parenthesize this for clarity.
Done

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1373
PS6, Line 1373:
> memory
Done

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1374
PS6, Line 1374: Returns one value per column.
> like you say below, these values aren't yet quantized the buffer sizes, rig
Updated comment.

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1378
PS6, Line 1378: * logic for non-shredded columnar formats if added.
> generally, do we need to worry about orc for this patch?
I haven't rebased onto ORC yet, so not for the current patchset. I think I need to make a bit more progress on that before I know exactly what changes. With the current state of the ORC patch, it only does one I/O at a time, so the Parquet logic doesn't apply. I'm going to take a look to see if we can switch ORC to using multiple parallel scan ranges, in which case the logic will be the same as Parquet for non-nested cases, which is all we support for now.

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1394
PS6, Line 1394: // Not a top-level column, e.g. a nested collection - no stats available.
> what does this case mean? scalar but not top level but also different than
This is the case where the value is logically inside a nested collection but is being unnested into a top-level slot by the scan. Updated the comment to be hopefully clearer.

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1417
PS6, Line 1417: compute
> since this one appends (rather than returns like the other functions) maybe
Done

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1440
PS6, Line 1440: column
> scalar column?
Done

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1443
PS6, Line 1443: estimates
> nit: repeated word
Done

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1444
PS6, Line 1444: In that case the reservation
             : * may still be usable for a different column.
> you mean in the case where we underestimate it for another column but try t
Yeah, I was trying to make the point that overestimates can be cancelled out by underestimates (or columns > 4MB where we reserve less than the column size). Without that, I think readers might reasonably assume that overestimates result in memory being wasted.
Tried to reword the comment to be a little clearer.

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1448
PS6, Line 1448: computeMinColumnReservation
> should this be computeMinColumnReservationForScalar() or similar?
Done

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1465
PS6, Line 1465: encodedDataSize
> nit: encodedDataByteSize
Done

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1469
PS6, Line 1469: return estimate;
> have you verified how close this comes out to the actual size for various d
I mainly looked at TPC-H data, e.g. https://gist.github.com/timarmstrong/5df96c2c8d0b10f78e3c8e94362b5aee . It was very accurate for low-NDV columns, e.g. l_returnflag. l_orderkey threw it off because it's high-NDV but highly compressible, since the file is clustered by l_orderkey. This is probably somewhat ideal, in that the data is uniformly distributed between files and fairly regular. For messier cases there's probably more of a tendency to overestimate. I added stats counters for the ideal and actual reservations for Parquet. I think those should indicate if there's a systematic problem of over/under-reservation for a scan.

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1472
PS6, Line 1472: private static long roundUpToIoBuffer(long bytes, long maxIoBufferSize) {
> add a quick function comment.
Done

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1475
PS6, Line 1475: BitUtil.roundUpToPowerOf2Factor(bytes, maxIoBufferSize);
> why is it the right thing to always have a multiple of maxIoBufferSize in t
Updated the fn comment to provide some motivation.
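Roughly, the motivation is that a reservation is only usable by the IoMgr in buffer-sized chunks: below the max I/O buffer size a read uses one power-of-two buffer, and at or above it the read is split across max-sized buffers, so only whole multiples of the max are usable. A hypothetical sketch of that quantization (not the actual roundUpToIoBuffer/BitUtil implementation; the below-max power-of-two branch is an assumption for illustration):

```cpp
#include <cstdint>

// Round 'value' up to the next power of two (illustrative helper).
int64_t RoundUpToPowerOf2(int64_t value) {
  int64_t result = 1;
  while (result < value) result <<= 1;
  return result;
}

// Round 'value' up to a multiple of 'factor', where 'factor' is a power of two.
int64_t RoundUpToPowerOf2Factor(int64_t value, int64_t factor) {
  return (value + factor - 1) & ~(factor - 1);
}

// Quantize a reservation to I/O buffer sizes: a single power-of-two buffer
// below the max, whole max-sized buffers at or above it.
int64_t RoundUpToIoBuffer(int64_t bytes, int64_t max_io_buffer_size) {
  return bytes < max_io_buffer_size
      ? RoundUpToPowerOf2(bytes)
      : RoundUpToPowerOf2Factor(bytes, max_io_buffer_size);
}
```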
--
To view, visit http://gerrit.cloudera.org:8080/9757
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifc80e05118a9eef72cac8e2308418122e3ee0842
Gerrit-Change-Number: 9757
Gerrit-PatchSet: 6
Gerrit-Owner: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Bikramjeet Vig <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Wed, 18 Apr 2018 18:40:46 +0000
Gerrit-HasComments: Yes
