Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8966 )
Change subject: IMPALA-4835: Part 3: switch I/O buffers to buffer pool
......................................................................

Patch Set 20:

(38 comments)

Thanks for all the comments. Did my best to address them.

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner-test.cc
File be/src/exec/hdfs-parquet-scanner-test.cc:

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner-test.cc@56
PS20, Line 56: 8
> 8?
Not sure what I did there. Fixed.

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner-test.cc@63
PS20, Line 63: TestDivideReservation({50 * 1024 * 1024, 100 * 1024 * 1024}, 5 * MAX_BUFFER_SIZE,
            :     {2 * MAX_BUFFER_SIZE, 3 * MAX_BUFFER_SIZE});
> maybe a variant on this cases with 5*MAX_BUFFER_SIZE + MIN_BUFFER_SIZE rese
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner-test.cc@71
PS20, Line 71: MIN_BUFFER_SIZE
> maybe this case but with 2*MIN_BUFFER_SIZE and that should give 2*MIN_BUFFE
Done. Thanks for the suggestions, I added those and a couple more variants. I struggled a bit coming up with ideas for tests here initially.

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.h@641
PS20, Line 641: between the
            : /// column scan ranges
> maybe just say "between the columns" or "column readers" since this is also
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.h@643
PS20, Line 643: col_readers
> column_readers
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.h@651
PS20, Line 651: column index
> it wasn't clear what the column index was until reading the code. Let's exp
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.h@688
PS20, Line 688: The collection readers
            : /// are flattened into collection_readers_. The scalar readers are partitioned into
            : /// dict_filterable_readers_ and non_dict_filterable_readers_ depending on whether
            : /// dictionary filtering is enabled and the reader can be dictionary filtered.
> update for scalar_readers_
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.cc@254
PS20, Line 254: // The scanner-wide stream was used only to read the file footer. Each column has added
            : // its own stream.
            : stream_ = nullptr;
> this seems redundant.
Yup

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.cc@682
PS20, Line 682: Prepare dictionary filtering columns for first read
> this seems a little weird since it's actually doing the read, not just prep
Reworded. Also explicitly mentioned that EvalDictionaryFilters() requires this.

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.cc@1687
PS20, Line 1687: io_mgr->min_buffer_size();
> how does that relate, if at all, to the buffer pool min buffer size? Since
Yeah I agree. Filed IMPALA-6555 for the cleanup.

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.cc@1689
PS20, Line 1689: if (reservation_to_distribute < min_buffer_size * column_readers.size()) {
> Mention this invariant should have been maintained by HdfsScanNode.java
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.cc@1752
PS20, Line 1752: bytes_to_add = 0;
> just to check my understanding, this is a terminal state for this particula
That's correct. Your comment gave me an idea for wording this comment better - I hope it's an improvement.
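The reservation-distribution scheme discussed in these threads can be sketched in Python. This is a simplified model, not the actual implementation under review: the buffer-size constants and the three-max-buffers-per-column cap are assumptions for illustration, and `divide_reservation` is a hypothetical name. The sketch does, however, reproduce the test case quoted earlier, where 50MB and 100MB columns with 5 * MAX_BUFFER_SIZE of reservation end up with 2 and 3 max-size buffers respectively.

```python
# Simplified sketch of splitting a scan's memory reservation between
# Parquet column readers. NOT Impala's actual code; constants and the
# MAX_BUFFERS_PER_COLUMN cap are illustrative assumptions.
MIN_BUFFER = 8 * 1024            # e.g. --min_buffer_size
MAX_BUFFER = 8 * 1024 * 1024     # e.g. rounded-up --read_size
MAX_BUFFERS_PER_COLUMN = 3       # assumed cap on one column's ideal share

def round_up_pow2(x):
    return 1 if x <= 1 else 1 << (x - 1).bit_length()

def round_down_pow2(x):
    return 1 << (x.bit_length() - 1)

def divide_reservation(col_bytes, reservation):
    """Grow each column's share one power-of-two buffer at a time,
    visiting the column with the most uncovered bytes first."""
    n = len(col_bytes)
    # Invariant maintained by the planner: at least one min buffer per column.
    assert reservation >= n * MIN_BUFFER
    alloc = [0] * n
    remaining = reservation
    while remaining >= MIN_BUFFER:
        # A column never needs more than its rounded-up size, and its share
        # is capped so one huge column can't blow up the ideal reservation.
        cap = [min(MAX_BUFFERS_PER_COLUMN * MAX_BUFFER,
                   round_up_pow2(col_bytes[i])) for i in range(n)]
        todo = sorted((i for i in range(n)
                       if alloc[i] < cap[i] and alloc[i] < col_bytes[i]),
                      key=lambda i: col_bytes[i] - alloc[i], reverse=True)
        if not todo:
            break
        for i in todo:
            if remaining < MIN_BUFFER:
                break
            bytes_left = col_bytes[i] - alloc[i]
            # Mirrors the rounding in the quoted bytes_to_add expression.
            add = min(round_down_pow2(remaining), MAX_BUFFER,
                      max(MIN_BUFFER, round_up_pow2(bytes_left)))
            alloc[i] += add
            remaining -= add
    return alloc
```

Under this model, the skewed-file example from the discussion (one 128MB column plus fifty 2.5MB columns) gets a bounded ideal reservation: the big column is capped at three max-size buffers and each small column at its rounded-up size, instead of one max-size buffer per column regardless of size.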
http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.cc@1756
PS20, Line 1756: bytes_to_add = min(BitUtil::RoundDownToPowerOfTwo(reservation_to_distribute),
            :     max(min_buffer_size, BitUtil::RoundUpToPowerOfTwo(bytes_left_in_range)));
> I think that could use a brief explanation
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.cc@1766
PS20, Line 1766: }
> So I guess when the column chunks are smaller than an max buffer, we'll hav
My motivation for all this complication was mostly the case where you have reasonably wide Parquet files with skewed column sizes, which I think is common in practice. E.g. if you have a 256MB Parquet file with one 128MB column and 50 2.5MB columns, then you don't want the ideal reservation to be 1.2GB, which is much larger than the file size, but you also want to allocate a reasonable amount of reservation to the 128MB column so you can scan it efficiently. The simpler solutions I could see either blew up memory requirements or could end up doing a lot of small I/Os. I added a bit more motivation to a comment above, since it was missing.

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/parquet-column-readers.h@338
PS20, Line 338: initialized
> Reset()?
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/parquet-column-readers.h@437
PS20, Line 437: .
> delete
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/parquet-column-readers.cc@918
PS20, Line 918: ScannerContext* context = parent_->context_;
> DCHECK(scan_range_ != nullptr) << "Reset() first" ?
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/parquet-column-readers.cc@923
PS20, Line 923: // Note: if we don't need buffers for this scan range, e.g. the file is in the HDFS
            : // cache, we use the reservation for another purpose.
> I'm not sure what this is saying. What other purpose and how does that wor
Leftover from an older version where I had ambitions about being more clever with reservation. Deleted the comment.

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/scanner-context.h
File be/src/exec/scanner-context.h:

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/scanner-context.h@224
PS20, Line 224: have been returned to 'scan_range_'
> is it really about "returning to the scan_range_"? Doesn't that potentiall
Done. Seems clearer.

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/scanner-context.h@370
PS20, Line 370: 'reservation' is the amount of reservation that
            : /// is given to this stream to use for additional scan ranges, e.g. to read past the
            : /// end of 'range'. '
> I think this sentence is still confusing/misleading. The explanation above
Used the text from above.

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/scanner-context.h@373
PS20, Line 373: are freed before 'reservation' is used
> that's also misleading since the reservation is already in use.
I think the new comment text copied from above makes this clearer.

http://gerrit.cloudera.org:8080/#/c/8966/19/be/src/exec/scanner-context.h
File be/src/exec/scanner-context.h:

http://gerrit.cloudera.org:8080/#/c/8966/19/be/src/exec/scanner-context.h@223
PS19, Line 223: e this until
            : /// all
> Doesn't seem to have changed. Generally, what's the summary of the relatio
Now:
  DiskIoMgr::min_buffer_size() == BufferPool::min_buffer_len() == --min_buffer_size
  DiskIoMgr::max_buffer_size() == RoundUpToPowerOfTwo(--read_size)
This is actually unchanged except for the rounding-up of --read_size. There's now a hard requirement that DiskIoMgr::min_buffer_size() >= BufferPool::min_buffer_len() that wasn't there before (they just happened to be controlled by the same flag before). We could get rid of DiskIoMgr::min_buffer_size() and just use BufferPool::min_buffer_len() as a cleanup.

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/runtime/bufferpool/reservation-tracker-test.cc
File be/src/runtime/bufferpool/reservation-tracker-test.cc:

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/runtime/bufferpool/reservation-tracker-test.cc@533
PS20, Line 533: EXPECT_EQ(32 * MEG, ReservationUtil::GetMinMemLimitFromReservation(-1));
> what's the story with these changes?
I changed the minimum amount of non-buffer-pool memory per query. The previous value of 75MB was meant to guarantee headroom for non-buffer-pool memory, including I/O buffers. I chose the new value to avoid regressions (with one exception) in test_mem_usage_scaling.py.

http://gerrit.cloudera.org:8080/#/c/8966/20/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8966/20/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@177
PS20, Line 177: // The ideal reservation to process a single input split, >= the minimum reservation.
> Actually, we were having trouble distinguishing between "frontend" scan ran
As a compromise I switched to saying "scan range (i.e. hdfs split)" in the places where it seemed ambiguous. I noticed that computeScanRangeLocations() already does this, so it seems consistent with the existing terminology.

http://gerrit.cloudera.org:8080/#/c/8966/20/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@178
PS20, Line 178: // The backend scan node tries to get this amount of reservation per scanner thread, but
> Description confuses me because it's not clear how the reservation mechanis
Yeah, it does seem best to avoid getting into the scanner thread thing here. Used a modified version of your comment. I omitted the second sentence since it seems tangential (it's generally true of minimum reservations, not specifically related to this value).
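The buffer-size relationships summarized in the scanner-context.h thread above can be written down as a small sanity check. The flag values below are illustrative defaults, not authoritative; the point is the stated invariants, not the numbers.

```python
# Sanity-check sketch of the buffer-size relationships described above.
# Flag values are illustrative assumptions.
def round_up_pow2(x):
    return 1 if x <= 1 else 1 << (x - 1).bit_length()

min_buffer_size_flag = 8 * 1024       # --min_buffer_size
read_size_flag = 8 * 1024 * 1024      # --read_size

# Per the reply: both minimums come from the same flag now...
buffer_pool_min_buffer_len = min_buffer_size_flag
disk_io_mgr_min_buffer_size = min_buffer_size_flag
# ...and the I/O max buffer size is --read_size rounded up to a power of two.
disk_io_mgr_max_buffer_size = round_up_pow2(read_size_flag)

# Hard requirement called out in the reply:
assert disk_io_mgr_min_buffer_size >= buffer_pool_min_buffer_len
# And min never exceeds max:
assert disk_io_mgr_min_buffer_size <= disk_io_mgr_max_buffer_size
```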
http://gerrit.cloudera.org:8080/#/c/8966/20/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@860
PS20, Line 860: if (havePosSlot && numColumns == 0) numColumns = 1;
> numColumns = Math.max(numColumns, 1);
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1271
PS20, Line 1271: perHostScanRanges = computeNumColumnsReadFromFile();
> perHostScanRanges = numColumnsReadFromFile;
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1315
PS20, Line 1315: idealScanRangeReservation_
> is there any way, for debugging, to see this value? should there be?
Not right now. I thought about adding it to the explain output. It would be mildly useful for me personally for debugging during development, but I wasn't sure if it was useful enough in production to justify adding more clutter to verbose plans. I don't think it answers either of the most important questions: "how much memory does the scan need?" and "how much memory does the scan need to run fast?"

http://gerrit.cloudera.org:8080/#/c/8966/20/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1335
PS20, Line 1335: // per column in the backend. Row-oriented formats issue one scan range per file.
> The row-oriented part is not completely accurate right? The BaseSequenceSca
Yeah, it was missing the qualifier "at a time". All of them can issue multiple scan ranges to read past the end of the input split.
http://gerrit.cloudera.org:8080/#/c/8966/20/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1343
PS20, Line 1343: maxScanRangeBytes_
> Precondition that this is >= 0
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
File testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test:

http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test@959
PS20, Line 959: # Parquet nested types, left nested in scan - should reserve memory for nested column
I came up with a couple more permutations on the nested types tests after looking at the code again:
* materializing a nested column inside a collection, but no top-level column
* materializing two levels of nesting
I also think the DISTRIBUTEDPLAN and PARALLELPLANS sections aren't that useful for all the permutations of the scan tests, so I removed a lot of them to reduce the size of the test file.

http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-query/queries/QueryTest/admission-reject-min-reservation.test
File testdata/workloads/functional-query/queries/QueryTest/admission-reject-min-reservation.test:

http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-query/queries/QueryTest/admission-reject-min-reservation.test@8
PS20, Line 8: 100
> why did this decrease?
Relative to previous patchsets? The 256KB min reservation per column previously came into play, since the files involved are very small. Now that the min reservation is 8KB, the reservation decreased.
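The decrease in that reply follows from back-of-the-envelope arithmetic: the minimum scan reservation scales with the number of columns times the per-column minimum (cf. the `reservation_to_distribute < min_buffer_size * column_readers.size()` check quoted earlier). The column count below is made up for illustration; only the 256KB-to-8KB change comes from the email.

```python
# Rough model of the minimum scan reservation discussed above.
# The 3-column scan is an illustrative assumption.
KB = 1024

def min_scan_reservation(num_columns, per_column_min):
    # At least one minimum-sized buffer per column reader.
    return num_columns * per_column_min

before = min_scan_reservation(3, 256 * KB)  # old 256KB per-column minimum
after = min_scan_reservation(3, 8 * KB)     # new 8KB per-column minimum
assert before // after == 32                # a 32x drop, whatever the column count
```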
http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-query/queries/QueryTest/admission-reject-min-reservation.test@32
PS20, Line 32:
> and these two
See above

http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-query/queries/QueryTest/analytic-fns.test
File testdata/workloads/functional-query/queries/QueryTest/analytic-fns.test:

http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-query/queries/QueryTest/analytic-fns.test@1587
PS20, Line 1587: 7MB extra for I/O buffers)
> how did you determine that? is that the min reservation computed by the new
Yeah, based on the plan. I hadn't updated this since the previous changes - it's now a much lower value.

http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-query/queries/QueryTest/codegen-mem-limit.test
File testdata/workloads/functional-query/queries/QueryTest/codegen-mem-limit.test:

http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-query/queries/QueryTest/codegen-mem-limit.test@6
PS20, Line 6: with t as (values(1), (2), (3), (4)) select min(t1.`1` + t2.`1`) from t t1 join t t2;
> what the reason behind this change?
The old query now fails to get reservation and fails earlier than before, instead of during codegen as intended. I needed a query without a scan in it to actually hit "memory limit exceeded" at the right place.

http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
File testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test:

http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test@a165
PS20, Line 165:
> don't we still want to do that?
I filed IMPALA-6560 to do that. It might need some specially crafted data now to do it non-flakily.
http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-query/queries/QueryTest/spilling-no-debug-action.test
File testdata/workloads/functional-query/queries/QueryTest/spilling-no-debug-action.test:

http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-query/queries/QueryTest/spilling-no-debug-action.test@7
PS20, Line 7: set debug_action="-1:OPEN:SET_DENY_RESERVATION_PROBABILITY@1.0";
> give the name of the file, this is surprising. Can we add a file level comm
Updated comments here and in Python to clarify.

http://gerrit.cloudera.org:8080/#/c/8966/20/tests/query_test/test_query_mem_limit.py
File tests/query_test/test_query_mem_limit.py:

http://gerrit.cloudera.org:8080/#/c/8966/20/tests/query_test/test_query_mem_limit.py@122
PS20, Line 122: num_nodes=1
> why do we do that?
Added a comment. Without it, the fragments raced to allocate memory and it usually failed somewhere else.

--
To view, visit http://gerrit.cloudera.org:8080/8966
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic09c6196b31e55b301df45cc56d0b72cfece6786
Gerrit-Change-Number: 8966
Gerrit-PatchSet: 20
Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bikramjeet....@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Thu, 22 Feb 2018 01:13:13 +0000
Gerrit-HasComments: Yes