Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8966 )
Change subject: IMPALA-4835: Part 3: switch I/O buffers to buffer pool
......................................................................

Patch Set 20:

(38 comments)

Thanks for all the comments. Did my best to address them.

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner-test.cc
File be/src/exec/hdfs-parquet-scanner-test.cc:

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner-test.cc@56
PS20, Line 56: 8
> 8?
Not sure what I did there. Fixed.

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner-test.cc@63
PS20, Line 63: TestDivideReservation({50 * 1024 * 1024, 100 * 1024 * 1024}, 5 * MAX_BUFFER_SIZE,
            :     {2 * MAX_BUFFER_SIZE, 3 * MAX_BUFFER_SIZE});
> maybe a variant on this cases with 5*MAX_BUFFER_SIZE + MIN_BUFFER_SIZE rese
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner-test.cc@71
PS20, Line 71: MIN_BUFFER_SIZE
> maybe this case but with 2*MIN_BUFFER_SIZE and that should give 2*MIN_BUFFE
Done. Thanks for the suggestions, I added those and a couple more variants. I struggled a bit coming up with ideas for tests here initially.

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.h@641
PS20, Line 641: between the
            : /// column scan ranges
> maybe just say "between the columns" or "column readers" since this is also
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.h@643
PS20, Line 643: col_readers
> column_readers
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.h@651
PS20, Line 651: column index
> it wasn't clear what the column index was until reading the code. Let's exp
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.h@688
PS20, Line 688: The collection readers
            : /// are flattened into collection_readers_. The scalar readers are partitioned into
            : /// dict_filterable_readers_ and non_dict_filterable_readers_ depending on whether
            : /// dictionary filtering is enabled and the reader can be dictionary filtered.
> update for scalar_readers_
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.cc@254
PS20, Line 254: // The scanner-wide stream was used only to read the file footer. Each column has added
            : // its own stream.
            : stream_ = nullptr;
> this seems redundant.
Yup

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.cc@682
PS20, Line 682: Prepare dictionary filtering columns for first read
> this seems a little weird since it's actually doing the read, not just prep
Reworded. Also explicitly mentioned that EvalDictionaryFilters() requires this.

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.cc@1687
PS20, Line 1687: io_mgr->min_buffer_size();
> how does that relate, if at all, to the buffer pool min buffer size? Since
Yeah I agree. Filed IMPALA-6555 for the cleanup.

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.cc@1689
PS20, Line 1689: if (reservation_to_distribute < min_buffer_size * column_readers.size()) {
> Mention this invariant should have been maintained by HdfsScanNode.java
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.cc@1752
PS20, Line 1752: bytes_to_add = 0;
> just to check my understanding, this is a terminal state for this particula
That's correct. Your comment gave me an idea for wording this comment better - I hope it's an improvement.
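The reservation-distribution scheme discussed in these threads can be sketched in Python. This is a simplified model, not the actual implementation under review: the buffer-size constants and the three-max-buffers-per-column cap are assumptions for illustration, and `divide_reservation` is a hypothetical name. The sketch does, however, reproduce the test case quoted earlier, where 50MB and 100MB columns with 5 * MAX_BUFFER_SIZE of reservation end up with 2 and 3 max-size buffers respectively.

```python
# Simplified sketch of splitting a scan's memory reservation between
# Parquet column readers. NOT Impala's actual code; constants and the
# MAX_BUFFERS_PER_COLUMN cap are illustrative assumptions.
MIN_BUFFER = 8 * 1024            # e.g. --min_buffer_size
MAX_BUFFER = 8 * 1024 * 1024     # e.g. rounded-up --read_size
MAX_BUFFERS_PER_COLUMN = 3       # assumed cap on one column's ideal share

def round_up_pow2(x):
    return 1 if x <= 1 else 1 << (x - 1).bit_length()

def round_down_pow2(x):
    return 1 << (x.bit_length() - 1)

def divide_reservation(col_bytes, reservation):
    """Grow each column's share one power-of-two buffer at a time,
    visiting the column with the most uncovered bytes first."""
    n = len(col_bytes)
    # Invariant maintained by the planner: at least one min buffer per column.
    assert reservation >= n * MIN_BUFFER
    alloc = [0] * n
    remaining = reservation
    while remaining >= MIN_BUFFER:
        # A column never needs more than its rounded-up size, and its share
        # is capped so one huge column can't blow up the ideal reservation.
        cap = [min(MAX_BUFFERS_PER_COLUMN * MAX_BUFFER,
                   round_up_pow2(col_bytes[i])) for i in range(n)]
        todo = sorted((i for i in range(n)
                       if alloc[i] < cap[i] and alloc[i] < col_bytes[i]),
                      key=lambda i: col_bytes[i] - alloc[i], reverse=True)
        if not todo:
            break
        for i in todo:
            if remaining < MIN_BUFFER:
                break
            bytes_left = col_bytes[i] - alloc[i]
            # Mirrors the rounding in the quoted bytes_to_add expression.
            add = min(round_down_pow2(remaining), MAX_BUFFER,
                      max(MIN_BUFFER, round_up_pow2(bytes_left)))
            alloc[i] += add
            remaining -= add
    return alloc
```

Under this model, the skewed-file example from the discussion (one 128MB column plus fifty 2.5MB columns) gets a bounded ideal reservation: the big column is capped at three max-size buffers and each small column at its rounded-up size, instead of one max-size buffer per column regardless of size.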
http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.cc@1756
PS20, Line 1756: bytes_to_add = min(BitUtil::RoundDownToPowerOfTwo(reservation_to_distribute),
            :     max(min_buffer_size, BitUtil::RoundUpToPowerOfTwo(bytes_left_in_range)));
> I think that could use a brief explanation
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/hdfs-parquet-scanner.cc@1766
PS20, Line 1766: }
> So I guess when the column chunks are smaller than an max buffer, we'll hav
My motivation for all this complication was mostly the case where you have reasonably wide Parquet files with skewed column sizes, which I think is common in practice. E.g. if you have a 256MB Parquet file with one 128MB column and 50 2.5MB columns, then you don't want the ideal reservation to be 1.2GB, which is much larger than the file size, but you also want to allocate a reasonable amount of reservation to the 128MB column so you can scan it efficiently. The simpler solutions I could see either blew up memory requirements or could end up doing a lot of small I/Os. I added a bit more motivation to a comment above, since it was missing.

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/parquet-column-readers.h@338
PS20, Line 338: initialized
> Reset()?
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/parquet-column-readers.h@437
PS20, Line 437: .
> delete
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/parquet-column-readers.cc@918
PS20, Line 918: ScannerContext* context = parent_->context_;
> DCHECK(scan_range_ != nullptr) << "Reset() first" ?
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/parquet-column-readers.cc@923
PS20, Line 923: // Note: if we don't need buffers for this scan range, e.g. the file is in the HDFS
            : // cache, we use the reservation for another purpose.
> I'm not sure what this is saying. What other purpose and how does that wor
Leftover from an older version where I had ambitions about being more clever with reservation. Deleted the comment.

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/scanner-context.h
File be/src/exec/scanner-context.h:

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/scanner-context.h@224
PS20, Line 224: have been returned to 'scan_range_'
> is it really about "returning to the scan_range_"? Doesn't that potentiall
Done. Seems clearer.

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/scanner-context.h@370
PS20, Line 370: 'reservation' is the amount of reservation that
            : /// is given to this stream to use for additional scan ranges, e.g. to read past the
            : /// end of 'range'. '
> I think this sentence is still confusing/misleading. The explanation above
Used the text from above.

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/exec/scanner-context.h@373
PS20, Line 373: are freed before 'reservation' is used
> that's also misleading since the reservation is already in use.
I think the new comment text copied from above makes this clearer.

http://gerrit.cloudera.org:8080/#/c/8966/19/be/src/exec/scanner-context.h
File be/src/exec/scanner-context.h:

http://gerrit.cloudera.org:8080/#/c/8966/19/be/src/exec/scanner-context.h@223
PS19, Line 223: e this until
            : /// all
> Doesn't seem to have changed. Generally, what's the summary of the relatio
Now:
  DiskIoMgr::min_buffer_size() == BufferPool::min_buffer_len() == --min_buffer_size
  DiskIoMgr::max_buffer_size() == RoundUpToPowerOfTwo(--read_size)
This is actually unchanged except for the rounding-up of --read_size. There's now a hard requirement that DiskIoMgr::min_buffer_size() >= BufferPool::min_buffer_len() that wasn't there before (they just happened to be controlled by the same flag before). We could get rid of DiskIoMgr::min_buffer_size() and just use BufferPool::min_buffer_len() as a cleanup.

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/runtime/bufferpool/reservation-tracker-test.cc
File be/src/runtime/bufferpool/reservation-tracker-test.cc:

http://gerrit.cloudera.org:8080/#/c/8966/20/be/src/runtime/bufferpool/reservation-tracker-test.cc@533
PS20, Line 533: EXPECT_EQ(32 * MEG, ReservationUtil::GetMinMemLimitFromReservation(-1));
> what's the story with these changes?
I changed the minimum amount of non-buffer-pool memory per query. The previous value of 75MB was meant to guarantee headroom for non-buffer-pool memory, including I/O buffers. I chose the new value to avoid regressions (with one exception) in test_mem_usage_scaling.py.

http://gerrit.cloudera.org:8080/#/c/8966/20/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8966/20/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@177
PS20, Line 177: // The ideal reservation to process a single input split, >= the minimum reservation.
> Actually, we were having trouble distinguishing between "frontend" scan ran
As a compromise I switched to saying "scan range (i.e. hdfs split)" in the places where it seemed ambiguous. I noticed that computeScanRangeLocations() already does this, so it seems consistent with the existing terminology.

http://gerrit.cloudera.org:8080/#/c/8966/20/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@178
PS20, Line 178: // The backend scan node tries to get this amount of reservation per scanner thread, but
> Description confuses me because it's not clear how the reservation mechanis
Yeah, it does seem best to avoid getting into the scanner thread thing here. Used a modified version of your comment. I omitted the second sentence since it seems tangential (it's generally true of minimum reservations, not specifically related to this value).
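The buffer-size relationships summarized in the scanner-context.h thread above can be written down as a small sanity check. The flag values below are illustrative defaults, not authoritative; the point is the stated invariants, not the numbers.

```python
# Sanity-check sketch of the buffer-size relationships described above.
# Flag values are illustrative assumptions.
def round_up_pow2(x):
    return 1 if x <= 1 else 1 << (x - 1).bit_length()

min_buffer_size_flag = 8 * 1024       # --min_buffer_size
read_size_flag = 8 * 1024 * 1024      # --read_size

# Per the reply: both minimums come from the same flag now...
buffer_pool_min_buffer_len = min_buffer_size_flag
disk_io_mgr_min_buffer_size = min_buffer_size_flag
# ...and the I/O max buffer size is --read_size rounded up to a power of two.
disk_io_mgr_max_buffer_size = round_up_pow2(read_size_flag)

# Hard requirement called out in the reply:
assert disk_io_mgr_min_buffer_size >= buffer_pool_min_buffer_len
# And min never exceeds max:
assert disk_io_mgr_min_buffer_size <= disk_io_mgr_max_buffer_size
```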
http://gerrit.cloudera.org:8080/#/c/8966/20/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@860
PS20, Line 860: if (havePosSlot && numColumns == 0) numColumns = 1;
> numColumns = Math.max(numColumns, 1);
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1271
PS20, Line 1271: perHostScanRanges = computeNumColumnsReadFromFile();
> perHostScanRanges = numColumnsReadFromFile;
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1315
PS20, Line 1315: idealScanRangeReservation_
> is there any way, for debugging, to see this value? should there be?
Not right now. I thought about adding it to the explain output. It would be mildly useful for me personally for debugging during development, but I wasn't sure if it was useful enough in production to justify adding more clutter to verbose plans. I don't think it answers either of the most important questions: "how much memory does the scan need?" and "how much memory does the scan need to run fast?"

http://gerrit.cloudera.org:8080/#/c/8966/20/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1335
PS20, Line 1335: // per column in the backend. Row-oriented formats issue one scan range per file.
> The row-oriented part is not completely accurate right? The BaseSequenceSca
Yeah, it was missing the qualifier "at a time". All of them can issue multiple scan ranges to read past the end of the input split.
http://gerrit.cloudera.org:8080/#/c/8966/20/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1343
PS20, Line 1343: maxScanRangeBytes_
> Precondition that this is >= 0
Done

http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
File testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test:

http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test@959
PS20, Line 959: # Parquet nested types, left nested in scan - should reserve memory for nested column
I came up with a couple more permutations on the nested types tests after looking at the code again:
* materializing a nested column inside a collection, but no top-level column
* materializing two levels of nesting
I also think the DISTRIBUTEDPLAN and PARALLELPLANS sections aren't that useful for all the permutations of the scan tests, so I removed a lot of them to reduce the size of the test file.

http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-query/queries/QueryTest/admission-reject-min-reservation.test
File testdata/workloads/functional-query/queries/QueryTest/admission-reject-min-reservation.test:

http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-query/queries/QueryTest/admission-reject-min-reservation.test@8
PS20, Line 8: 100
> why did this decrease?
Relative to previous patchsets? The 256KB min reservation per column previously came into play, since the files involved are very small. Now that the min reservation is 8KB, the reservation decreased.
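The decrease in that reply follows from back-of-the-envelope arithmetic: the minimum scan reservation scales with the number of columns times the per-column minimum (cf. the `reservation_to_distribute < min_buffer_size * column_readers.size()` check quoted earlier). The column count below is made up for illustration; only the 256KB-to-8KB change comes from the email.

```python
# Rough model of the minimum scan reservation discussed above.
# The 3-column scan is an illustrative assumption.
KB = 1024

def min_scan_reservation(num_columns, per_column_min):
    # At least one minimum-sized buffer per column reader.
    return num_columns * per_column_min

before = min_scan_reservation(3, 256 * KB)  # old 256KB per-column minimum
after = min_scan_reservation(3, 8 * KB)     # new 8KB per-column minimum
assert before // after == 32                # a 32x drop, whatever the column count
```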
http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-query/queries/QueryTest/admission-reject-min-reservation.test@32
PS20, Line 32:
> and these two
See above

http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-query/queries/QueryTest/analytic-fns.test
File testdata/workloads/functional-query/queries/QueryTest/analytic-fns.test:

http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-query/queries/QueryTest/analytic-fns.test@1587
PS20, Line 1587: 7MB extra for I/O buffers)
> how did you determine that? is that the min reservation computed by the new
Yeah, based on the plan. I hadn't updated this since the previous changes - it's now a much lower value.

http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-query/queries/QueryTest/codegen-mem-limit.test
File testdata/workloads/functional-query/queries/QueryTest/codegen-mem-limit.test:

http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-query/queries/QueryTest/codegen-mem-limit.test@6
PS20, Line 6: with t as (values(1), (2), (3), (4)) select min(t1.`1` + t2.`1`) from t t1 join t t2;
> what the reason behind this change?
The old query now fails to get reservation and fails earlier than before, instead of during codegen as intended. I needed a query without a scan in it to actually hit "memory limit exceeded" at the right place.

http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
File testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test:

http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test@a165
PS20, Line 165:
> don't we still want to do that?
I filed IMPALA-6560 to do that. It might need some specially crafted data now to do it non-flakily.
http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-query/queries/QueryTest/spilling-no-debug-action.test
File testdata/workloads/functional-query/queries/QueryTest/spilling-no-debug-action.test:

http://gerrit.cloudera.org:8080/#/c/8966/20/testdata/workloads/functional-query/queries/QueryTest/spilling-no-debug-action.test@7
PS20, Line 7: set debug_action="-1:OPEN:SET_DENY_RESERVATION_PROBABILITY@1.0";
> give the name of the file, this is surprising. Can we add a file level comm
Updated comments here and in Python to clarify.

http://gerrit.cloudera.org:8080/#/c/8966/20/tests/query_test/test_query_mem_limit.py
File tests/query_test/test_query_mem_limit.py:

http://gerrit.cloudera.org:8080/#/c/8966/20/tests/query_test/test_query_mem_limit.py@122
PS20, Line 122: num_nodes=1
> why do we do that?
Added a comment. Without it, the fragments raced to allocate memory and it usually failed somewhere else.

--
To view, visit http://gerrit.cloudera.org:8080/8966
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic09c6196b31e55b301df45cc56d0b72cfece6786
Gerrit-Change-Number: 8966
Gerrit-PatchSet: 20
Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bikramjeet....@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Thu, 22 Feb 2018 01:13:13 +0000
Gerrit-HasComments: Yes