Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9757 )

Change subject: IMPALA-6679,IMPALA-6678: reduce scan reservation
......................................................................


Patch Set 6:

(35 comments)

I desperately need to rebase to be able to run tests (I'm running into pypi 
issues). I haven't addressed Bikram's requests about added tests yet, but I've 
addressed everything else, so I'll push out my current changes, rebase, and 
then do another pass over the comments to make sure that I addressed everything.

http://gerrit.cloudera.org:8080/#/c/9757/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9757/6//COMMIT_MSG@37
PS6, Line 37: ideal reservation
> above you say we no longer need a concept of ideal reservation -- so what i
We still have the concept of an ideal reservation, it's just computed once we 
know the file layout rather than estimated in the planner. I didn't intend the 
above comment to mean that; I've reworded it to reduce the ambiguity.


http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-parquet-scanner-test.cc
File be/src/exec/hdfs-parquet-scanner-test.cc:

http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-parquet-scanner-test.cc@45
PS6, Line 45: // Should round up to nearest power-of-two buffer size if < max scan range buffer
> maybe add a case that is somewhere in between min and max buffer
Added an extra test. I think MIN_BUFFER_SIZE + 2 sort-of covered this but the 
extra coverage doesn't hurt.


http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-parquet-scanner.cc@1699
PS6, Line 1699: ideal_reservation += min(BitUtil::RoundUpToPowerOf2(len, max_buffer_size),
              :           DiskIoMgr::IDEAL_MAX_SIZED_BUFFERS_PER_SCAN_RANGE * max_buffer_size);
> maybe mention here briefly why we increment in steps of max_buffer_size if
I realised that this is duplicating a calculation from HdfsScanNodeBase. I 
factored it into DiskIoMgr since it's closely tied to 
IDEAL_MAX_SIZED_BUFFERS_PER_SCAN_RANGE and other details of the IoMgr's buffer 
management.


http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-scan-node-base.cc@527
PS6, Line 527: int64_t target = BitUtil::RoundUpToPowerOf2(*reservation, max_buffer_size);
             :       if (*reservation == target) {
             :         // If it's already an I/O buffer multiple, increase by an I/O buffer or to ideal.
             :         target = min(*reservation + max_buffer_size, ideal_scan_range_reservation);
             :       }
> how about
That's a nicer way to express it.
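Something along these lines, i.e. collapsing the two cases into one expression 
(a sketch with assumed names, not necessarily the exact wording suggested):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Round 'value' up to the nearest multiple of the power-of-two 'factor'.
int64_t RoundUpToPowerOf2(int64_t value, int64_t factor) {
  return (value + factor - 1) & ~(factor - 1);
}

// Rounding up 'reservation + 1' bumps an exact buffer multiple by one whole
// buffer and rounds anything else up to the next multiple; the ideal
// reservation caps the result either way.
int64_t NextReservationTarget(int64_t reservation, int64_t max_buffer_size,
                              int64_t ideal_reservation) {
  return std::min(RoundUpToPowerOf2(reservation + 1, max_buffer_size),
                  ideal_reservation);
}
```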


http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-scan-node.h
File be/src/exec/hdfs-scan-node.h:

http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-scan-node.h@178
PS6, Line 178:   /// this thread with DeductReservationForScannerThread().
> maybe mention that it releases Reservation and only holds on to minimum res
Good point. I referenced ReturnReservationFromScannerThread() there rather than 
duplicating the description of its behaviour.


http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-scan-node.h@204
PS6, Line 204: or
> nit: return all
Done


http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-scan-node.cc@226
PS6, Line 226: spare_reservation_
> what if between line 225 and 226, some scanner thread releases its reservat
We're actually holding the lock everywhere we touch spare_reservation_, so the 
race isn't possible. Given that, it's pointless for spare_reservation_ to be an 
atomic, so I'll fix that too.
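To be concrete, the post-fix shape would be roughly the following (class and 
member names are assumed for illustration):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <mutex>

// Because every access to spare_reservation_ happens while holding lock_, a
// plain int64_t is sufficient; an atomic would add overhead without adding
// any safety.
class ScanNodeSketch {
 public:
  // Return 'bytes' of reservation from a scanner thread to the spare pool.
  void ReturnReservation(int64_t bytes) {
    std::lock_guard<std::mutex> l(lock_);
    spare_reservation_ += bytes;
  }

  // Deduct up to 'bytes' from the spare pool; returns the amount deducted.
  int64_t DeductReservation(int64_t bytes) {
    std::lock_guard<std::mutex> l(lock_);
    int64_t deducted = std::min(bytes, spare_reservation_);
    spare_reservation_ -= deducted;
    return deducted;
  }

 private:
  std::mutex lock_;
  int64_t spare_reservation_ = 0;  // Guarded by lock_; no atomic needed.
};
```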


http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-scan-node.cc@392
PS6, Line 392: // Take a snapshot of num_unqueued_files_ before calling GetNextUnstartedRange().
             :     // We don't want num_unqueued_files_ to go to zero between the return from
             :     // GetNextUnstartedRange() and the check for when all ranges are complete.
> update comment: replace GetNextUnstartedRange with StartNextScanRange.
Done. Below it seemed better to describe the logical condition rather than the 
NULLness of the pointer.


http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-scan-node.cc@481
PS6, Line 481: *scanner_thread_reservation, partition, filter_ctxs, expr_results_pool);
             :   context.AddStream(scan_range, *scanner_thread_reservation);
> reservation passed twice. can we remove redundancy by using context->total_
There are multiple streams per reservation so it's not generally true that the 
stream's reservation is equal to the total reservation.


http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/hdfs-scan-node.cc@518
PS6, Line 518: // Reservation may have been increased by the scanner.
> maybe mention, that this happens only in case of column formats like parque
I don't think this is the right place to document global assumptions about what 
scanner subclasses can/can't do, but it is helpful to add an example to make it 
clearer why it may change.


http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/scanner-context.h
File be/src/exec/scanner-context.h:

http://gerrit.cloudera.org:8080/#/c/9757/6/be/src/exec/scanner-context.h@379
PS6, Line 379: o
> nit: to allow
Done


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@150
PS6, Line 150:   // Default reservation for a backend scan range for a column in columnar formats like
> should we mention the units just to ward off any ambiguity?
Seems reasonable to do. I think we've been consistent about expressing 
reservation in bytes, so it doesn't seem strictly necessary.


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@150
PS6, Line 150: backend
> IOMgr is the terminology we use below (and I think that's clearer).
Done


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@154
PS6, Line 154: // TODO: is it worth making this a tunable query option?
> not a huge fan of adding more knobs that the user doesnt know how to tune.
I think it would just be a safety valve if the planner made a bad decision.


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@157
PS6, Line 157: FOOTER_SIZE
> maybe name both the same so that a grep will also find both.
Done


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@158
PS6, Line 158: FOOTER_READ_SIZE
> nit: how about PARQUET_FOOTER_READ_SIZE, just to make it explicit
I'm actually going to have to update this anyway when I rebase since the ORC 
patch uses the same number for the ORC footer, so I'll wait until the rebase to 
adjust this.


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@752
PS6, Line 752: maxScanRangeBytes_
> confusing at first when you look at this and maxScanRangeLength,
Renamed to scanRangeBytesLimit and largestScanRangeBytes


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1264
PS6, Line 1264: fileFormats_.contains(HdfsFileFormat.PARQUET)
> Do we have enough testing for the multiple format case when it comes to res
Probably not... we only have one table, which is very basic, and there's a 
single planner test. I moved the end-to-end test to test_scanners.py and run it 
with some more test dimensions. It seems like we need a larger table that 
includes some parquet files though.

I filed IMPALA-6874 to keep track of this. It would be good to add some more 
coverage to exercise this code but might be best to not roll it into this 
change. I'll start working on that in a separate patchset.


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1264
PS6, Line 1264: fileFormats_.contains(HdfsFileFormat.PARQUET)
> will estimates be biased towards parquet if there are multiple formats?
The Parquet estimates will generally be higher, so I think this is 
conservative. We could make it more accurate in some cases by factoring in the 
numbers of files of different formats and their sizes but I don't think the 
mixed format case is common enough to justify it.


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1320
PS6, Line 1320: Bound the reservation
              :    *  based on:
> is the following list a lower bound, upper bound, or both?
Restructured to separate upper and lower bounds.


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1326
PS6, Line 1326: .
> and is available. or .. try to increase ..
Done


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1337
PS6, Line 1337: columnReservations != null
> since != and == have same precedence, let's parenthesize this for clarity.
Done


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1373
PS6, Line 1373:
> memory
Done


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1374
PS6, Line 1374:  Returns one value per column.
> like you say below, these values aren't yet quantized the buffer sizes, rig
Updated comment.


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1378
PS6, Line 1378:    * logic for non-shredded columnar formats if added.
> generally, do we need to worry about orc for this patch?
I haven't rebased onto ORC yet, so not for the current patchset.

I think I need to make a bit more progress on that before I know exactly what 
changes. With the current state of the ORC patch, it only does one I/O at a 
time so the Parquet logic doesn't apply. I'm going to take a look to see if we 
can switch ORC to using multiple parallel scan ranges, in which case the logic 
will be the same as Parquet for non-nested cases, which is all we support for 
now.


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1394
PS6, Line 1394:             // Not a top-level column, e.g. a nested collection - no stats available.
> what does this case mean? scalar but not top level but also different than
This is the case where the value is logically inside a nested collection but is 
being unnested into a top-level slot by the scan. Updated the comment to be 
hopefully clearer.


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1417
PS6, Line 1417: compute
> since this one appends (rather than returns like the other functions) maybe
Done


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1440
PS6, Line 1440: column
> scalar column?
Done


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1443
PS6, Line 1443: estimates
> nit: repeated word
Done


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1444
PS6, Line 1444: In that case the reservation
              :    * may still be usable for a different column.
> you mean in the case where we underestimate it for another column but try t
Yeah, I was trying to make the point that overestimates can be cancelled out by 
underestimates (or columns > 4MB where we reserve less than the column size). 
Without that, I think readers might reasonably assume that overestimates result 
in memory being wasted.

Tried to reword the comment to be a little clearer.


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1448
PS6, Line 1448: computeMinColumnReservation
> should this be computeMinColumnReservationForScalar() or similar?
Done


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1465
PS6, Line 1465: encodedDataSize
> nit: encodedDataByteSize
Done


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1469
PS6, Line 1469:     return estimate;
> have you verified how close this comes out to the actual size for various d
I mainly looked at TPC-H data, e.g. 
https://gist.github.com/timarmstrong/5df96c2c8d0b10f78e3c8e94362b5aee . It was 
very accurate for low-NDV columns, e.g. l_returnflag. l_orderkey threw it off 
because it's high-NDV but is highly compressible because the file is clustered 
by l_orderkey.

This is probably somewhat ideal, in that the data is uniformly distributed 
between files and fairly regular. For messier cases there's probably more of a 
tendency to overestimate.

I added stats counters for ideal and actual reservations for Parquet. Those 
should indicate if there's a systematic problem of over/under reservation for a 
scan.


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1472
PS6, Line 1472:   private static long roundUpToIoBuffer(long bytes, long maxIoBufferSize) {
> add a quick function comment.
Done


http://gerrit.cloudera.org:8080/#/c/9757/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1475
PS6, Line 1475: BitUtil.roundUpToPowerOf2Factor(bytes, maxIoBufferSize);
> why is it the right thing to always have a multiple of maxIoBufferSize in t
Updated the fn comment to provide some motivation.
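Roughly, the motivation is that a reservation should be quantized to buffer 
sizes the IoMgr can actually allocate, so no reserved bytes are unusable. A 
sketch of that behaviour (inferred for illustration, not the exact BitUtil 
implementation):

```cpp
#include <cassert>
#include <cstdint>

// Below the max buffer size, I/O buffers come in power-of-two sizes, so round
// up to the next power of two. At or above it, I/O proceeds in max-sized
// buffers, so round up to a multiple of max_io_buffer_size (a power of two).
int64_t RoundUpToIoBuffer(int64_t bytes, int64_t max_io_buffer_size) {
  if (bytes <= 0) return 0;
  if (bytes < max_io_buffer_size) {
    int64_t buffer = 1;
    while (buffer < bytes) buffer <<= 1;  // Next power-of-two buffer size.
    return buffer;
  }
  return (bytes + max_io_buffer_size - 1) & ~(max_io_buffer_size - 1);
}
```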



--
To view, visit http://gerrit.cloudera.org:8080/9757
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifc80e05118a9eef72cac8e2308418122e3ee0842
Gerrit-Change-Number: 9757
Gerrit-PatchSet: 6
Gerrit-Owner: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Bikramjeet Vig <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Wed, 18 Apr 2018 18:40:46 +0000
Gerrit-HasComments: Yes
