[kudu-CR] KUDU-3056: Reduce HdrHistogramAccumulator overhead

Adar Dembo (Code Review) Thu, 20 Feb 2020 09:43:15 -0800

Adar Dembo has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15254 )


Change subject: KUDU-3056: Reduce HdrHistogramAccumulator overhead
......................................................................


Patch Set 1:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/15254/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15254/1//COMMIT_MSG@19
PS1, Line 19: adjust
adjusts


http://gerrit.cloudera.org:8080/#/c/15254/1//COMMIT_MSG@24
PS1, Line 24: Last
Lastly


http://gerrit.cloudera.org:8080/#/c/15254/1//COMMIT_MSG@25
PS1, Line 25: The result is relatively similar
            : output in the Spark accumulator with a significantly smaller
            : histogram.
I took a quick look at the HdrHistograms we use server-side. Most use a 
num_sig_digits of 2, and those that use 3 are server-level (rather than 
tablet-level) so that the increase in memory footprint doesn't scale with 
server load.

For context, here's what Will said about the initial HdrHistogramAccumulator 
sizing:

    HdrHistograms need to be shipped between executors and the driver, so
    their (serialized) size is relevant. Spark users differ in how they
    serialize, so I didn't put much effort into estimating the serialized
    size, but based on the conservative formula in [1] the in-memory size of
    a histogram with 3 significant value digits and storing longs is 4MiB or
    so. That only happens if the histogram is storing values from 1 to the
    max trackable long value, which is Long.MAX / 2. More realistically,
    the values in the duration histogram should be at most 86400 * 1000, the
    number of milliseconds in a day, and usually much, much smaller. For
    that range of values, the max footprint is 1MiB. That should be a safe
    amount of data to ship about semi-frequently along with all the Kudu
    data (and I'm not counting potential compression as part of
    serialization).

    [1]: https://github.com/HdrHistogram/HdrHistogram#footprint-estimation

Sounds like maybe he didn't factor in the number of executors that may be in 
use?


http://gerrit.cloudera.org:8080/#/c/15254/1//COMMIT_MSG@30
PS1, Line 30: implimentation
implementation


http://gerrit.cloudera.org:8080/#/c/15254/1//COMMIT_MSG@32
PS1, Line 32: not
drop this


http://gerrit.cloudera.org:8080/#/c/15254/1//COMMIT_MSG@53
PS1, Line 53: to
"so as to"


http://gerrit.cloudera.org:8080/#/c/15254/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala
File 
java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala:

http://gerrit.cloudera.org:8080/#/c/15254/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala@47
PS1, Line 47: histogram
Why don't we need .copy() here any more?



--
To view, visit http://gerrit.cloudera.org:8080/15254
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic7c2a33bc61a2baa38703ea3340a07e06ab39db3
Gerrit-Change-Number: 15254
Gerrit-PatchSet: 1
Gerrit-Owner: Grant Henke <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Hao Hao <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Thu, 20 Feb 2020 17:42:48 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-3056: Reduce HdrHistogramAccumulator overhead

Reply via email to