Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/15254 )
Change subject: KUDU-3056: Reduce HdrHistogramAccumulator overhead ...................................................................... Patch Set 1: (7 comments) http://gerrit.cloudera.org:8080/#/c/15254/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15254/1//COMMIT_MSG@19 PS1, Line 19: adjust adjusts http://gerrit.cloudera.org:8080/#/c/15254/1//COMMIT_MSG@24 PS1, Line 24: Last Lastly http://gerrit.cloudera.org:8080/#/c/15254/1//COMMIT_MSG@25 PS1, Line 25: The result is relatively similar : output in the Spark accumulator with a significantly smaller : histogram. I took a quick look at the HdrHistograms we use server-side. Most use a num_sig_digits of 2, and those that use 3 are server-level (rather than tablet-level) so that the increase in memory footprint doesn't scale with server load. For context, here's what Will said about the initial HdrHistogramAccumulator sizing: HdrHistograms need to be shipped between executors and the driver, so their (serialized) size is relevant. Spark users differ in how they serialize, so I didn't put much effort into estimating the serialized size, but based on the conservative formula in [1] the in-memory size of a histogram with 3 significant value digits and storing longs is 4MiB or so. That only happens if the histogram is storing values from 1 to the max trackable long value, which is Long.MAX / 2. More realistically, the values in the duration histogram should be at most 86400 * 1000, the number of milliseconds in a day, and usually much, much smaller. For that range of values, the max footprint is 1MiB. That should be a safe amount of data to ship about semi-frequently along with all the Kudu data (and I'm not counting potential compression as part of serialization). [1]: https://github.com/HdrHistogram/HdrHistogram#footprint-estimation Sounds like maybe he didn't factor in the number of executors that may be in use? http://gerrit.cloudera.org:8080/#/c/15254/1//COMMIT_MSG@30 PS1, Line 30: implimentation implementation http://gerrit.cloudera.org:8080/#/c/15254/1//COMMIT_MSG@32 PS1, Line 32: not drop this http://gerrit.cloudera.org:8080/#/c/15254/1//COMMIT_MSG@53 PS1, Line 53: to "so as to" http://gerrit.cloudera.org:8080/#/c/15254/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala: http://gerrit.cloudera.org:8080/#/c/15254/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala@47 PS1, Line 47: histogram Why don't we need .copy() here any more? -- To view, visit http://gerrit.cloudera.org:8080/15254 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic7c2a33bc61a2baa38703ea3340a07e06ab39db3 Gerrit-Change-Number: 15254 Gerrit-PatchSet: 1 Gerrit-Owner: Grant Henke <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Hao Hao <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Comment-Date: Thu, 20 Feb 2020 17:42:48 +0000 Gerrit-HasComments: Yes
