Re: Re: [DISCUSS] KIP-607: Add Metrics to Record the Memory Used by RocksDB to Kafka Streams

Sophie Blee-Goldman Thu, 23 Jul 2020 11:47:26 -0700

Guozhang,

Just to clarify, the "heap" for all these objects is actually the C++ heap,
not the JVM heap. So in the words of a Java application these would
 all be considered "off-heap", right?


(Of course there are some pointers in the Java heap to the off-heap
objects but that size is trivial compared to the actual objects)

Sorry for being pedantic. I just happen to know this is a question that
gets frequently asked so it's probably good to be as clear  as possible
in the KIP/metrics description.

Also, can you clarify a bit what you mean by hard bounded vs soft bounded?
For example, my impression is that the memtables are actually not hard
bounded at all, while the block cache is soft bounded by default but can
be configured to be hard bounded. And obviously the OS cache is not
exactly bounded but it shouldn't cause you to run out of usable memory
(like the memtables for example might, and have). But I think maybe you're
using a different definition of hard/soft bounded than I'm thinking of

On Thu, Jul 23, 2020 at 8:07 AM Guozhang Wang <wangg...@gmail.com> wrote:

> Thanks Bruno, they made sense to me.
>
> Regarding the last comment, my main reasoning is that it's better to
> explain to users the rocksDB memory usage and link the metrics to different
> categories.
>
> Just to kick off this (and also asking for correction of my own
> understanding :) here's what I read from the metrics:
>
> 1. Memtables (aka writer buffers, AND read buffers for iterators which
> would pin the immutable memtables from flushing). It is allocated on-heap
> and hard-bounded (via memtable_size * max_num_memtables).
>
>    - cur-size-active-mem-table: active
>    - cur-size-all-mem-tables: active + unflushed write
>    - size-all-mem-tables: active + unflushed write + pinned read
>
> 2. Block cache (used for merging / compaction, reads). Allocated on-heap
> and soft-bounded.
>
>    - block-cache-usage: compaction + read
>    - block-cache-pinned-usage: read
>
> 3. OS cache (read buffer), which is the memory usage for filters / indices
> that are outside block cache. Allocated off-heap and not bounded at all.
>
>    - estimate-table-readers-mem
>
>
> The total memory usage (on-heap and off-heap) is "size-all-mem-tables" +
> "block-cache-usage" + "estimate-table-readers-mem".
>
> Is that right?
>
>
> On Wed, Jul 22, 2020 at 4:28 AM Bruno Cadonna <br...@confluent.io> wrote:
>
> > Hi Guozhang,
> >
> > Thank you for your feedback!
> >
> > I answered inline.
> >
> > Best,
> > Bruno
> >
> >
> > On 21.07.20 00:39, Guozhang Wang wrote:
> > > Hello Bruno,
> > >
> > > Thanks for the updated KIP. I made a pass and here are some comments:
> > >
> > > 1) What's the motivation of keeping it as INFO while KIP-471 metrics
> are
> > > defined in DEBUG?
> > >
> >
> > The motivation was that the metrics in this KIP do not incur any
> > performance overhead other than reading out the properties from RocksDB.
> > For this metrics RocksDB does not need to maintain anything
> > additionally. In contrast, for the metrics in KIP-471 RocksDB needs to
> > maintain the statistics object we pass to it and we also need to switch
> > on a certain statistics level. So, I thought the metrics in this KIP are
> > suited to be used in production and therefore can be reported on INFO
> > level.
> >
> > > 2) Some namings are a bit inconsistent with others and with KIP-471,
> for
> > > example:
> >
> > I am aware of the inconsistencies. I took the names from this list in
> > the RocksDB repo
> >
> >
> https://github.com/facebook/rocksdb/blob/b9a4a10659969c71e6f6eab4e4bae8c36ede919f/include/rocksdb/db.h#L654-L686
> > (with prefix "rocksdb." ripped off). In this way users do not need to
> > look up or memorize a mapping between our metrics and the RocksDB
> > properties. To be clear, those are public RocksDB properties.
> >
> > >
> > > 2.a) KIP-471 uses "memtable" while in this KIP we use "mem-table", also
> > the
> > > "memtable" is prefixed and then the metric name. I'd suggest we keep
> them
> > > consistent. e.g. "num-immutable-mem-table" =>
> "immutable-memtable-count",
> > > "cur-size-active-mem-table" => "active-memable-bytes"
> > >
> > > 2.b) "immutable" are abbreviated as "imm" in some names but not in
> > others,
> > > I'd suggest we do not use abbreviations across all names,
> > > e.g. "num-entries-imm-mem-tables" => "immutable-memtable-num-entries".
> > >
> > > 2.c) "-size" "-num" semantics is usually a bit unclear, and I'd suggest
> > we
> > > just more concrete terms, e.g. "total-sst-files-size" =>
> > > "total-sst-files-bytes", "num-live-versions" => "live-versions-count",
> > > "background-errors" => "background-errors-count".
> > >
> > > 3) Some metrics are a bit confusing, e.g.
> > >
> > > 3.a) What's the difference between "cur-size-all-mem-tables" and
> > > "size-all-mem-tables"?
> > >
> >
> > cur-size-all-mem-tables records the approximate size of active and
> > unflushed immutable memtable. Unflushed immutable memtables are
> > memtables that are not yet flushed by the asynchronous flushing
> > mechanism in RocksDB.
> >
> > size-all-mem-tables records the sizes recorded in
> > cur-size-all-mem-tables but additionally also records pinned immutable
> > memtables that that are kept in memory to maintain write history in
> memory.
> >
> > As far as I understood those are memtables that are flushed but there
> > are still table readers (e.g. iterators) that use those memtables.
> >
> > I added a sentence to explain the difference.
> >
> > I guess it is worthwhile to have both of these metrics because if
> > size-all-mem-tables keeps increasing and cur-size-all-mem-tables not
> > there may be an issue with the clean-up of table readers.
> >
> > > 3.b) And the explanation of "estimate-table-readers-mem" does not read
> > very
> > > clear to me either, does it refer to
> > "estimate-sst-file-read-buffer-bytes"?
> > >
> >
> > No, this metric records the memory used by iterators as well as filters
> > and indices if the filters and indices are not maintained in the block
> > cache. Basically this metric reports the memory used outside the block
> > cache to read data. I modified the description to make it clearer.
> >
> > > 3.c) How does "estimate-oldest-key-time" help with memory usage
> > debugging?
> >
> > I do not consider this KIP to only help with monitoring of memory usage.
> > I thought to expose all RocksDB properties that return an integer and
> > that make sense for Kafka Streams.
> > Admittedly, I did a bad job in the current KIP to explain this in the
> > motivation.
> >
> >
> > >
> > > 4) For my own education, does "estimate-pending-compaction-bytes"
> capture
> > > all the memory usage for compaction buffers?
> > >
> >
> > No, as far as I understand, this metric refers to bytes rewritten on
> > disk. Basically, metric relates to the write amplification for level
> > compaction. I changed the description.
> >
> > > 5) This is just of a nit comment to help readers better understand
> > rocksDB:
> > > maybe we can explain in the wiki doc which part of rocksDB uses memory
> > > (block cache, OS cache, memtable, compaction buffer, read buffer), and
> > > which of them are on-heap and wich of them are off-heap, which can be
> > hard
> > > bounded and which can only be soft bounded and which cannot be bounded
> at
> > > all, etc.
> > >
> >
> > Good idea! Will look into it!
> >
> > >
> > > Guozhang
> > >
> > >
> > > On Mon, Jul 20, 2020 at 11:00 AM Bruno Cadonna <br...@confluent.io>
> > wrote:
> > >
> > >> Hi,
> > >>
> > >> During the implementation of this KIP and after some discussion about
> > >> RocksDB metrics, I decided to make some major modifications to this
> KIP
> > >> and kick off discussion again.
> > >>
> > >>
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-607%3A+Add+Metrics+to+Kafka+Streams+to+Report+Properties+of+RocksDB
> > >>
> > >> Best,
> > >> Bruno
> > >>
> > >> On 15.05.20 17:11, Bill Bejeck wrote:
> > >>> Thanks for the KIP, Bruno. Having sensible, easy to access RocksDB
> > memory
> > >>> reporting will be a welcomed addition.
> > >>>
> > >>> FWIW I also agree to have the metrics reported on a store level. I'm
> > glad
> > >>> you changed the KIP to that effect.
> > >>>
> > >>> -Bill
> > >>>
> > >>>
> > >>>
> > >>> On Wed, May 13, 2020 at 6:24 PM Guozhang Wang <wangg...@gmail.com>
> > >> wrote:
> > >>>
> > >>>> Hi Bruno,
> > >>>>
> > >>>> Sounds good to me.
> > >>>>
> > >>>> I think I'm just a bit more curious to see its impact on
> performance:
> > as
> > >>>> long as we have one INFO level rocksDB metrics, then we'd have to
> turn
> > >> on
> > >>>> the scheduled rocksdb metrics recorder whereas previously, we can
> > >> decide to
> > >>>> not turn on the recorder at all if all are set as DEBUG and we
> > >> configure at
> > >>>> INFO level in production. But this is an implementation detail
> anyways
> > >> and
> > >>>> maybe the impact is negligible after all. We can check and
> re-discuss
> > >> this
> > >>>> afterwards :)
> > >>>>
> > >>>>
> > >>>> Guozhang
> > >>>>
> > >>>>
> > >>>> On Wed, May 13, 2020 at 9:34 AM Sophie Blee-Goldman <
> > >> sop...@confluent.io>
> > >>>> wrote:
> > >>>>
> > >>>>> Thanks Bruno! I took a look at the revised KIP and it looks good to
> > me.
> > >>>>>
> > >>>>> Sophie
> > >>>>>
> > >>>>> On Wed, May 13, 2020 at 6:59 AM Bruno Cadonna <br...@confluent.io>
> > >>>> wrote:
> > >>>>>
> > >>>>>> Hi John,
> > >>>>>>
> > >>>>>> Thank you for the feedback!
> > >>>>>>
> > >>>>>> I agree and I will change the KIP as I stated in my previous
> e-mail
> > to
> > >>>>>> Guozhang.
> > >>>>>>
> > >>>>>> Best,
> > >>>>>> Bruno
> > >>>>>>
> > >>>>>> On Tue, May 12, 2020 at 3:07 AM John Roesler <vvcep...@apache.org
> >
> > >>>>> wrote:
> > >>>>>>>
> > >>>>>>> Thanks, all.
> > >>>>>>>
> > >>>>>>> If you don’t mind, I’ll pitch in a few cents’ worth.
> > >>>>>>>
> > >>>>>>> In my life I’ve generally found more granular metrics to be more
> > >>>>> useful,
> > >>>>>> as long as there’s a sane way to roll them up. It does seem nice
> to
> > >> see
> > >>>>> it
> > >>>>>> on the per-store level. For roll-up purposes, the task and thread
> > tags
> > >>>>>> should be sufficient.
> > >>>>>>>
> > >>>>>>> I think the only reason we make some metrics Debug is that
> > >>>> _recording_
> > >>>>>> them can be expensive. If there’s no added expense, I think we can
> > >> just
> > >>>>>> register store-level metrics at Info level.
> > >>>>>>>
> > >>>>>>> Thanks for the KIP, Bruno!
> > >>>>>>> -John
> > >>>>>>>
> > >>>>>>> On Mon, May 11, 2020, at 17:32, Guozhang Wang wrote:
> > >>>>>>>> Hello Sophie / Bruno,
> > >>>>>>>>
> > >>>>>>>> I've also thought about the leveling question, and one
> motivation
> > I
> > >>>>>> had for
> > >>>>>>>> setting it in instance-level is that we want to expose it in
> INFO
> > >>>>>> level:
> > >>>>>>>> today our report leveling is not very finer grained --- which I
> > >>>> think
> > >>>>>> is
> > >>>>>>>> sth. worth itself --- such that one have to either turn on all
> > >>>> DEBUG
> > >>>>>>>> metrics recording or none of them. If we can allow users to e.g.
> > >>>>>> specify
> > >>>>>>>> "turn on streams-metrics and stream-state-metrics, but not
> others"
> > >>>>> and
> > >>>>>> then
> > >>>>>>>> I think it should be just at store-level. However, right now if
> we
> > >>>>>> want to
> > >>>>>>>> set it as store-level then it would be DEBUG and not exposed by
> > >>>>>> default.
> > >>>>>>>>
> > >>>>>>>> So it seems we have several options in addition to the proposed
> > >>>> one:
> > >>>>>>>>
> > >>>>>>>> a) we set it at store-level as INFO; but then one can argue why
> > >>>> this
> > >>>>> is
> > >>>>>>>> INFO while others (bytes-written, etc) are DEBUG.
> > >>>>>>>> b) we set it at store-level as DEBUG, believing that we do not
> > >>>>> usually
> > >>>>>> need
> > >>>>>>>> to turn it on.
> > >>>>>>>> c) maybe, we can set it at task-level (? I'm not so sure myself
> > >>>> about
> > >>>>>>>> this.. :P) as INFO.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Guozhang
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Mon, May 11, 2020 at 12:29 PM Sophie Blee-Goldman <
> > >>>>>> sop...@confluent.io>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Hey Bruno,
> > >>>>>>>>>
> > >>>>>>>>> Thanks for the KIP! I have one high-level concern, which is
> that
> > >>>> we
> > >>>>>> should
> > >>>>>>>>> consider
> > >>>>>>>>> reporting these metrics on the per-store level rather than
> > >>>>>> instance-wide. I
> > >>>>>>>>> know I was
> > >>>>>>>>> the one who first proposed making it instance-wide, so bear
> with
> > >>>>> me:
> > >>>>>>>>>
> > >>>>>>>>> While I would still argue that the instance-wide memory usage
> is
> > >>>>>> probably
> > >>>>>>>>> the most *useful*,
> > >>>>>>>>> exposing them at the store-level does not prevent users from
> > >>>>>> monitoring the
> > >>>>>>>>> instance-wide
> > >>>>>>>>> memory. They should be able to roll up all the store-level
> > >>>> metrics
> > >>>>>> on an
> > >>>>>>>>> instance to
> > >>>>>>>>> compute the total off-heap memory. But rolling it up for the
> > >>>> users
> > >>>>>> does
> > >>>>>>>>> prevent them from
> > >>>>>>>>> using this to debug rare cases where one store may be using
> > >>>>>> significantly
> > >>>>>>>>> more memory than
> > >>>>>>>>> expected.
> > >>>>>>>>>
> > >>>>>>>>> It's also worth considering that some users may be using the
> > >>>>> bounded
> > >>>>>> memory
> > >>>>>>>>> config setter
> > >>>>>>>>> to put a cap on the off-heap memory of the entire process, in
> > >>>> which
> > >>>>>> case
> > >>>>>>>>> the memory usage
> > >>>>>>>>> metric for any one store should reflect the memory usage of the
> > >>>>>> entire
> > >>>>>>>>> instance. In that case
> > >>>>>>>>> any effort to roll up the memory usages ourselves would just be
> > >>>>>> wasted.
> > >>>>>>>>>
> > >>>>>>>>> Sorry for the reversal, but after a second thought I'm pretty
> > >>>>>> strongly in
> > >>>>>>>>> favor of reporting these
> > >>>>>>>>> at the store level.
> > >>>>>>>>>
> > >>>>>>>>> Best,
> > >>>>>>>>> Sophie
> > >>>>>>>>>
> > >>>>>>>>> On Wed, May 6, 2020 at 8:41 AM Bruno Cadonna <
> br...@confluent.io
> > >>>>>
> > >>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hi all,
> > >>>>>>>>>>
> > >>>>>>>>>> I'd like to discuss KIP-607 that aims to add RocksDB memory
> > >>>> usage
> > >>>>>>>>>> metrics to Kafka Streams.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-607%3A+Add+Metrics+to+Record+the+Memory+Used+by+RocksDB+to+Kafka+Streams
> > >>>>>>>>>>
> > >>>>>>>>>> Best,
> > >>>>>>>>>> Bruno
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> --
> > >>>>>>>> -- Guozhang
> > >>>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> -- Guozhang
> > >>>>
> > >>>
> > >>
> > >
> > >
> >
>
>
> --
> -- Guozhang
>

Re: Re: [DISCUSS] KIP-607: Add Metrics to Record the Memory Used by RocksDB to Kafka Streams

Reply via email to