Thanks Bruno, that is clear. I think my misunderstanding was that I thought compaction is done by loading the SST files into the block cache (not the OS cache), doing the merge-sort, and then writing the result back to SST files. But on second thought I agree that this is not a use case well suited to caching anyway. Using non-cached OS memory is more reasonable.

Guozhang
On Mon, Jul 27, 2020 at 2:49 AM Bruno Cadonna <br...@confluent.io> wrote:
> Hi Guozhang,
>
> Do you mean compression or compaction?
>
> Regarding compression, I agree with you except for the merge-sorting part. The docs describing what is stored in the block cache can be found under https://github.com/facebook/rocksdb/wiki/Block-Cache.
>
> Regarding compaction, my statement in my previous e-mail about compaction not using the block cache was a guess. To get to the bottom of it, I asked somebody from RocksDB, and compaction does indeed not use the block cache. Compaction uses the OS to read in the data to compact, but it also uses fadvise to tell the kernel not to cache that data in the OS buffer cache.
>
> Hope that clears up the conflicts! ;-)
>
> Best,
> Bruno
>
> On 24.07.20 19:37, Guozhang Wang wrote:
> > Ack, thanks for the clarification folks! Yeah, I agree that from the JVM's point of view all RocksDB memory is basically off-heap (which makes operations harder, sigh..).
> >
> > Regarding the block cache, my understanding is that by default compressed blocks are in the OS page cache and uncompressed blocks are in the RocksDB block cache. In Streams, we do not use compression by default, so these data blocks would be read into the block cache for merge-sorting, while index / bloom filter / other compressed dictionary blocks are read into the OS cache by default. Obviously that conflicts with yours; maybe you can point me to the related docs?
> >
> > Guozhang
> >
> > On Fri, Jul 24, 2020 at 2:15 AM Bruno Cadonna <br...@confluent.io> wrote:
> >> Hi Guozhang and Sophie,
> >>
> >> 1)
> >> My understanding is also that the memtables are off-heap (as is almost every data structure in RocksDB).
> >>
> >> According to the docs, if after a write the size of the memtable exceeds the option write_buffer_size, the memtable is flushed. I would not call it hard-bounded, since it seems the memtable can exceed this size.
> >>
> >> Guozhang's other statements about memtables seem correct to me.
> >>
> >> 2)
> >> According to the docs, the block cache caches data in memory for reads. I do not think RocksDB uses the block cache for compaction, because that would mean each compaction would interfere with the cache replacement policy in use (LRU is the default in Streams). I suppose RocksDB uses the OS cache during compactions. So the block cache usage contains data blocks for reading, and it can also contain index blocks and filter blocks if configured accordingly (e.g. by using the BoundedMemoryRocksDBConfig described under https://kafka.apache.org/25/documentation/streams/developer-guide/memory-mgmt.html).
> >>
> >> The pinned usage are blocks pinned by table readers like iterators. The block cache can be soft or hard bounded. However, there is currently an open bug in RocksDB regarding hard-bounded block caches.
> >>
> >> 3)
> >> The statements seem correct.
> >>
> >> The total memory usage seems also correct.
> >>
> >> Best,
> >> Bruno
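A minimal sketch of the memory-bounding approach from the memory-mgmt page linked above (the class shape follows that page; the constants and limits here are illustrative assumptions, not recommendations):

    import java.util.Map;
    import org.apache.kafka.streams.state.RocksDBConfigSetter;
    import org.rocksdb.BlockBasedTableConfig;
    import org.rocksdb.Cache;
    import org.rocksdb.LRUCache;
    import org.rocksdb.Options;
    import org.rocksdb.WriteBufferManager;

    public class BoundedMemoryRocksDBConfig implements RocksDBConfigSetter {

        // Illustrative limits: 512 MB shared by block cache and memtables,
        // of which at most 128 MB may be used by memtables.
        private static final long TOTAL_OFF_HEAP_MEMORY = 512L * 1024 * 1024;
        private static final long TOTAL_MEMTABLE_MEMORY = 128L * 1024 * 1024;

        // One cache shared by all RocksDB instances of this Streams client.
        private static final Cache CACHE = new LRUCache(TOTAL_OFF_HEAP_MEMORY);
        // Counting memtable memory against the shared cache bounds both together.
        private static final WriteBufferManager WRITE_BUFFER_MANAGER =
            new WriteBufferManager(TOTAL_MEMTABLE_MEMORY, CACHE);

        @Override
        public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
            final BlockBasedTableConfig tableConfig = (BlockBasedTableConfig) options.tableFormatConfig();
            options.setWriteBufferManager(WRITE_BUFFER_MANAGER);
            tableConfig.setBlockCache(CACHE);
            // Put index and filter blocks into the block cache so that they
            // are counted against the bound as well.
            tableConfig.setCacheIndexAndFilterBlocks(true);
            options.setTableFormatConfig(tableConfig);
        }

        @Override
        public void close(final String storeName, final Options options) {
            // Cache and write buffer manager are shared; do not close them here.
        }
    }

The setter would be registered via the rocksdb.config.setter Streams config (StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG).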
> >> On 23.07.20 20:46, Sophie Blee-Goldman wrote:
> >>> Guozhang,
> >>>
> >>> Just to clarify, the "heap" for all these objects is actually the C++ heap, not the JVM heap. So in the words of a Java application these would all be considered "off-heap", right?
> >>>
> >>> (Of course there are some pointers in the Java heap to the off-heap objects, but that size is trivial compared to the actual objects.)
> >>>
> >>> Sorry for being pedantic. I just happen to know this is a question that gets asked frequently, so it's probably good to be as clear as possible in the KIP/metrics description.
> >>>
> >>> Also, can you clarify a bit what you mean by hard bounded vs soft bounded? For example, my impression is that the memtables are actually not hard bounded at all, while the block cache is soft bounded by default but can be configured to be hard bounded. And obviously the OS cache is not exactly bounded, but it shouldn't cause you to run out of usable memory (like the memtables, for example, might, and have). But I think maybe you're using a different definition of hard/soft bounded than I'm thinking of.
> >>>
> >>> On Thu, Jul 23, 2020 at 8:07 AM Guozhang Wang <wangg...@gmail.com> wrote:
> >>>> Thanks Bruno, they made sense to me.
> >>>>
> >>>> Regarding the last comment, my main reasoning is that it's better to explain to users the RocksDB memory usage and link the metrics to the different categories.
> >>>>
> >>>> Just to kick this off (and also asking for correction of my own understanding :) here's what I read from the metrics:
> >>>>
> >>>> 1. Memtables (aka write buffers, AND read buffers for iterators, which would pin the immutable memtables from flushing). They are allocated on-heap and hard-bounded (via memtable_size * max_num_memtables).
> >>>>
> >>>> - cur-size-active-mem-table: active
> >>>> - cur-size-all-mem-tables: active + unflushed write
> >>>> - size-all-mem-tables: active + unflushed write + pinned read
> >>>>
> >>>> 2. Block cache (used for merging / compaction, reads). Allocated on-heap and soft-bounded.
> >>>>
> >>>> - block-cache-usage: compaction + read
> >>>> - block-cache-pinned-usage: read
> >>>>
> >>>> 3. OS cache (read buffer), which is the memory usage for filters / indices that are outside the block cache. Allocated off-heap and not bounded at all.
> >>>>
> >>>> - estimate-table-readers-mem
> >>>>
> >>>> The total memory usage (on-heap and off-heap) is "size-all-mem-tables" + "block-cache-usage" + "estimate-table-readers-mem".
> >>>>
> >>>> Is that right?
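That total can be read out directly from a RocksDB instance via its properties; a minimal sketch assuming the RocksDB Java API (the class and helper names are made up for illustration):

    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    public final class RocksDBMemory {
        // Approximate total memory of one RocksDB instance, following the
        // formula above: memtables + block cache + table readers outside it.
        public static long approximateMemoryUsage(final RocksDB db) throws RocksDBException {
            final long memtables = Long.parseLong(db.getProperty("rocksdb.size-all-mem-tables"));
            final long blockCache = Long.parseLong(db.getProperty("rocksdb.block-cache-usage"));
            final long tableReaders = Long.parseLong(db.getProperty("rocksdb.estimate-table-readers-mem"));
            return memtables + blockCache + tableReaders;
        }
    }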
> >>>> On Wed, Jul 22, 2020 at 4:28 AM Bruno Cadonna <br...@confluent.io> wrote:
> >>>>> Hi Guozhang,
> >>>>>
> >>>>> Thank you for your feedback!
> >>>>>
> >>>>> I answered inline.
> >>>>>
> >>>>> Best,
> >>>>> Bruno
> >>>>>
> >>>>> On 21.07.20 00:39, Guozhang Wang wrote:
> >>>>>> Hello Bruno,
> >>>>>>
> >>>>>> Thanks for the updated KIP. I made a pass and here are some comments:
> >>>>>>
> >>>>>> 1) What's the motivation for keeping these at INFO while the KIP-471 metrics are defined at DEBUG?
> >>>>>
> >>>>> The motivation was that the metrics in this KIP do not incur any performance overhead other than reading out the properties from RocksDB. For these metrics RocksDB does not need to maintain anything additional. In contrast, for the metrics in KIP-471 RocksDB needs to maintain the statistics object we pass to it, and we also need to switch on a certain statistics level. So, I thought the metrics in this KIP are suited to be used in production and can therefore be reported on INFO level.
> >>>>>
> >>>>>> 2) Some namings are a bit inconsistent with others and with KIP-471, for example:
> >>>>>
> >>>>> I am aware of the inconsistencies. I took the names from this list in the RocksDB repo https://github.com/facebook/rocksdb/blob/b9a4a10659969c71e6f6eab4e4bae8c36ede919f/include/rocksdb/db.h#L654-L686 (with the prefix "rocksdb." stripped off). In this way users do not need to look up or memorize a mapping between our metrics and the RocksDB properties. To be clear, those are public RocksDB properties.
> >>>>>
> >>>>>> 2.a) KIP-471 uses "memtable" while in this KIP we use "mem-table"; also, "memtable" there is a prefix followed by the metric name. I'd suggest we keep them consistent, e.g. "num-immutable-mem-table" => "immutable-memtable-count", "cur-size-active-mem-table" => "active-memtable-bytes".
> >>>>>>
> >>>>>> 2.b) "immutable" is abbreviated as "imm" in some names but not in others. I'd suggest we do not use abbreviations across all names, e.g. "num-entries-imm-mem-tables" => "immutable-memtable-num-entries".
> >>>>>>
> >>>>>> 2.c) The "-size" / "-num" semantics are usually a bit unclear, and I'd suggest we use more concrete terms, e.g. "total-sst-files-size" => "total-sst-files-bytes", "num-live-versions" => "live-versions-count", "background-errors" => "background-errors-count".
> >>>>>>
> >>>>>> 3) Some metrics are a bit confusing, e.g.
> >>>>>>
> >>>>>> 3.a) What's the difference between "cur-size-all-mem-tables" and "size-all-mem-tables"?
> >>>>>
> >>>>> cur-size-all-mem-tables records the approximate size of the active and unflushed immutable memtables. Unflushed immutable memtables are memtables that are not yet flushed by the asynchronous flushing mechanism in RocksDB.
> >>>>>
> >>>>> size-all-mem-tables records the sizes recorded in cur-size-all-mem-tables but additionally also records pinned immutable memtables that are kept in memory to maintain write history in memory. As far as I understood, those are memtables that are flushed but there are still table readers (e.g. iterators) that use those memtables.
> >>>>>
> >>>>> I added a sentence to explain the difference.
> >>>>>
> >>>>> I guess it is worthwhile to have both of these metrics, because if size-all-mem-tables keeps increasing while cur-size-all-mem-tables does not, there may be an issue with the clean-up of table readers.
> >>>>>
> >>>>>> 3.b) The explanation of "estimate-table-readers-mem" does not read very clear to me either; does it refer to "estimate-sst-file-read-buffer-bytes"?
> >>>>>
> >>>>> No, this metric records the memory used by iterators as well as filters and indices, if the filters and indices are not maintained in the block cache. Basically, this metric reports the memory used outside the block cache to read data. I modified the description to make it clearer.
> >>>>>
> >>>>>> 3.c) How does "estimate-oldest-key-time" help with memory usage debugging?
> >>>>>
> >>>>> I do not consider this KIP to only help with monitoring of memory usage. I thought to expose all RocksDB properties that return an integer and that make sense for Kafka Streams. Admittedly, I did a bad job in the current KIP of explaining this in the motivation.
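The relationship between the two memtable properties can be made concrete with a small sketch (again assuming the RocksDB Java API; the class and method names are made up, and the interpretation follows the explanation above):

    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    public final class MemtableCheck {
        // size-all-mem-tables = cur-size-all-mem-tables + pinned immutable
        // memtables, so the difference approximates memory held only by
        // table readers that have not been cleaned up yet.
        public static long pinnedMemtableBytes(final RocksDB db) throws RocksDBException {
            final long cur = Long.parseLong(db.getProperty("rocksdb.cur-size-all-mem-tables"));
            final long all = Long.parseLong(db.getProperty("rocksdb.size-all-mem-tables"));
            return all - cur;
        }
    }

A steadily growing difference would point at iterators that are never closed, matching the clean-up issue described above.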
> >>>>>> 4) For my own education, does "estimate-pending-compaction-bytes" capture all the memory usage for compaction buffers?
> >>>>>
> >>>>> No, as far as I understand, this metric refers to bytes rewritten on disk. Basically, this metric relates to the write amplification of level compaction. I changed the description.
> >>>>>
> >>>>>> 5) This is just a nit comment to help readers better understand RocksDB: maybe we can explain in the wiki doc which parts of RocksDB use memory (block cache, OS cache, memtable, compaction buffer, read buffer), which of them are on-heap and which are off-heap, which can be hard bounded, which can only be soft bounded, and which cannot be bounded at all, etc.
> >>>>>
> >>>>> Good idea! Will look into it!
> >>>>>
> >>>>>> Guozhang
> >>>>>>
> >>>>>> On Mon, Jul 20, 2020 at 11:00 AM Bruno Cadonna <br...@confluent.io> wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> During the implementation of this KIP and after some discussion about RocksDB metrics, I decided to make some major modifications to this KIP and kick off the discussion again.
> >>>>>>>
> >>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-607%3A+Add+Metrics+to+Kafka+Streams+to+Report+Properties+of+RocksDB
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Bruno
> >>>>>>>
> >>>>>>> On 15.05.20 17:11, Bill Bejeck wrote:
> >>>>>>>> Thanks for the KIP, Bruno. Having sensible, easy-to-access RocksDB memory reporting will be a welcome addition.
> >>>>>>>>
> >>>>>>>> FWIW I also agree with having the metrics reported on a store level. I'm glad you changed the KIP to that effect.
> >>>>>>>>
> >>>>>>>> -Bill
> >>>>>>>>
> >>>>>>>> On Wed, May 13, 2020 at 6:24 PM Guozhang Wang <wangg...@gmail.com> wrote:
> >>>>>>>>> Hi Bruno,
> >>>>>>>>>
> >>>>>>>>> Sounds good to me.
> >>>>>>>>>
> >>>>>>>>> I think I'm just a bit more curious to see its impact on performance: as long as we have one INFO-level RocksDB metric, we'd have to turn on the scheduled RocksDB metrics recorder, whereas previously we could decide not to turn on the recorder at all if all metrics are set as DEBUG and we configure the INFO level in production. But this is an implementation detail anyways, and maybe the impact is negligible after all. We can check and re-discuss this afterwards :)
> >>>>>>>>>
> >>>>>>>>> Guozhang
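For context, the recording level discussed here is the metrics.recording.level Streams client config; a minimal sketch of switching it (using the standard config key; the class name is made up):

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class MetricsLevelExample {
        public static void main(final String[] args) {
            final Properties props = new Properties();
            // INFO is the default level; DEBUG additionally enables the
            // finer-grained metrics such as the KIP-471 statistics-based ones.
            props.put(StreamsConfig.METRICS_RECORDING_LEVEL_CONFIG, "DEBUG");
            // ... pass props to new KafkaStreams(topology, props) as usual.
        }
    }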
> >>>>>>>>> On Wed, May 13, 2020 at 9:34 AM Sophie Blee-Goldman <sop...@confluent.io> wrote:
> >>>>>>>>>> Thanks Bruno! I took a look at the revised KIP and it looks good to me.
> >>>>>>>>>>
> >>>>>>>>>> Sophie
> >>>>>>>>>>
> >>>>>>>>>> On Wed, May 13, 2020 at 6:59 AM Bruno Cadonna <br...@confluent.io> wrote:
> >>>>>>>>>>> Hi John,
> >>>>>>>>>>>
> >>>>>>>>>>> Thank you for the feedback!
> >>>>>>>>>>>
> >>>>>>>>>>> I agree, and I will change the KIP as I stated in my previous e-mail to Guozhang.
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Bruno
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, May 12, 2020 at 3:07 AM John Roesler <vvcep...@apache.org> wrote:
> >>>>>>>>>>>> Thanks, all.
> >>>>>>>>>>>>
> >>>>>>>>>>>> If you don't mind, I'll pitch in a few cents' worth.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In my life I've generally found more granular metrics to be more useful, as long as there's a sane way to roll them up. It does seem nice to see it on the per-store level. For roll-up purposes, the task and thread tags should be sufficient.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I think the only reason we make some metrics Debug is that _recording_ them can be expensive. If there's no added expense, I think we can just register store-level metrics at Info level.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks for the KIP, Bruno!
> >>>>>>>>>>>> -John
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Mon, May 11, 2020, at 17:32, Guozhang Wang wrote:
> >>>>>>>>>>>>> Hello Sophie / Bruno,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I've also thought about the leveling question, and one motivation I had for setting it at the instance level is that we want to expose it at INFO level: today our reporting levels are not very fine-grained --- which I think is sth. worth improving in itself --- such that one has to either turn on all DEBUG metrics recording or none of it. If we could allow users to e.g. specify "turn on streams-metrics and stream-state-metrics, but not others", then I think it should be just at store-level. However, right now if we set it at store-level it would be DEBUG and not exposed by default.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> So it seems we have several options in addition to the proposed one:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> a) we set it at store-level as INFO; but then one can argue why this is INFO while others (bytes-written, etc.) are DEBUG.
> >>>>>>>>>>>>> b) we set it at store-level as DEBUG, believing that we do not usually need to turn it on.
> >>>>>>>>>>>>> c) maybe, we can set it at task-level (? I'm not so sure myself about this.. :P) as INFO.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Guozhang
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Mon, May 11, 2020 at 12:29 PM Sophie Blee-Goldman <sop...@confluent.io> wrote:
> >>>>>>>>>>>>>> Hey Bruno,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks for the KIP! I have one high-level concern, which is that we should consider reporting these metrics on the per-store level rather than instance-wide. I know I was the one who first proposed making it instance-wide, so bear with me:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> While I would still argue that the instance-wide memory usage is probably the most *useful*, exposing the metrics at the store level does not prevent users from monitoring the instance-wide memory. They should be able to roll up all the store-level metrics on an instance to compute the total off-heap memory. But rolling them up for the users does prevent them from using the metrics to debug rare cases where one store may be using significantly more memory than expected.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> It's also worth considering that some users may be using the bounded memory config setter to put a cap on the off-heap memory of the entire process, in which case the memory usage metric of any one store would reflect the memory usage of the entire instance. In that case any effort to roll up the memory usages ourselves would just be wasted.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Sorry for the reversal, but after a second thought I'm pretty strongly in favor of reporting these at the store level.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>> Sophie
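The roll-up Sophie describes can be done on the user side over JMX; a rough sketch (the MBean type, tag names, and attribute name are assumed from the Kafka Streams metrics naming conventions, so verify them against the released docs):

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class OffHeapRollUp {
        public static void main(final String[] args) throws Exception {
            final MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            // Match the RocksDB state store metrics of all threads and tasks.
            final ObjectName pattern = new ObjectName(
                "kafka.streams:type=stream-state-metrics,thread-id=*,task-id=*,rocksdb-state-id=*");
            long totalBytes = 0L;
            for (final ObjectName name : server.queryNames(pattern, null)) {
                totalBytes += ((Number) server.getAttribute(name, "size-all-mem-tables")).longValue();
            }
            System.out.println("Approximate memtable memory: " + totalBytes + " bytes");
        }
    }

Note that with a shared cache and write buffer manager (as in the bounded-memory setter above) each store may report the shared total, in which case summing over-counts; that is exactly the caveat raised in the paragraph above.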
> >>>>>>>>>>>>>> On Wed, May 6, 2020 at 8:41 AM Bruno Cadonna <br...@confluent.io> wrote:
> >>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I'd like to discuss KIP-607, which aims to add RocksDB memory usage metrics to Kafka Streams.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-607%3A+Add+Metrics+to+Record+the+Memory+Used+by+RocksDB+to+Kafka+Streams
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>> Bruno
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>> -- Guozhang
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> -- Guozhang
> >>>>
> >>>> --
> >>>> -- Guozhang

--
-- Guozhang