[
https://issues.apache.org/jira/browse/SAMZA-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15319620#comment-15319620
]
Yi Pan (Data Infrastructure) commented on SAMZA-957:
----------------------------------------------------
Merged and submitted. Thanks!
> Avoid unnecessary KV Store flushes (part 3)
> -------------------------------------------
>
> Key: SAMZA-957
> URL: https://issues.apache.org/jira/browse/SAMZA-957
> Project: Samza
> Issue Type: Bug
> Reporter: Jake Maes
> Assignee: Jake Maes
> Fix For: 0.10.1
>
> Attachments: SAMZA-957_1.patch
>
>
> We had an issue where RocksDB performance severely degraded for 23 hours and
> then resolved itself. To troubleshoot the issue I gathered some samples of
> the compaction stats from the RocksDB log and engaged with the RocksDB team
> via an existing, related issue:
> https://github.com/facebook/rocksdb/issues/696#issuecomment-222549220
> They pointed out that the job was flushing excessively:
> {quote}
> If you overload RocksDB with work (i.e. do bunch of writes really fast, or in
> your case, bunch of small flushes), it will begin stalling writes while the
> compactions (deferred work) completes. An interesting thing with RocksDB and
> LSM architecture is that the more behind you are on compactions, the more
> expensive the compactions are (due to increased write amplifications and
> single-threadness of L0->L1 compaction). So our write stalls have to be tuned
> exactly right for RocksDB to behave well with extremely high write rate.
> {quote}
> Looking through our commit history, I see that SAMZA-812 and SAMZA-873 were
> both intended to address this issue by reducing the number of flushes in
> CachedStore.
> To be fair, the job in question did not have the SAMZA-873 patch, but I see
> even more room for improvement. Namely, CachedStore should *never* flush the
> underlying store unless its own flush() is called. It can purge its dirty
> items to the underlying store, trading some performance for correctness, but
> flushing is excessive. So, this patch will remove the flushes from the all()
> and range() methods, simplify the LRU logic, and add a unit test to verify
> and explain the proper LRU behavior.
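
The distinction drawn in the description, between purging dirty entries and flushing the underlying store, can be illustrated with a minimal Java sketch. This is not the Samza CachedStore code; the class names CountingStore and WriteBackCache, and the simplified String-keyed API, are hypothetical and exist only for illustration. The sketch assumes that "purge" means writing pending dirty entries to the underlying store without asking it to flush.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for the underlying store; it only counts flush calls. */
class CountingStore {
    final TreeMap<String, String> data = new TreeMap<>();
    int flushCount = 0;

    void put(String key, String value) { data.put(key, value); }
    String get(String key) { return data.get(key); }
    List<String> all() { return new ArrayList<>(data.keySet()); }
    void flush() { flushCount++; } // the expensive call in a real LSM-based store
}

/** Hypothetical write-back LRU cache: dirty entries are purged, never flushed implicitly. */
class WriteBackCache {
    private final CountingStore store;
    private final int capacity;
    private final LinkedHashMap<String, String> dirty = new LinkedHashMap<>();
    private final LinkedHashMap<String, String> cache;

    WriteBackCache(CountingStore store, int capacity) {
        this.store = store;
        this.capacity = capacity;
        // accessOrder=true makes the map iterate from least to most recently used.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                if (size() <= WriteBackCache.this.capacity) {
                    return false;
                }
                // Evicting a dirty entry would lose the write, so purge first.
                if (dirty.containsKey(eldest.getKey())) {
                    purgeDirty();
                }
                return true;
            }
        };
    }

    void put(String key, String value) {
        dirty.put(key, value);
        cache.put(key, value);
    }

    String get(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        String value = store.get(key);
        if (value != null) {
            cache.put(key, value);
        }
        return value;
    }

    /** Purges so the store's view is complete, but does not flush it. */
    List<String> all() {
        purgeDirty();
        return store.all();
    }

    /** The only path that flushes the underlying store. */
    void flush() {
        purgeDirty();
        store.flush();
    }

    private void purgeDirty() {
        for (Map.Entry<String, String> entry : dirty.entrySet()) {
            store.put(entry.getKey(), entry.getValue());
        }
        dirty.clear();
    }
}
```

In this sketch, all() and LRU eviction only write pending entries through to the store, so reads stay correct, while the flush count on the underlying store changes only when the cache's own flush() is invoked.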
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)