Hi Yun,

Thanks for your detailed description of the progress in the Flink and RocksDB communities. There are more than 1,200 jobs using RocksDB as the state backend at ByteDance, and we have met several of the problems mentioned in the JIRA issues you referred to:

(1) Memory management: for large-scale jobs (10TB+ state), it is hard to tune the memory usage because RocksDB imposes no strict memory control. Currently we have to estimate the memory usage manually based on RocksDB's wiki, which increases our maintenance cost a lot.

(2) DeleteRange support: we benchmarked rescaling performance and found that the time cost grows to several minutes once a task's state is larger than 10GB. I'm glad to see such improvements being merged after upgrading the RocksDB version.

(3) ARM support: we added support for the ARM platform on our own last year with some hacks in the code, and it's great to see that RocksDB now has an official release for the ARM platform.

I think the new features (and bugfixes) are more important for us, so I'm +1 for this.

Best,
Jiayi Liao
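As an aside for readers following point (1): below is a minimal sketch of what bounding RocksDB's memory from Java looks like, assuming RocksJava 6.x. Sharing one LRUCache between the block cache and a WriteBufferManager makes memtables and cached blocks draw from a single budget; Flink's managed-memory mode wires things up roughly along these lines. The class name, path, and byte sizes are placeholder values, not anything from Flink itself.

    import org.rocksdb.BlockBasedTableConfig;
    import org.rocksdb.Cache;
    import org.rocksdb.LRUCache;
    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;
    import org.rocksdb.WriteBufferManager;

    public class BoundedRocksDbSketch {
        static { RocksDB.loadLibrary(); }

        public static void main(String[] args) throws RocksDBException {
            long totalBudget = 512L * 1024 * 1024; // placeholder: 512 MB overall budget
            // One cache shared by the block cache and the write buffer manager,
            // so memtables and cached blocks are charged against the same budget.
            Cache cache = new LRUCache(totalBudget);
            WriteBufferManager writeBufferManager =
                    new WriteBufferManager(totalBudget / 2, cache);
            try (Options options = new Options()
                            .setCreateIfMissing(true)
                            .setWriteBufferManager(writeBufferManager)
                            .setTableFormatConfig(
                                    new BlockBasedTableConfig().setBlockCache(cache));
                 RocksDB db = RocksDB.open(options, "/tmp/bounded-rocksdb")) {
                db.put("key".getBytes(), "value".getBytes());
            }
        }
    }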
On Thu, Aug 5, 2021 at 1:50 AM Yun Tang <myas...@live.com> wrote:

> Hi Yuval,
>
> Upgrading the RocksDB version has been a long story since Flink-1.10. When we first planned to introduce the write buffer manager to help control RocksDB's memory usage, we actually wanted to bump up to RocksDB-5.18 from the then-current RocksDB-5.17. However, we found a performance regression in our micro benchmark on state operations [1] when bumping to RocksDB-5.18. We did not figure out the root cause at that time and decided to cherry-pick the write buffer manager commits into our own FRocksDB [2]. We finally released our own frocksdbjni-5.17.2-artisans-2.0 at that time.
>
> As time went on, more and more bugs and missing features were reported against the old RocksDB version, such as:
>
> 1. Cannot support the ARM platform [3]
> 2. Does not have a stable deleteRange API, which is useful for Flink scale-out [4]
> 3. Cannot support a strict block cache [5]
> 4. Checkpoints might get stuck when using the UNIVERSAL compaction strategy [6]
> 5. Uncontrolled log size made us disable the RocksDB internal LOG [7]
> 6. RocksDB's optimizeForPointLookup option might cause data loss [8]
> 7. The dummy entry used for memory control in RocksDB-5.17 is too large, leading to performance problems [9]
> 8. Cannot support alpine-based images.
> 9. ...
>
> Some of these bugs have been worked around, and some are still open.
>
> We decided to make some changes starting from Flink-1.12. First of all, we reported the performance regression between RocksDB-5.17 and RocksDB-5.18 to the RocksDB community [10]. However, since the RocksDB-5.x versions are rather old for that community, and RocksJava usage might not be a core concern for the Facebook developers, we did not get useful replies. Thus, we decided to figure out the root cause of the regression ourselves. Fortunately, we found the cause by binary-searching the commits between RocksDB-5.17 and RocksDB-5.18, and updated the original thread [10]. In short, the regression is due to the different implementations of `__thread` and `thread_local` in gcc, which has a bigger impact under dynamic loading [11], and dynamic loading is exactly what the current RocksJava jar package does. With my patch [12], the performance regression disappears when comparing RocksDB-5.18 with RocksDB-5.17.
>
> Unfortunately, RocksDB-5.18 still has many bugs, and we wanted to bump to RocksDB-6.x. However, another performance regression appeared, even with my patch [12]. From the previous experience, we knew we had to verify the built .so files with our Java-based benchmark instead of RocksDB's built-in db_bench. I started to search the 1340+ commits from RocksDB-5.18 to RocksDB-6.11 to find the performance problem. However, this time I did not figure out the root cause even after spending several weeks. The performance goes up and down across those commits, and I could not isolate *the* commit that introduced the regression. Take the commit integrating the block cache tracer into the block-based table reader [13] as an example: I noticed that this commit caused a slight performance regression, possibly from the useless usage accounting in read operations, but the problematic code was changed again in later commits. Thus, after several weeks of digging, I had to give up the endless search through those thousand-plus commits, at least temporarily. Since the RocksDB community does not seem to make its project management system public, unlike Apache's open JIRA systems, we do not know which benchmarks they actually run before releasing each version to guard performance.
>
> With my patch [12] applied on the latest RocksDB-6.20.3, we get the Nexmark results in the original thread sent by Stephan, and the performance is close in many real-world cases. We also hope that new features, such as direct buffer support in RocksJava [14], can help improve RocksDB's performance in the future.
>
> Hope this explains what we have already done.
>
> [1] https://github.com/apache/flink-benchmarks
> [2] https://github.com/ververica/frocksdb/tree/FRocksDB-5.17.2
> [3] https://issues.apache.org/jira/browse/FLINK-13598
> [4] https://issues.apache.org/jira/browse/FLINK-21321
> [5] https://github.com/facebook/rocksdb/issues/6247
> [6] https://issues.apache.org/jira/browse/FLINK-21726
> [7] https://issues.apache.org/jira/browse/FLINK-15068
> [8] https://issues.apache.org/jira/browse/FLINK-17800
> [9] https://github.com/facebook/rocksdb/pull/5175
> [10] https://github.com/facebook/rocksdb/issues/5774
> [11] http://david-grs.github.io/tls_performance_overhead_cost_linux/
> [12] https://github.com/ververica/frocksdb/pull/19
> [13] https://github.com/facebook/rocksdb/pull/5441/
> [14] https://github.com/facebook/rocksdb/pull/2283
>
> Best,
> Yun Tang
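For readers who have not used it: deleteRange (point 2 in Yun's list, and the improvement behind the faster rescaling [4]) lets a restored task drop all key groups it no longer owns with a single range tombstone instead of iterating and deleting key by key. A minimal sketch, assuming RocksJava 6.x; the two-byte key-group prefix and the helper names are illustrative, loosely modeled on how Flink prefixes keyed state with the key-group id, and are not actual Flink code.

    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    public class RescaleCleanupSketch {

        // Drop every key in [beginKeyGroup, endKeyGroup) with one range
        // tombstone, instead of scanning the range and issuing a delete
        // per key as pre-deleteRange code had to do.
        public static void dropKeyGroups(RocksDB db, int beginKeyGroup, int endKeyGroup)
                throws RocksDBException {
            db.deleteRange(keyGroupPrefix(beginKeyGroup), keyGroupPrefix(endKeyGroup));
        }

        // Illustrative two-byte big-endian key-group prefix.
        private static byte[] keyGroupPrefix(int keyGroup) {
            return new byte[] {(byte) (keyGroup >>> 8), (byte) keyGroup};
        }
    }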
> On Wed, Aug 4, 2021 at 2:36 PM Yuval Itzchakov <yuva...@gmail.com> wrote:
>
> We are heavy users of RocksDB and have had several issues with memory management in Kubernetes; most of them actually went away when we upgraded from Flink 1.9 to 1.13.
>
> Do we know why there's such a huge performance regression? Can we improve this somehow by tweaking some flags? It would be great to see a more in-depth explanation of the gains vs. losses of upgrading.
>
> On Wed, Aug 4, 2021 at 3:08 PM Stephan Ewen <se...@apache.org> wrote:
>
> ------------------------------
> *From:* Nico Kruber <n...@apache.org>
> *Sent:* Thursday, August 5, 2021 0:10
> *To:* u...@flink.apache.org <u...@flink.apache.org>; dev <dev@flink.apache.org>
> *Subject:* Re: [ANNOUNCE] RocksDB Version Upgrade and Performance
>
> That's actually also what I'm seeing most of the time and what I'd expect to improve with the newer RocksDB version. Hence, I'd also favour the upgrade even if there is a slight catch with respect to performance; we should, however, continue to investigate this together with the RocksDB community.
>
> Nico
>
> On Wednesday, 4 August 2021 14:26:32 CEST David Anderson wrote:
> > I am hearing quite often from users who are struggling to manage memory usage, and these are all users using RocksDB. While I don't know for certain that RocksDB is the cause in every case, from my perspective, getting the better memory stability of version 6.20 in place is critical.
> >
> > Regards,
> > David
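The "better memory stability" mentioned here and in Stephan's announcement below is largely the strict cache control that old RocksDB-5.17 could not offer (issue [5] in Yun's list). A minimal sketch of what that opt-in looks like, assuming RocksJava 6.x; the class name, capacity, and path are placeholders.

    import org.rocksdb.BlockBasedTableConfig;
    import org.rocksdb.LRUCache;
    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    public class StrictCacheSketch {
        static { RocksDB.loadLibrary(); }

        public static void main(String[] args) throws RocksDBException {
            // strictCapacityLimit = true: the cache fails inserts rather than
            // overshooting its capacity, keeping the native memory footprint
            // predictable inside a container with a hard memory limit.
            try (LRUCache blockCache =
                         new LRUCache(256L * 1024 * 1024, -1, /* strictCapacityLimit */ true);
                 Options options = new Options()
                         .setCreateIfMissing(true)
                         .setTableFormatConfig(
                                 new BlockBasedTableConfig().setBlockCache(blockCache));
                 RocksDB db = RocksDB.open(options, "/tmp/strict-cache-db")) {
                db.put("key".getBytes(), "value".getBytes());
            }
        }
    }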
> > On Wed, Aug 4, 2021 at 8:08 AM Stephan Ewen <se...@apache.org> wrote:
> > > Hi all!
> > >
> > > *!!! If you are a big user of the Embedded RocksDB State Backend and have performance sensitive workloads, please read this !!!*
> > >
> > > I want to quickly raise some awareness for a RocksDB version upgrade we plan to do, and the possible impact on application performance.
> > >
> > > *We plan to upgrade RocksDB to version 6.20.* That version of RocksDB unfortunately introduces some non-trivial performance regressions. In our Nexmark benchmark, at least one query is up to 13% slower. With some fixes, this can be improved, but even then there is an overall *regression of up to 6% in some queries*. (See the attached table for results from the relevant Nexmark benchmark queries.)
> > >
> > > We would do this update nonetheless, because we need to get new features and bugfixes from RocksDB in.
> > >
> > > Please respond to this mail thread if you have major concerns about this.
> > >
> > > *### Fallback Plan*
> > >
> > > Optionally, we could fall back to Plan B, which is to upgrade RocksDB only to version 5.18.4, which has no performance regression (after applying a custom patch).
> > >
> > > While this spares us the performance degradation of RocksDB 6.20.x, it has multiple disadvantages:
> > > - Does not include the better memory stability (strict cache control)
> > > - Misses out on some new features which some users asked about
> > > - Does not have the latest RocksDB bugfixes
> > >
> > > The last point is especially bad in my opinion. While we can cherry-pick some bugfixes back (and have done so in the past), users typically run into an issue first and need to trace it back to RocksDB before one of the committers can find the relevant patch on RocksDB master and backport it. That isn't the greatest user experience.
> > >
> > > Because of those disadvantages, we would prefer to do the upgrade to the newer RocksDB version despite the unfortunate performance regression.
> > >
> > > Best,
> > > Stephan
>
> --
> Dr. Nico Kruber | Solutions Architect
>
> Follow us @VervericaData Ververica
> --
> Join Flink Forward - The Apache Flink Conference
> Stream Processing | Event Driven | Real Time
> --
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
> --
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton Wehner
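A closing note on how regressions like the 13% one above get caught: as Yun explained, RocksDB's own db_bench does not show them, because the `__thread`/`thread_local` cost only appears through the dynamically loaded JNI library, so verification has to happen from Java. A minimal JMH sketch in the spirit of flink-benchmarks [1]; the class name, value size, and path are illustrative and not taken from that repository, and it assumes JMH and RocksJava are on the classpath.

    import java.nio.charset.StandardCharsets;
    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;
    import org.openjdk.jmh.annotations.State;
    import org.openjdk.jmh.annotations.TearDown;
    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    @State(Scope.Thread)
    public class RocksJavaGetPutBenchmark {
        private Options options;
        private RocksDB db;
        private byte[] key;
        private byte[] value;

        @Setup
        public void open() throws RocksDBException {
            RocksDB.loadLibrary(); // loads the .so under test via JNI
            options = new Options().setCreateIfMissing(true);
            db = RocksDB.open(options, "/tmp/rocksjava-bench");
            key = "key".getBytes(StandardCharsets.UTF_8);
            value = new byte[128];
            db.put(key, value);
        }

        // Measures the JNI put path, where the TLS-related regression showed up.
        @Benchmark
        public void put() throws RocksDBException {
            db.put(key, value);
        }

        // Measures the JNI get path.
        @Benchmark
        public byte[] get() throws RocksDBException {
            return db.get(key);
        }

        @TearDown
        public void close() {
            db.close();
            options.close();
        }
    }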