Thanks, Patrick. Yes, we are using the same JVM version and GC configurations when running the two tests. I have checked the GC metrics and also the heap dump of the 3.6, the GC pause and the memory usage look okay.
Best, Li On Sun, Feb 21, 2021 at 3:34 PM Patrick Hunt <ph...@apache.org> wrote: > On Sun, Feb 21, 2021 at 3:28 PM Li Wang <li4w...@gmail.com> wrote: > > > Hi Enrico, Sushant, > > > > I re-run the perf test with the data consistency check feature disabled > > (i.e. -Dzookeeper.digest.enabled=false), the write performance issue of > 3.6 > > is still there. > > > > With everything exactly the same, the throughput of 3.6 was only 1/2 of > 3.4 > > and the max latency was more than 8 times. > > > > Any other points or thoughts? > > > > > In the past I've noticed a big impact of GC when doing certain performance > measurements. I assume you are using the same JVM version and GC when > running the two tests? Perhaps our memory footprint has expanded over time. > You should rule out GC by running with gc logging turned on with both > versions and compare the impact. > > Regards, > > Patrick > > > > Cheers, > > > > Li > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Sat, Feb 20, 2021 at 9:04 PM Li Wang <li4w...@gmail.com> wrote: > > > > > Thanks Sushant and Enrico! > > > > > > This is a really good point. According to the 3.6 documentation, the > > > feature is disabled by default. > > > > > > https://zookeeper.apache.org/doc/r3.6.2/zookeeperAdmin.html#ch_administration > > . > > > However, checking the code, the default is enabled. > > > > > > Let me set the zookeeper.digest.enabled to false and see how the write > > > operation performs. > > > > > > Best, > > > > > > Li > > > > > > > > > > > > > > > On Fri, Feb 19, 2021 at 1:32 PM Sushant Mane <sushantma...@gmail.com> > > > wrote: > > > > > >> Hi Li, > > >> > > >> On 3.6.2 consistency checker (adhash based) is enabled by default: > > >> > > >> > > > https://github.com/apache/zookeeper/blob/803c7f1a12f85978cb049af5e4ef23bd8b688715/zookeeper-server/src/main/java/org/apache/zookeeper/server/ZooKeeperServer.java#L136 > > >> . > > >> It is not present in ZK 3.4.14. > > >> > > >> This feature does have some impact on write performance. > > >> > > >> Thanks, > > >> Sushant > > >> > > >> > > >> On Fri, Feb 19, 2021 at 12:50 PM Enrico Olivelli <eolive...@gmail.com > > > > >> wrote: > > >> > > >> > Li, > > >> > I wonder of we have some new throttling/back pressure mechanisms > that > > is > > >> > enabled by default. > > >> > > > >> > Does anyone has some pointer to relevant implementations? > > >> > > > >> > > > >> > Enrico > > >> > > > >> > Il Ven 19 Feb 2021, 19:46 Li Wang <li4w...@gmail.com> ha scritto: > > >> > > > >> > > Hi, > > >> > > > > >> > > We switched to Netty on both client side and server side and the > > >> > > performance issue is still there. Anyone has any insights on what > > >> could > > >> > be > > >> > > the cause of higher latency? > > >> > > > > >> > > Thanks, > > >> > > > > >> > > Li > > >> > > > > >> > > > > >> > > > > >> > > On Mon, Feb 15, 2021 at 2:17 PM Li Wang <li4w...@gmail.com> > wrote: > > >> > > > > >> > > > Hi Enrico, > > >> > > > > > >> > > > > > >> > > > Thanks for the reply. > > >> > > > > > >> > > > > > >> > > > 1. We are using NIO based stack, not Netty based yet. > > >> > > > > > >> > > > 2. Yes, here are some metrics on the client side. > > >> > > > > > >> > > > > > >> > > > 3.6: throughput: 7K, failure: 81215228, Avg Latency: 57ms, Max > > >> Latency > > >> > > 31s > > >> > > > > > >> > > > 3.4: throughput: 15k, failure: 0, Avg Latency: 30ms, Max > > Latency: > > >> > 1.6s > > >> > > > > > >> > > > > > >> > > > 3. Yes, the JVM and zoo.cfg config are the exact same > > >> > > > > > >> > > > 10G of Heap > > >> > > > > > >> > > > 13G of Memory > > >> > > > > > >> > > > 5 Participante > > >> > > > > > >> > > > 5 Observere > > >> > > > > > >> > > > Client session timeout: 3000ms > > >> > > > > > >> > > > Server min session time: 4000ms > > >> > > > > > >> > > > > > >> > > > > > >> > > > 4. Yes, there are two types of WARN logs and many “Expiring > > >> session” > > >> > > > INFO log > > >> > > > > > >> > > > > > >> > > > 2021-02-15 22:04:36,506 [myid:4] - WARN > > >> > > > [NIOWorkerThread-7:NIOServerCnxn@365] - Unexpected exception > > >> > > > > > >> > > > EndOfStreamException: Unable to read additional data from > client, > > it > > >> > > > probably closed the socket: address = /100.108.63.116:43366, > > >> session = > > >> > > > 0x400189fee9a000b > > >> > > > > > >> > > > at > > >> > > > > > >> > > > > >> > > > >> > > > org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:164) > > >> > > > > > >> > > > at > > >> > > org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:327) > > >> > > > > > >> > > > at > > >> > > > > > >> > > > > >> > > > >> > > > org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522) > > >> > > > > > >> > > > at > > >> > > > > > >> > > > > >> > > > >> > > > org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) > > >> > > > > > >> > > > at > > >> > > > > > >> > > > > >> > > > >> > > > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > > >> > > > > > >> > > > at > > >> > > > > > >> > > > > >> > > > >> > > > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > > >> > > > > > >> > > > at java.base/java.lang.Thread.run(Thread.java:834) > > >> > > > > > >> > > > > > >> > > > 2021-02-15 22:05:14,428 [myid:4] - WARN > > >> > > > [SyncThread:4:SyncRequestProcessor@188] - Too busy to snap, > > >> skipping > > >> > > > > > >> > > > > > >> > > > 2021-02-15 22:01:51,427 [myid:4] - INFO > > >> > > > [SessionTracker:ZooKeeperServer@610] - Expiring session > > >> > > > 0x400189fee9a001e, timeout of 4000ms exceeded > > >> > > > > > >> > > > > > >> > > > > > >> > > > 5. Yes we upgrade both the client and the server to 3.6. > Actually, > > >> the > > >> > > > issue happened with the combinations of > > >> > > > > > >> > > > > > >> > > > 3.4 client and 3.6 server > > >> > > > > > >> > > > 3.6 client and 3.6 server > > >> > > > > > >> > > > Please let me know if you need any additional info. > > >> > > > > > >> > > > Thanks, > > >> > > > > > >> > > > Li > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > On Mon, Feb 15, 2021 at 1:44 PM Li Wang <li4w...@gmail.com> > > wrote: > > >> > > > > > >> > > >> Hi Enrico, > > >> > > >> > > >> > > >> Thanks for the reply. > > >> > > >> > > >> > > >> 1. We are using direct NIO based stack, not Netty based yet. > > >> > > >> 2. Yes, on the client side, here are the metrics > > >> > > >> > > >> > > >> 3.6: > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> On Mon, Feb 15, 2021 at 10:44 AM Enrico Olivelli < > > >> eolive...@gmail.com > > >> > > > > >> > > >> wrote: > > >> > > >> > > >> > > >>> IIRC The main difference is about the switch to Netty 4 and > > about > > >> > using > > >> > > >>> more DirectMemory. Are you using the Netty based stack? > > >> > > >>> > > >> > > >>> Apart from that macro difference there have been many many > > changes > > >> > > since > > >> > > >>> 3.4. > > >> > > >>> > > >> > > >>> Do you have some metrics to share? > > >> > > >>> Are the JVM configurations and zoo.cfg configuration equals > to > > >> each > > >> > > >>> other? > > >> > > >>> > > >> > > >>> Do you see warnings on the server logs? > > >> > > >>> > > >> > > >>> Did you upgrade both the client and the server or only the > > server? > > >> > > >>> > > >> > > >>> Enrico > > >> > > >>> > > >> > > >>> > > >> > > >>> Il Lun 15 Feb 2021, 18:30 Li Wang <li4w...@gmail.com> ha > > scritto: > > >> > > >>> > > >> > > >>> > Hi, > > >> > > >>> > > > >> > > >>> > We want to upgrade from 3.4.14 to 3.6.2. During the > > >> perform/load > > >> > > >>> > comparison test, it was found that the performance of 3.6 > has > > >> been > > >> > > >>> > significantly degraded compared to 3.4 for the write > > operation. > > >> > Under > > >> > > >>> the > > >> > > >>> > same load, there was a huge number of SessionExpired and > > >> > > ConnectionLoss > > >> > > >>> > errors in 3.6 while no such errors in 3.4. > > >> > > >>> > > > >> > > >>> > The load testing is 500 concurrent users with a cluster of 5 > > >> > > >>> participants > > >> > > >>> > and 5 observers. The min session timeout on the server side > is > > >> > > 4000ms. > > >> > > >>> > > > >> > > >>> > I wonder if anyone has seen the same issue and has any > > insights > > >> on > > >> > > what > > >> > > >>> > could be the cause of the performance degradation. > > >> > > >>> > > > >> > > >>> > Thanks > > >> > > >>> > > > >> > > >>> > Li > > >> > > >>> > > > >> > > >>> > > >> > > >> > > >> > > > > >> > > > >> > > > > > >