Hi,

On Thu, Jun 4, 2020 at 9:28 PM Rémy Maucherat <r...@apache.org> wrote:
> On Thu, Jun 4, 2020 at 6:16 PM Mark Thomas <ma...@apache.org> wrote:
>
>> On 04/06/2020 16:33, Martin Grigorov wrote:
>> > Hi team,
>> >
>> > I am profiling our application and I've found that Tomcat uses its own
>> > SynchronizedQueue/SynchronizedStack implementations, which synchronize
>> > at the instance level, i.e. their methods are 'synchronized'.
>
> Can you work off 10 (master) first?

The application is based on Spring Boot and my attempts to migrate it to
Jakarta EE were not successful. But I've created a very simple Servlet and
I will try it on Tomcat 10.x soon:

import java.io.IOException;

import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet(urlPatterns = "/test")
public class PlainTextServlet extends HttpServlet {

    public static final String CONTENT = "Hello world!";
    public static final int CONTENT_LENGTH = CONTENT.length();

    @Override
    protected void doGet(final HttpServletRequest req,
                         final HttpServletResponse resp) throws IOException {
        resp.setContentType("text/plain");
        resp.setContentLength(CONTENT_LENGTH);
        // print(), not println(), so the body matches the declared length
        resp.getWriter().print(CONTENT);
    }
}

For load testing I use wrk:

wrk -c96 -t8 -d60s https://host:port/test

The GC JVM arguments are: -Xloggc:./gc.log -XX:+PrintGCDetails -verbose:gc

JDK 1.8.0 b252

The GC logs are:
- Tomcat 9.0.x (uses SynchronizedQueue/Stack):
  https://gist.github.com/martin-g/d2570e7a6896e4d094ce548ceea3adb6
- Tomcat 9.0.x with my changes:
  https://gist.github.com/martin-g/52c7d3a883b37e9bcd11ad6430800852

I've uploaded them to https://gceasy.io/ and the charts are similar.

The summary from wrk with SynchronizedQueue/Stack is:

Running 1m test @ http://192.168.0.206:8080
  8 threads and 96 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.24ms  788.15us  19.46ms   91.44%
    Req/Sec    10.04k     1.64k   12.45k    79.48%
  4797193 requests in 1.00m, 600.19MB read
Requests/sec:  79834.45
Transfer/sec:      9.99MB

I didn't preserve the summary from the lock-free run, but it was similar -
the throughput stays around 80K req/sec, plus or minus a few thousand.
That is, the throughput does not improve after my changes!
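For context, the contention shows up because every public method of the
stack is synchronized on the instance. A simplified sketch of that shape
(illustrative only - the names and sizing policy are mine, not the actual
org.apache.tomcat.util.collections.SynchronizedStack source):

```java
import java.util.Arrays;

// Sketch of an instance-synchronized stack in the style of Tomcat's
// SynchronizedStack. All worker threads serialize on 'this', which is
// the monitor the lock profile points at.
public class SimpleSyncStack<T> {

    private Object[] stack = new Object[128];
    private int index = -1;

    public synchronized boolean push(T obj) {
        index++;
        if (index == stack.length) {
            // Grow on demand; the array itself is reused across requests
            stack = Arrays.copyOf(stack, stack.length * 2);
        }
        stack[index] = obj;
        return true;
    }

    @SuppressWarnings("unchecked")
    public synchronized T pop() {
        if (index == -1) {
            return null; // empty - the caller allocates a fresh object
        }
        T result = (T) stack[index];
        stack[index--] = null; // drop the reference so it can be GC'd
        return result;
    }
}
```

The point of the design is GC-friendliness: the backing array is reused, so
push/pop allocate nothing; the cost is a single monitor shared by all threads.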
Only the lock profiling results become better (Async Profiler:
./profiler.sh -e lock -d 50 -f lock-results.txt PID).

Martin

>> ~8 years ago, when I was profiling Tomcat and wrote those, having
>> something targeted at Tomcat's specific use case made a measurable
>> improvement.
>
> Yes, the SyncStack was the go-to structure for all reuse. Not sure if
> it's the best.
>
> I already removed some of the usage in 10:
> - No more block poller and its pooling/recycling [I had to pull its
>   removal out of 9 due to an unfortunate regression; this could be
>   brought back now that it seems to work]
> - No recycling of trivial objects
>
> The main use should now be the recycled processors and the recycled
> channels (it would be bad to not reuse them since they may use direct
> buffers ...).
>
>> I am more than prepared to believe that things have moved on since then,
>> but I'd be interested in seeing the same numbers with the latest Java 8
>> from AdoptOpenJDK rather than Java 14.
>
> +1, finally there could be some differences between JDKs; for example,
> they have now started to actually improve NIO 1 again.
>
> Rémy
>
>> I'd also like to see overall throughput numbers and the GC frequency
>> for each test scenario.
>>
>> Mark
>>
>> > The javadoc for those two classes says that they are preferred
>> > because they are GC-friendlier than
>> > java.util.concurrent.ConcurrentLinkedQueue.
>> >
>> > I use Async Profiler
>> > (https://github.com/jvm-profiling-tools/async-profiler) to profile
>> > CPU, memory, locks, and a few lower-level event types (from perf).
>> > Issuing "./profiler.sh -e lock -d 60 PID" against 9.0.x shows
>> > reports like this:
>> >
>> > Started [lock] profiling
>> >
>> > --- Execution profile ---
>> > Total samples       : 713834
>> >
>> > Frame buffer usage  : 0.0288%
>> >
>> > --- 1106165618177 ns (58.97%), 395609 samples
>> >   [ 0] org.apache.tomcat.util.collections.SynchronizedStack
>> >   [ 1] org.apache.tomcat.util.collections.SynchronizedStack.push
>> >   [ 2] org.apache.tomcat.util.net.NioBlockingSelector.write
>> >   [ 3] org.apache.tomcat.util.net.NioSelectorPool.write
>> >   [ 4] org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doWrite
>> >   [ 5] org.apache.tomcat.util.net.SocketWrapperBase.doWrite
>> >   [ 6] org.apache.tomcat.util.net.SocketWrapperBase.flushBlocking
>> >   [ 7] org.apache.tomcat.util.net.SocketWrapperBase.flush
>> >   [ 8] org.apache.coyote.http11.Http11OutputBuffer$SocketOutputBuffer.end
>> >   [ 9] org.apache.coyote.http11.filters.IdentityOutputFilter.end
>> >   [10] org.apache.coyote.http11.Http11OutputBuffer.end
>> >   [11] org.apache.coyote.http11.Http11Processor.finishResponse
>> >   [12] org.apache.coyote.AbstractProcessor.action
>> >   [13] org.apache.coyote.Response.action
>> >   [14] org.apache.catalina.connector.OutputBuffer.close
>> >   [15] org.apache.catalina.connector.Response.finishResponse
>> >   [16] org.apache.catalina.connector.CoyoteAdapter.service
>> >   [17] org.apache.coyote.http11.Http11Processor.service
>> >   [18] org.apache.coyote.AbstractProcessorLight.process
>> >   [19] org.apache.coyote.AbstractProtocol$ConnectionHandler.process
>> >   [20] org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun
>> >   [21] org.apache.tomcat.util.net.SocketProcessorBase.run
>> >   [22] java.util.concurrent.ThreadPoolExecutor.runWorker
>> >   [23] java.util.concurrent.ThreadPoolExecutor$Worker.run
>> >   [24] org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run
>> >   [25] java.lang.Thread.run
>> >
>> > --- 769519847169 ns (41.03%), 314384 samples
>> >   [ 0] org.apache.tomcat.util.collections.SynchronizedStack
>> >   [ 1] org.apache.tomcat.util.collections.SynchronizedStack.pop
>> >   [ 2] org.apache.tomcat.util.net.NioBlockingSelector.write
>> >   [ 3] org.apache.tomcat.util.net.NioSelectorPool.write
>> >   [ 4] org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doWrite
>> >   [ 5] org.apache.tomcat.util.net.SocketWrapperBase.doWrite
>> >   [ 6] org.apache.tomcat.util.net.SocketWrapperBase.flushBlocking
>> >   [ 7] org.apache.tomcat.util.net.SocketWrapperBase.flush
>> >   [ 8] org.apache.coyote.http11.Http11OutputBuffer$SocketOutputBuffer.end
>> >   [ 9] org.apache.coyote.http11.filters.IdentityOutputFilter.end
>> >   [10] org.apache.coyote.http11.Http11OutputBuffer.end
>> >   [11] org.apache.coyote.http11.Http11Processor.finishResponse
>> >   ....
>> >
>> > I've made the following changes
>> > - https://github.com/apache/tomcat/compare/9.0.x...martin-g:feature/lock-free-queue-and-stack
>> > and ran the test suite - all green!
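The gist of that branch is to swap the instance-synchronized collections
for lock-free ones from java.util.concurrent. A minimal sketch of the
idea (the class name is mine; see the branch for the actual diff):

```java
import java.util.concurrent.ConcurrentLinkedDeque;

// Sketch of a lock-free stand-in for SynchronizedStack: push/pop become
// CAS operations on a ConcurrentLinkedDeque instead of contending on a
// single monitor. Illustrative only, not the code in the branch.
public class LockFreeStack<T> {

    private final ConcurrentLinkedDeque<T> stack = new ConcurrentLinkedDeque<>();

    public boolean push(T obj) {
        stack.addFirst(obj); // LIFO: newest element at the head
        return true;
    }

    public T pop() {
        return stack.pollFirst(); // null when empty, like the original
    }

    public void clear() {
        stack.clear();
    }
}
```

The trade-off is the one the SynchronizedStack javadoc warns about:
ConcurrentLinkedDeque allocates a node per push, which is exactly why I am
watching the GC logs alongside the lock profile.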
>> > Now the output from the lock profiling looks like:
>> >
>> > --- 72354435197 ns (47.98%), 4008 samples
>> >   [ 0] java.util.concurrent.locks.ReentrantLock$NonfairSync
>> >   [ 1] jdk.internal.misc.Unsafe.park
>> >   [ 2] java.util.concurrent.locks.LockSupport.park
>> >   [ 3] java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire
>> >   [ 4] java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly
>> >   [ 5] java.util.concurrent.locks.ReentrantLock$Sync.lockInterruptibly
>> >   [ 6] java.util.concurrent.locks.ReentrantLock.lockInterruptibly
>> >   [ 7] java.util.concurrent.LinkedBlockingQueue.poll
>> >   [ 8] org.apache.tomcat.util.threads.TaskQueue.poll
>> >   [ 9] org.apache.tomcat.util.threads.TaskQueue.poll
>> >   [10] java.util.concurrent.ThreadPoolExecutor.getTask
>> >   [11] java.util.concurrent.ThreadPoolExecutor.runWorker
>> >   [12] java.util.concurrent.ThreadPoolExecutor$Worker.run
>> >   [13] org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run
>> >   [14] java.lang.Thread.run
>> >
>> > --- 58413748017 ns (38.74%), 10179 samples
>> >   [ 0] java.lang.Object
>> >   [ 1] org.apache.catalina.core.StandardContext$ContextFilterMaps.asArray
>> >   [ 2] org.apache.catalina.core.StandardContext.findFilterMaps
>> >   [ 3] org.apache.catalina.core.ApplicationFilterFactory.createFilterChain
>> >   [ 4] org.apache.catalina.core.StandardWrapperValve.invoke
>> >   [ 5] org.apache.catalina.core.StandardContextValve.invoke
>> >   [ 6] org.apache.catalina.authenticator.AuthenticatorBase.invoke
>> >   [ 7] org.apache.catalina.core.StandardHostValve.invoke
>> >   [ 8] org.apache.catalina.valves.ErrorReportValve.invoke
>> >   [ 9] org.apache.catalina.core.StandardEngineValve.invoke
>> >   [10] org.apache.catalina.connector.CoyoteAdapter.service
>> >   [11] org.apache.coyote.http11.Http11Processor.service
>> >   [12] org.apache.coyote.AbstractProcessorLight.process
>> >   [13] org.apache.coyote.AbstractProtocol$ConnectionHandler.process
>> >   [14] org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun
>> >   [15] org.apache.tomcat.util.net.SocketProcessorBase.run
>> >   [16] java.util.concurrent.ThreadPoolExecutor.runWorker
>> >   [17] java.util.concurrent.ThreadPoolExecutor$Worker.run
>> >   [18] org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run
>> >   [19] java.lang.Thread.run
>> >
>> > --- 5856040617 ns (3.88%), 11395 samples
>> >   [ 0] org.apache.tomcat.util.collections.SynchronizedQueue
>> >   [ 1] org.apache.tomcat.util.collections.SynchronizedQueue.offer
>> >   [ 2] org.apache.tomcat.util.net.NioEndpoint$Poller.addEvent
>> >   [ 3] org.apache.tomcat.util.net.NioEndpoint$Poller.add
>> >   [ 4] org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.registerReadInterest
>> >   [ 5] org.apache.coyote.AbstractProtocol$ConnectionHandler.process
>> >   ...
>> >
>> > I monitor the GC both with -Xloggc:gc.log and with JMX, and the
>> > patterns are the same - no new spikes.
>> > JDK 14 with G1GC.
>> >
>> > I load the application with wrk (https://github.com/wg/wrk/) and it
>> > makes 70-80K requests per second.
>> >
>> > I guess there is a reason to use these custom synchronized data
>> > structures, but so far I don't see any drawbacks after moving to the
>> > lock-free implementations.
>> >
>> > Regards,
>> > Martin

>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
>> For additional commands, e-mail: dev-h...@tomcat.apache.org