I was able to replicate this with a simpler example that does essentially nothing except hook into TS, similar to a buffering null transform in both directions. I will play with it a bit and, once some colleagues have taken a quick look to confirm I did not mess anything up, I'll file a bug so that others can take a look. Without the plugin, more threads perform better; with the plugin, fewer threads perform better. There seems to be latency in how the threads are handled.

Essentially I am seeing this when holding the load generation constant (40 connections with rapid very small GET and POST):

Default 4.2.2 config as reverse proxy: 17,000 tx/s
Same, plugin loaded: 5,000 tx/s
Same, plugin loaded, threads dropped to 4: 7,000 tx/s
Same, plugin loaded, threads dropped to 2: 10,000 tx/s

I see the reverse when the plugin is not loaded and I reduce threads:

Default 4.2.2 config as reverse proxy: 17,000 tx/s
Same, threads dropped to 4: 16,000 tx/s
Same, threads dropped to 2: 14,000 tx/s

Really no CPU usage change in any tests.

Cheers!
-B

--
Brian Rectanus

On Wed, Mar 11, 2015 at 2:41 AM, Brian Geffon <bri...@apache.org> wrote:

> I've also observed unexplained latency when it comes to transformations, I
> think it's time that we dig into this more. The reason we're observing an
> increase in latency without a corresponding increase in CPU load is because
> TS simply isn't doing anything, it appears that it's just rescheduling
> transformations in certain situations.
>
> Does anyone have cycles to investigate?
>
> On Wed, Mar 11, 2015 at 12:31 AM, Brian Rectanus <brect...@gmail.com>
> wrote:
>
> > All,
> >
> > I am looking for advice on tuning performance of a plugin. As some may
> > know, I have a plugin for trafficserver (using 4.2.2 w/hwloc) that does a
> > lot of inspection of the http traffic (github/ironbee). As such, it can
> > introduce a fair amount of latency due to what should be high CPU usage
> > parsing, normalizing and looking for various patterns in the HTTP. This is
> > what I expect to see at least, but that is not how the server is acting.
> >
> > What I am seeing:
> >
> > * Without plugin loaded I see great performance and machine basically idle
> > (4% cpu or so) - I am using Ixia's ixLoad to generate a very consistent
> > load.
> >
> > * With plugin loaded and fully configured, the machine is slightly more
> > than idle (12% cpu or so), but transaction per sec drops by 25x.
> >
> > * Default threads settings of 1.5 * cores seems to be very poor setting
> > (50x slower). Setting manually obscenely higher threads (200) works a bit
> > better, but best setting is 2 threads (1 is bad, 3 is bad, but 2 works some
> > 7-10 times faster). Using accept threads is also very poor.
> >
> > Best balance (far better than others) is:
> >
> > CONFIG proxy.config.exec_thread.autoconfig INT 0
> > CONFIG proxy.config.exec_thread.autoconfig.scale FLOAT 1.5
> > CONFIG proxy.config.exec_thread.limit INT 2
> > CONFIG proxy.config.accept_threads INT 0
> > CONFIG proxy.config.exec_thread.affinity INT 2
> > CONFIG proxy.config.task_threads INT 3
> >
> > Everything else is pretty much default - caching is disabled. The above is
> > about 15x faster (in tx/s) than the default settings.
> >
> > * Profiling (perf and Zoom profiler) with the 2 thread max setting shows
> > that two threads are active, one far more than the other
> >
> > * Profiling with the 1.5 x cores (e.g., 12 in this case as there are 8
> > cores) shows 4-5 threads active, but far less active than with the 2 core
> > max setting - most threads are always idle
> >
> > * First thought was blocking and lock contention, but there does not seem
> > to be any seen with the profiler.
> >
> > * Next thought was malloc() speed issues, so tried jemalloc (and tcmalloc)
> > which helps slightly, but not much (we use memory pools, so much is
> > pre-allocated anyhow)
> >
> > Attached a screenshot of the profiler timeline, but not sure it will come
> > through on the list. The plugin does not block, but should be using lots of
> > CPU for parsing, running regex, etc. It also uses a lot of extra RAM for
> > normalizing HTTP, etc. However I am not seeing high CPU nor am I seeing
> > high RAM usage. It is like it just cannot get CPU, but the system is idle -
> > more threads I add, the less it gets CPU as if the extra accounting is
> > getting in the way.
> >
> > * I expect high CPU utilization, but the machine is mostly idle.
> > * I expect all the cores (8 of them) to get used, but really only 1-2 are
> > somewhat used.
> > * I expect the threads to be saturated with work, but they are mostly idle.
> >
> > Any ideas why the complete lack of CPU/thread utilization?
> >
> > Any ideas what to look at?
> >
> > Any ideas what I can enable (tools I could use) to see more insight into
> > what is happening?
> >
> > Cheers!
> > -B
> >
> > --
> > Brian Rectanus
> >
>
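
For anyone comparing runs, the tx/s figures in the reply at the top of this thread can be expressed relative to the 17,000 tx/s baseline. This is only a rough sketch: the numbers are copied from that message, while the names (`BASELINE`, `with_plugin`, `without_plugin`, `summarize`) are made up for illustration, and "default" just means the thread setting was left untouched.

```python
# Throughput figures (tx/s) copied from the reply at the top of this thread.
# All identifiers here are illustrative, not part of Traffic Server.
BASELINE = 17_000  # default 4.2.2 config as reverse proxy, plugin not loaded

# Key: thread setting ("default" = left untouched); value: measured tx/s.
with_plugin = {"default": 5_000, 4: 7_000, 2: 10_000}
without_plugin = {"default": 17_000, 4: 16_000, 2: 14_000}


def summarize(results, baseline=BASELINE):
    """Return each result as a percentage of the baseline, rounded to 0.1."""
    return {threads: round(tx / baseline * 100, 1) for threads, tx in results.items()}


if __name__ == "__main__":
    for label, results in (("plugin loaded", with_plugin), ("no plugin", without_plugin)):
        for threads, pct in summarize(results).items():
            print(f"{label:14s} threads={threads!s:8s} {pct:5.1f}% of baseline")
```

Read this way, dropping to 2 threads moves the plugin case from about 29% to about 59% of baseline, while the same change without the plugin costs about 18%.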