I didn't fill in *any* of my footnotes...

[1] https://github.com/mrvisser/puppet-oae-example/tree/oae-loadtesting-config
[2] http://munin-monitoring.org/
[3] http://yourkit.com/features/index.jsp
[4] https://github.com/mrvisser/OAE-model-loader/tree/performance-testing
[5] https://oae-community.sakaiproject.org/content#p=lb5CsAfmg/Performance.pdf
On Tue, Aug 7, 2012 at 9:15 PM, Branden Visser <mrvis...@gmail.com> wrote:
> Hi everyone,
>
> I've been a busy bee working on getting a performance testing cluster ready for regular performance tests. I'm happy to say that there is a successful-looking load test running right now, though there is more work to do on the tests themselves and the automation.
>
> Here is a summary of next steps that are on my radar:
>
> * Collection and organization of all the information available (as noted under PROFILING / MONITORING), dumping it into a date/time-named directory on the load-driver machine for web access
> * Orchestration of cron jobs (or just a synchronous script) to re-execute tests on a regular basis
> * Putting together an automated Perf4J-enabled Jenkins build that can be used as the performance-testing platform for API execution / timing data
> * Ongoing tuning of the performance test scripts
> * Incrementally adding data to the performance testing environment as needed using the model loader
> * Looking at "Tsung Plotter" to generate comparison graphs of metrics over multiple runs
>
> Here is the Tsung data being generated right now: http://oae-loader.sakaiproject.org/20120807-2224/report.html
> Here are some pointers on how to read it: http://tsung.erlang-projects.org/user_manual.html#htoc70
>
> I've provided a detailed update and "practice documentation" below on the different facets that pull the testing together. As always, questions and suggestions are welcome.
>
> ENVIRONMENT
>
> Special thanks to Kyle and Erik with rSmart for all their help getting an Amazon cluster spinning for us. Our cluster is running like a champ. Configuration and server deployment are fully managed with puppet [1], and puppet rocks. This is what the cluster looks like:
>
> 2x App server node (one is sleeping for now)
> 1x Postgres node
> 1x Solr node
> 1x Preview processor node
> 1x Apache node
> ------
> 1x Load-test driver node
>
> The load-test driver node is the client machine that actually runs the load tests, so it is not a server component.
>
> PROFILING / MONITORING
>
> Operating system resources are being monitored and recorded using Munin [2]. Munin is a popular open-source tool that records more OS metrics (plus auxiliary things like Apache processes) than I knew existed. More importantly, Tsung integrates with it quite well to pull synchronized OS metrics (CPU, load, memory) from *all* cluster nodes during the performance test, so those metrics become an artifact of the Tsung load-testing reports.
>
> YourKit [3] has been worked into the puppet scripts so that it is painless to enable for a more in-depth look at the app server (see the sketch below). If enabled, YourKit runs as a Java agent on the app server and dumps a snapshot when the JVM is shut down -- presumably after the performance run. This gives us point-in-time analysis of CPU and thread telemetry throughout the performance test. We can enable object allocation analysis as well, if needed. YourKit is not enabled by default because it may introduce performance degradation during the tests.
>
> At the moment, OS resources are the only server profiling metrics being packaged with every performance test.
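As a rough illustration of the YourKit piece above: the agent path, snapshot directory, variable names and agent options below are assumptions for the sake of the sketch, not the actual puppet-managed configuration. The idea is simply to append an -agentpath entry to the app server's JVM options when profiling is enabled:

    # Sketch only: paths, variable names and agent options are placeholders.
    ENABLE_YOURKIT=true
    YOURKIT_AGENT=/opt/yourkit/bin/linux-x86-64/libyjpagent.so

    if [ "$ENABLE_YOURKIT" = "true" ]; then
      # Start CPU sampling and dump a snapshot into the given directory
      # when the JVM exits, i.e. after the performance run.
      JAVA_OPTS="$JAVA_OPTS -agentpath:${YOURKIT_AGENT}=sampling,onexit=snapshot,dir=/var/log/oae/yourkit"
    fi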
> Immediate plans are to automate the packaging of the following, which are already being recorded regularly on the servers:
>
> * Verbose GC logs from the app server node and Solr
> * The full app server log
> * The Postgres slow-query log
> * A snapshot of the Solr admin console to see Solr cache usage
> * A snapshot of the /system/telemetry page
> * Perf4J logs. These will provide (unsynchronized) time-series API execution timings as well as additional telemetry throughout the test.
> * Maybe a full Munin cluster snapshot, though it is probably not granular enough to be useful.
>
> LOAD TESTING RESULTS
>
> Getting the client-side load-testing results with Tsung is a no-brainer. It generates the static HTML pages that I'm currently hosting here: http://oae-loader.sakaiproject.org/ -- I will be restructuring these directories as I automate the collection of more information (i.e., logs). But if you go into any directory and click "report.html", that will get you started.
>
> The most valuable information I've found so far in the Tsung reports:
>
> Transactions. If we hold to a convention of separating each user-facing action (e.g., "View Contacts") into a uniquely named transaction when writing tests, this gives a good view into how well those individual actions perform. We will need to edit a couple of test cases to be consistent on the transaction id so we get more accurate data -- for example, the highest 10-second mean of "tr_login" is over a minute while the smallest is less than a second, because there are multiple transactions keyed with "login" and some perform many more requests than the others. Also, the "Count" column gives a good indication of action distribution across user sessions and can help guide balancing the tests properly.
>
> Load. Simply the load on each cluster member throughout the load test.
>
> HTTP Status Code. The count of each HTTP status code, as a time series across the duration of the test.
>
> DATA LOADING
>
> For regular performance tests, snapshots have been taken of previously loaded data that can be dumped back in, which is much faster than using the model builder. This has already been automated and can be run nightly in preparation for load tests.
>
> Data is loaded into the server using the OAE-model-loader [4]. I've recently submitted a PR which allows us to incrementally load data batches on top of one another to step up the content when needed. The test currently running uses 500 users and 250 worlds. I can load in another batch of 500/250 and take a snapshot. Then another. Additionally, the data that feeds the load tests (i.e., the CSV files) is easily generated from the OAE-model-loader source scripts, and that can be done incrementally as well, so we can performance-test each increment. Here is an example workflow (see the wrapper sketch below):
>
> 1. Use generate.js to generate 10 batches of data, 500 users and 900 worlds per batch.
> --
> 2. Load 1 batch of data into OAE: node loaddata.js -b 1
> 3. Assemble the package of load-testing CSV data that goes along with that data: node performance-testing/package.js -b 1 -s .
> 4. Run a performance test with the CSV data
> --
> 5. Load a second batch of data into OAE: node loaddata.js -s 1 -b 2
> 6. Assemble the package of load-testing CSV data that goes along with what is currently in the OAE environment: node performance-testing/package.js -b 2 -s .
> 7. Run a performance test with the new CSV data
>
> You get the picture.
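To make that workflow concrete, here is a rough shell wrapper around the steps above. The directory layout and the tsung.xml path are assumptions, the 10 batches are assumed to have been generated with generate.js already, and generalizing the -s/-b pattern beyond batch 2 is a guess; only the loaddata.js and package.js invocations for batches 1 and 2 come directly from the steps quoted above.

    #!/bin/bash
    # Sketch of the incremental load/package/test loop described above.
    # Assumes this runs from the OAE-model-loader checkout and that
    # /etc/tsung/tsung.xml points at the generated CSV files.
    set -e

    BATCHES=10

    for b in $(seq 1 "$BATCHES"); do
      if [ "$b" -eq 1 ]; then
        # Step 2 above: load the first batch into OAE.
        node loaddata.js -b 1
      else
        # Step 5 above, generalized: skip the batches already loaded and
        # load the next one (the -s/-b pattern beyond batch 2 is a guess).
        node loaddata.js -s "$((b - 1))" -b "$b"
      fi

      # Steps 3/6: assemble the CSV package matching what is now in OAE.
      node performance-testing/package.js -b "$b" -s .

      # Steps 4/7: run the Tsung test against the stepped-up data set.
      tsung -f /etc/tsung/tsung.xml start
    done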
>
> The plan is to incrementally add content until the data set is comparable to the reference environment [5]; loading it all at once was unstable.
>
> AUTOMATION
>
> This is where I'm at now that I'm actually spinning tests. What is currently automated:
>
> * Purging/restoring data and bouncing the servers for a new test. I imagine this can be a cron job that executes about 15 minutes before the Tsung cron job is kicked off.
> * Publication of Tsung results. This works only because I'm simply pointing Apache at the Tsung performance results directory.
>
> Next steps for automation:
>
> * Collection and organization of all the information available (as noted under PROFILING / MONITORING), dumping it into a date/time-named directory on the load-driver machine
> * Orchestration of cron jobs (or just a synchronous script) to re-execute tests on a regular basis (a rough sketch follows at the bottom of this message)
> * Putting together an automated Perf4J-enabled Jenkins build that can be used as the performance-testing platform for API execution / timing data
>
> Thanks for reading! :)
>
> --
> Cheers,
> Branden
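And for the automation piece, a very rough sketch of what the "synchronous script" option could look like. Every helper script name and path below is a placeholder (the real restore/bounce logic lives in puppet and cron); only the ordering -- restore data, bounce servers, wait roughly 15 minutes, run Tsung, then collect the results into a date/time-named directory -- comes from the notes above.

    #!/bin/bash
    # Placeholder names throughout; this only illustrates the ordering.
    set -e

    # Restore the previously snapshotted data and bounce the servers
    # (hypothetical helper scripts).
    /opt/oae-perf/restore-snapshot.sh
    /opt/oae-perf/bounce-servers.sh

    # Give the cluster roughly 15 minutes to settle, mirroring the gap
    # between the restore cron job and the Tsung cron job described above.
    sleep 900

    # Kick off the load test.
    tsung -f /etc/tsung/tsung.xml start

    # Copy the Tsung logs into a date/time-named directory served by
    # Apache on the load driver (destination path is an assumption).
    RUN_DIR="/var/www/oae-loader/$(date +%Y%m%d-%H%M)"
    mkdir -p "$RUN_DIR"
    cp -r ~/.tsung/log/* "$RUN_DIR/"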