Hi everyone,

I've been a busy bee working on getting a performance testing cluster
ready for regular performance tests. I'm happy to say that there is a
successful-looking load-test running right now, though there is more
work to do on the tests themselves and the automation.

Here is a summary of next steps that are on my radar:

* Collecting and organizing all available information (as noted under
PROFILING / MONITORING) and dumping it to a date/time-named directory
on the load-driver machine for web access
* Orchestrating cron jobs (or just putting together a synchronous
script) to re-execute tests on a regular basis
* Putting together an automated Perf4J-enabled Jenkins build that can
be used as the performance-testing platform for API execution / timing
data
* Ongoing tuning of the performance test scripts
* Incrementally adding data to the performance-testing environment as
needed using the model loader
* Looking at using "Tsung Plotter" to generate comparison graphs of
metrics over multiple runs (a sketch follows this list)
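
For the last item, here is roughly what a comparison could look like
with tsplot, the plotting script that ships with Tsung (the exact
invocation is from memory, so check tsplot --help; the run names and
directories are made up):

    # Overlay the metrics from two runs on the same set of graphs
    tsplot "run-a" run-a/tsung.log "run-b" run-b/tsung.log -d comparison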

Here is the Tsung data being generated right now:
http://oae-loader.sakaiproject.org/20120807-2224/report.html
Here are some pointers on how to read it:
http://tsung.erlang-projects.org/user_manual.html#htoc70

I've provided a detailed update and "practice documentation" below on
the different facets that pull the testing together. As always,
questions and suggestions are welcome.

ENVIRONMENT

Special thanks to Kyle and Erik at rSmart for all their help getting
an Amazon cluster spinning for us. Our cluster is running like a
champ. Configuration and server deployment are fully managed with
Puppet [1], and Puppet rocks. This is what the cluster looks like:

2x App server node (one is sleeping for now)
1x Postgres node
1x Solr node
1x Preview processor node
1x Apache node
------
1x Load-test driver node

The load-test driver node is the client machine that actually runs the
load tests, so it is not a server component.

PROFILING / MONITORING

Operating system resources are being monitored and recorded using
Munin [2]. Munin is a popular open-source tool that can record more OS
metrics (plus auxiliary things like Apache processes) than I knew
existed. More importantly, Tsung integrates with it quite well to pull
synchronized OS metrics (CPU, load, memory) from *all* cluster nodes
during the performance test. Those metrics now become an artifact of
the Tsung load-testing reports.
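
For reference, this integration is driven by the <monitoring> section
of the Tsung configuration file. A minimal sketch (the host names here
are placeholders, not our actual node names):

    <monitoring>
      <monitor host="app0" type="munin"/>
      <monitor host="postgres0" type="munin"/>
      <monitor host="solr0" type="munin"/>
    </monitoring>

Tsung polls the munin-node daemon on each listed host during the run
and folds the CPU / load / memory series into its report.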

YourKit [3] has been worked into the Puppet scripts, such that it is
painless to enable it for a more in-depth look at the app server. If
enabled, YourKit runs as a Java agent on the app server and dumps a
snapshot when the JVM is shut down -- presumably after the performance
run. This gives us point-in-time analysis of CPU and thread telemetry
throughout the performance test. We can enable object allocation
analysis as well, if needed. YourKit is not enabled by default because
it may degrade performance during the tests.
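
For the curious, enabling it boils down to adding the agent to the app
server's JVM options, roughly like this (the install path is a
placeholder, and the exact option names should be double-checked
against the YourKit docs):

    # Attach YourKit in CPU-sampling mode, dumping a snapshot on JVM exit
    JAVA_OPTS="$JAVA_OPTS -agentpath:/opt/yourkit/bin/linux-x86-64/libyjpagent.so=sampling,onexit=snapshot"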

Right now, OS resources are the only server-side profiling metrics
being packaged with every performance test. The immediate plan is to
automate packaging of the following, which are already being recorded
regularly on the servers (a rough collection sketch follows the list):

* Verbose GC logs from the app server and Solr nodes
* The full app server log
* The Postgres slow-query log
* A snapshot of the Solr admin console to see Solr cache usage
* A snapshot of the /system/telemetry page
* Perf4J logs. These will provide (unsynchronized) time-series API
execution timings as well as additional telemetry throughout the test.
* Maybe a full Munin cluster snapshot, though it is probably not
granular enough to be useful.
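
To give an idea of the direction, here is a rough sketch of what that
collection step could look like (all host names and paths are
hypothetical placeholders, not the real locations on our nodes):

    #!/bin/bash
    # Gather server-side artifacts into a date/time-named directory
    # that Apache already serves from the load-driver machine
    RUN_DIR=/var/www/html/$(date +%Y%m%d-%H%M)
    mkdir -p "$RUN_DIR"

    # Pull the logs from the relevant nodes
    scp app0:/opt/oae/logs/gc.log     "$RUN_DIR/app-gc.log"
    scp app0:/opt/oae/logs/server.log "$RUN_DIR/app-server.log"
    scp app0:/opt/oae/logs/perf4j.log "$RUN_DIR/perf4j.log"
    scp solr0:/opt/solr/logs/gc.log   "$RUN_DIR/solr-gc.log"
    scp postgres0:/var/log/postgresql/slow.log "$RUN_DIR/pg-slow.log"

    # Snapshot the Solr admin console and the telemetry page
    curl -s http://solr0:8983/solr/admin/stats.jsp > "$RUN_DIR/solr-stats.html"
    curl -s http://app0:8080/system/telemetry > "$RUN_DIR/telemetry.html"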

LOAD TESTING RESULTS

Getting the client-side load-testing results with Tsung is a
no-brainer. It generates the static HTML pages that I'm currently
hosting here: http://oae-loader.sakaiproject.org/  - I will be
restructuring these directories as I automate the collection of more
information (e.g., logs). But if you go into any directory and click
"report.html", that will get you started.

The most valuable information I've found so far in the Tsung reports:

Transactions. If we adopt a convention of separating each user-facing
action (e.g., "View Contacts") with a unique transaction name when
writing tests, this provides a good view into how well those
individual actions perform. We will need to edit a couple of test
cases to be consistent about the "transaction id" so we can get more
accurate data -- for example, the highest 10sec mean of "tr_login" is
over a minute while the smallest is less than a second, because there
are multiple transactions keyed with "login", some of which perform
many more requests than the others. Also, the "Count" column gives a
good indication of action distribution across user sessions and can
help guide balancing the tests properly.
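
For example, a uniquely named transaction in a Tsung session file has
this shape (the name and URLs are illustrative, not taken from our
actual tests):

    <transaction name="tr_view_contacts">
      <request><http url="/system/me" method="GET"/></request>
      <request><http url="/var/contacts/all.json" method="GET"/></request>
    </transaction>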

Load. Simply the load on each cluster member throughout the load test.

HTTP status codes. The count of each HTTP status code, as a time
series across the duration of the test.

DATA LOADING

For regular performance tests, snapshots have been taken of
previously-loaded data that can be dumped back in, which is much
faster than re-running the model loader. This has already been
automated and can be run nightly in preparation for load tests (a
sketch of the idea is below).
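
Conceptually, the snapshot/restore is nothing fancy. A sketch with
hypothetical database and file names (the real automation already
exists; this is just to illustrate):

    # Snapshot: dump the Postgres database after a batch is loaded
    pg_dump -U oae -Fc oae > /snapshots/oae-batch1.dump

    # Restore: reset to that snapshot before the next test run
    pg_restore -U oae --clean --dbname=oae /snapshots/oae-batch1.dump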

Data is loaded into the server using the OAE-model-loader [4]. I've
recently submitted a pull request which allows us to incrementally
load data batches on top of one another, to step up the content when
needed. The test currently running uses 500 users and 250 worlds. I
can load in another batch of 500/250 and take a snapshot. Then
another. Additionally, the data that feeds the load tests (i.e., the
CSV files) is easily generated from the OAE-model-loader source
scripts, and that can be done incrementally as well, so we can
performance-test each increment. Here is an example workflow:

1. Use generate.js to generate 10 batches of data, with 500 users and
900 worlds per batch.
--
2. Load 1 batch of data into OAE: node loaddata.js -b 1
3. Assemble the package of load-testing CSV data that goes along with
that data: node performance-testing/package.js -b 1 -s .
4. Run performance test with CSV data
--
5. Load a second batch of data into OAE: node loaddata.js -s 1 -b 2
6. Assemble the package of load-testing CSV data that goes along with
what is currently in the OAE environment: node
performance-testing/package.js -b 2 -s .
7. Run performance test with new CSV data

You get the picture.
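
In script form, the stepped workflow could eventually look like this
(a sketch; it assumes -s is the batch offset as in the examples above,
and the Tsung config name is a placeholder):

    #!/bin/bash
    # Step the environment up one batch at a time, testing each increment
    for b in 1 2 3; do
      node loaddata.js -s $((b - 1)) -b "$b"            # load the next batch
      node performance-testing/package.js -b "$b" -s .  # build matching CSVs
      tsung -f tsung-oae.xml start                      # run the test
    done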

The plan is to incrementally add content until we have a data set
comparable to the reference environment [5]; loading it all at once
proved unstable.

AUTOMATION

This is where I'm at, now that I'm actually running tests. What is
currently automated:

* Purging/restoring data and bouncing the servers for a new test. I
imagine this can be a cron job that executes about 15 minutes before
the Tsung cron job is kicked off (see the crontab sketch below).
* Publication of Tsung results. This works simply because I'm pointing
Apache at the Tsung performance results directory.
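
As a concrete example, the orchestration could be as simple as two
crontab entries (the times and script name are made up):

    # m   h   dom mon dow  command
    45    1   *   *   *    /opt/perf/reset-environment.sh
    0     2   *   *   *    tsung -f /opt/perf/tsung-oae.xml start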

Next steps for automation:

* Collecting and organizing all available information (as noted under
PROFILING / MONITORING) and dumping it to a date/time-named directory
on the load-driver machine.
* Orchestrating cron jobs (or just putting together a synchronous
script) to re-execute tests on a regular basis
* Putting together an automated Perf4J-enabled Jenkins build that can
be used as the performance-testing platform for API execution / timing
data

Thanks for reading! :)

-- 
Cheers,
Branden