It seems that the conversation about scalability is proceeding rather well.
I opened with some very strong statements about how certain appservers
would have trouble scaling on certain databases. The conversation has
proceeded on its own, and a number of the facts that led to my original
statement have been corroborated independently. Let's review those facts.
1) Some products do not support a "tuned" and "verified" update mechanism
in their CMP engines, which is needed to support optimistic concurrency in
the appserver (a rough sketch of what a "verified" update looks like follows
this list).
2) In the absence of appserver-based OC, the appserver must depend on
the database to provide OC. By default, Oracle does not do so (that is, it
lets "stomping" updates through).
3) If you want to avoid "stomping" updates in Oracle, you have to use the
serializable isolation level instead of the default (read committed) isolation
level. This comes with some performance overhead, which has yet to be
quantified.
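To make point 1 concrete, here is a minimal sketch (plain JDBC, against a
hypothetical ACCOUNT table with a VERSION column; none of these names come
from any particular product) of what a "verified" update amounts to. The UPDATE
re-checks the version that was read, and zero rows updated means another
transaction got there first, i.e. an OC collision the appserver can detect itself:

    import java.math.BigDecimal;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class VerifiedUpdate {
        // Returns false when another transaction changed the row since we
        // read it -- an optimistic concurrency collision the appserver can
        // handle (roll back, report, retry) without database-level OC.
        static boolean verifiedUpdate(Connection conn, long id,
                                      BigDecimal newBalance,
                                      long versionReadEarlier)
                throws SQLException {
            PreparedStatement ps = conn.prepareStatement(
                "UPDATE ACCOUNT SET BALANCE = ?, VERSION = VERSION + 1 " +
                "WHERE ID = ? AND VERSION = ?");
            ps.setBigDecimal(1, newBalance);
            ps.setLong(2, id);
            ps.setLong(3, versionReadEarlier);
            return ps.executeUpdate() == 1;   // 0 rows: somebody else won
        }
    }

The point is that the collision is caught by the appserver's own SQL, without
any help from the database's isolation level.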
Now, let's continue with a thought experiment.
Let's consider, in the abstract, an EJB compliant application using CMP
entity beans. Such an application will have a finite number of tables, and
each table will have a finite size. Transactions running in this database
will access various entities, thereby accessing various rows in various
tables.
Furthermore, some transactions will only read certain entities, while other
transactions will both read and write entities. That is, there will be some ratio
of reads to writes. It is common to assume a 10:1 ratio of reads to writes; the
exact numbers are application dependent.
Now, let's assume we are running our application on an appserver that
does not support OC directly. That is, it relies on the database to provide
OC. This means that the database will be configured to use the serializable
isolation level. Furthermore, it means that if an attempt is made to commit a
"stomping" update, the transaction will be rolled back automatically by the
database.
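For concreteness, relying on the database looks roughly like this from JDBC
(the connection details are made up, and exactly when Oracle reports the
conflict can vary, so take this as a sketch rather than a recipe):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class DatabaseProvidedOC {
        public static void main(String[] args) throws SQLException {
            // Hypothetical connection details -- substitute your own.
            Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@dbhost:1521:ORCL", "scott", "tiger");
            conn.setAutoCommit(false);
            // Without this, Oracle's default (read committed) lets
            // "stomping" updates through silently.
            conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
            Statement stmt = conn.createStatement();
            try {
                stmt.executeUpdate(
                    "UPDATE ACCOUNT SET BALANCE = BALANCE - 100 WHERE ID = 42");
                conn.commit();
            } catch (SQLException e) {
                // If another transaction has already modified the row, Oracle
                // typically raises ORA-08177 ("can't serialize access") and
                // the work has to be rolled back.
                conn.rollback();
                throw e;
            }
        }
    }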
Now that we have all the basics in place, let's assume that there is a
magic knob which increases the load on the appserver/database. This
knob throws more transactions at the system, by increasing the concurrency.
Let's consider what happens to our application as we turn up the knob.
How does the appserver/database scale?
First, we need to understand how an appserver that does not use "tuned"
updates actually performs its updates. Basically, according to the EJB spec,
the bean's state is written back to the database at the end of the transaction.
Since the appserver does not track state changes to the entity, it does not
know whether the entity was modified, and therefore has to make the
pessimistic assumption that it was. Furthermore, since the appserver does
not know which fields were modified (if any), it has to make the pessimistic
assumption that all of them were. So, at the end of the transaction, an update
is performed writing all CMP fields back to the database. An appserver that
does "tuned" updates, by contrast, can detect which fields were modified (if
any) and update only those fields (or suppress the update altogether).
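To put that in concrete terms, here is a rough sketch (using a hypothetical
CUSTOMER table; this is not any vendor's actual code generator) of the
difference at commit time. The untuned engine always emits the full
"UPDATE CUSTOMER SET NAME = ?, EMAIL = ?, BALANCE = ? WHERE ID = ?", while a
tuned engine builds the statement from whatever fields were actually dirtied,
or skips the update entirely:

    import java.util.LinkedHashSet;
    import java.util.Set;

    public class TunedUpdateSketch {
        // A tuned engine builds the SET list from the dirty fields only,
        // or suppresses the statement when nothing was modified.
        static String tunedUpdate(Set<String> dirtyFields) {
            if (dirtyFields.isEmpty()) {
                return null;                  // read-only access: no UPDATE at all
            }
            StringBuilder sql = new StringBuilder("UPDATE CUSTOMER SET ");
            int i = 0;
            for (String field : dirtyFields) {
                if (i++ > 0) {
                    sql.append(", ");
                }
                sql.append(field).append(" = ?");
            }
            sql.append(" WHERE ID = ?");
            return sql.toString();
        }

        public static void main(String[] args) {
            Set<String> dirty = new LinkedHashSet<String>();
            System.out.println(tunedUpdate(dirty));   // null: update suppressed
            dirty.add("BALANCE");
            System.out.println(tunedUpdate(dirty));   // only BALANCE written back
        }
    }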
[Note: some appservers have a proprietary extension to communicate to
the persistence manager whether an entity was modified. This thought
experiment is based on the use of EJB compliant entity beans, which by
definition do not include such proprietary hooks. Admittedly, this line
of argumentation does not hold if you are willing to litter your "container-
managed persistence" entity beans with bean-managed persistence-
tracking logic.]
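For reference, the kind of proprietary hook I am talking about usually looks
something like the following (class and method names are hypothetical, and the
usual EntityBean plumbing is omitted). The bean has to track its own dirtiness
so the container can skip the write-back, which is exactly the BMP-style
clutter mentioned above:

    // Hedged illustration of a vendor-specific "is modified" hook; the
    // container would be told (via a proprietary deployment descriptor)
    // to call isModified() before ejbStore() and skip the UPDATE when it
    // returns false.
    public class CustomerBean /* implements javax.ejb.EntityBean */ {
        public String name;              // CMP fields
        public String email;

        private transient boolean dirty;

        public void setEmail(String email) {
            this.email = email;
            this.dirty = true;           // hand-rolled modification tracking
        }

        public boolean isModified() {
            boolean wasDirty = dirty;
            dirty = false;
            return wasDirty;
        }
    }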
So, we are ready to start "cranking up" the load on our spec compliant
EJB application. At a modest throughput, things are going along swimmingly.
The optimistic assumptions about concurrency are working for us: we aren't
seeing collisions, and all our transactions are proceeding properly. Very
infrequently, there is an update "stomp", but the database is configured to
roll back those transactions, and it does so.
Now, we turn up the throughput knob a bit.
Now, we are running the application with higher concurrency, meaning that
we would expect an increase in optimistic concurrency collisions. Let's try
to quantify the likelihood of a collision. Let's say that the chance that any
two transactions, taken at random, will "collide" is X%. And to be fair, X
is very, very small. If X were large, then we would do well to use pessimistic
locking instead, to serialize the transactions properly. But for typical
applications, X is very small, which is what enables optimistic concurrency
schemes to be effective in the first place.
However, as we increase throughput, we naturally increase the likelihood
of collisions, because instead of two transactions potentially colliding,
we now have many more. In fact, the probability of collisions goes up
surprisingly quickly, as anyone who understands the math behind the
"birthday paradox" will attest.
Let's now compare the likelihood of OC collisions between an appserver
that supports tuned updates and one that does not. For the appserver that
does not support tuned updates, every entity is written back to the database
at the end of every transaction, regardless of whether the entity was modified
in the transaction or not. So, even for a read-only access to an entity, we
still write all the columns back to the table. Unfortunately, most databases
(including Oracle) don't detect that your update was "useless"; they just
perform the update regardless. They do, however, mark the row as having
been updated, and will roll back any other transaction that also updates the
row (even if that update was also a "useless" one).
On the other hand, if the CMP engine is tuning (or suppressing) the updates,
then only transactions that actually modified a particular row will issue an
update to the database.
Using our earlier assumption of a 10:1 ratio of reads to writes, we can quickly
see why this matters: a collision requires two transactions to write the same
row, and tuned updates cut the number of transactions issuing writes by roughly
a factor of ten, so the per-pair collision probability drops by roughly 10 x 10.
In other words, appservers that use tuned updates see something like a 100-fold
reduction in concurrency collisions.
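Plugging that into the same back-of-the-envelope estimate (again with made-up
numbers, and again treating pairs as independent):

    // Rough comparison: with untuned updates every transaction writes;
    // with tuned updates only ~1 in 11 does (10 reads : 1 write), so the
    // per-pair collision probability shrinks by roughly (1/11)^2 --
    // about two orders of magnitude.
    public class TunedVersusUntuned {
        static double pCollision(double x, int n) {
            double pairs = n * (n - 1) / 2.0;
            return 1.0 - Math.pow(1.0 - x, pairs);
        }

        public static void main(String[] args) {
            int n = 200;                         // concurrent transactions
            double xUntuned = 0.0001;            // per-pair chance, everyone writes
            double writeFraction = 1.0 / 11.0;   // 10 reads : 1 write
            double xTuned = xUntuned * writeFraction * writeFraction;
            System.out.println("untuned: " + pCollision(xUntuned, n));
            System.out.println("tuned:   " + pCollision(xTuned, n));
        }
    }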
So, as we turn up the throughput crank, we start to increase the frequency
of OC collisions. In one appserver, we start seeing 100 times as many
collisions as in the other appserver.
But, you may argue, those collisions are still infrequent enough to not significantly
impact overall system throughput. Well, one last consideration has to do with
"locality of reference". It is common to observe what is known as "locality of
reference" in applications. This means that there tend to be data "hotspots"; that
it, a disproportionate percentage of data access occurs in certain parts of the
data set. For example, in a stock trading application, you will see some stocks
are much "hotter" than other stocks, at the data access level (and not surprisingly
the "hotness" of the data often corresponds to the "hotness" of the underlying
stock).
This "locality of reference" throws a curve-ball at our optimistic concurrency
collision measurements. That is, if a disproportionate percentage of transactions
are accessing a subset of the data, the chance of a collision in that data set is
far higher. But things get worse.
Let's assume that we have a higher chance of collisions on certain data. As we
already discussed, if a collision occurs, the database will roll back the transaction
(which takes time). Furthermore, it is usually the case that the transaction was
serving some purpose. That is, the user wanted to run that particular transaction.
If the transaction is rolled back, the user will generally still want to run it, and
will therefore send it again. So now, due to the expected phenomenon of "locality
of reference", the hotspot gets even hotter, because (a) the database is spending
more time on these colliding transactions, since it has to roll them back, and
(b) the failed transactions are typically run again (and again and again) until
they complete.
What we have is things going from bad to worse. The data hotspot causes
collisions; the collisions cause slowdowns and retries; the retries cause further
collisions, and the whole thing feeds back on itself. In short, the application
stops scaling. If you are not lucky (and if you have not introduced some kind of
throttling), the application may even come to a screeching halt due to livelock.
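For what it's worth, the "throttling" I am alluding to is usually something like
a bounded retry with backoff on the client or appserver side; a sketch, not
anything out of a particular product:

    // Retry a failed (rolled-back) transaction a bounded number of times
    // with a growing delay, instead of hammering the hotspot in a tight loop.
    import java.util.Random;

    public class RetryWithBackoff {
        interface Tx { void run() throws Exception; }   // the unit of work

        static void runWithRetry(Tx tx, int maxAttempts) throws Exception {
            Random random = new Random();
            for (int attempt = 1; ; attempt++) {
                try {
                    tx.run();
                    return;                       // committed
                } catch (Exception collision) {
                    if (attempt >= maxAttempts) {
                        throw collision;          // give up; don't feed the livelock
                    }
                    // exponential backoff with a little jitter
                    long sleepMillis = (1L << attempt) * 50 + random.nextInt(50);
                    Thread.sleep(sleepMillis);
                }
            }
        }
    }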
Now, I appreciate that as a thought experiment, all this may be less than completely
convincing. To which I would respond:
1) This is not really a thought experiment. Although I have never tried this on a
competing product, I have run these types of tests on Borland AppServer. Since
our product lets you turn "tuned" updates on or off, it is very easy for us to see
how OC in the appserver compares to OC in the database. I did not
make this stuff up (in fact, I am not sure I could have made it up). It comes from
long and tedious experimentation.
2) It is not very hard to try this yourself. I recommend using ECperf, which is
quickly
becoming the standard J2EE performance benchmark. Try running ECperf (which
is a fully spec compliant EJB-based application) with or without tuned updates.
See how differently it performs, and see how it scales or not, depending on how
the optimistic concurrency is implemented.
-jkw