I definitely feel your pain.
I ran into the same issue on our MySQL database once. After an
application malfunction we had massive numbers of tickets being
generated by a service for every user: they had implemented some
clever redirection loop, plus an issue where a new ticket was created
for every page impression.
The bug was introduced right in the peak period at the start of the
semester and caused something like 500 tickets/second to be generated.
Once the cleaner started running after our 2h expiry time, the CAS
server tanked with an OOM. After every restart the server was pretty
much stuck, and the only quick way to get up and running again was to
drop all ticket tables and be done with it.
Increasing the memory and shortening the cleaner period were added as
a temporary fix to get back up and running, albeit with poor
performance due to the DoS-like situation at that point. These changes
limited the number of tickets to be cleaned per run and at least
achieved a stable but slow service. That held us over until the rogue
app was fixed, and it served us well in a similar incident later.
This was all before the CAS throttling feature was introduced...
Joachim
On 03.01.2012 15:40, Marvin Addison wrote:
The subject is intentionally provocative and based at least in part
on the production headaches it caused me over a holiday weekend
around 5AM. I'd like to provide a brief overview of the problem and
resolution steps, since it may help others evaluate JpaTicketRegistry
when considering a ticket storage backend.
Around 0500 I got a call from our NOC that CAS was unavailable, which
in this case meant the /login URI was throwing HTTP 500s. This of
course meant that CAS was entirely unusable. I confirmed the issue,
then started a shell session on both hosts. Top and a quick log review
suggested both nodes were OOM, and the logs also pointed to the root
cause: an attempt to clean up a massive number of tickets.
Recall that the effect of RegistryCleaner running on a
JpaTicketRegistry is to buffer _all_ tickets into memory in order to
perform cleanup. I queried the database and confirmed there was an
unusually high number of tickets in the registry, which indicated that
I had to clean up tickets in order to triage the problem. I
temporarily disabled the cleaner trigger that drives
RegistryCleaner#clean() and redeployed CAS to get it back online, then
went about the work of cleaning up tickets.
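For anyone wanting to run the same sanity check, a simple count
against the TGT table is enough to see the backlog; $DATE below is a
placeholder for the expiration cutoff, using the same convention as
the cleanup queries further down:

  -- Quick sanity check: how many TGTs are older than the expiry cutoff?
  -- $DATE is a placeholder for the cutoff timestamp.
  select count(*)
    from ticketgrantingticket
   where to_timestamp(creation_time/1000) < $DATE;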
Due to the self-referential nature of TGTs (a PGT is simply a TGT that
points to a parent TGT), this is tedious, if not impossible, to do
with manual queries. Thankfully, in our case we have exclusively proxy
tickets of chain length one, and the following two queries (on
PostgreSQL), issued sequentially, will suffice:
delete from ticketgrantingticket
  where to_timestamp(creation_time/1000) < $DATE
    and ticketgrantingticket_id is not null;

delete from ticketgrantingticket
  where to_timestamp(creation_time/1000) < $DATE;
This cleans up all children before the parents and respects FK
constraints. This approach would not work with more complex proxy
chains. The only way to handle this situation generally with manual
queries would be to cascade deletes to child records, which is
fortunately possible on our platform (PostgreSQL) via the ON DELETE
CASCADE clause on the foreign keys. Unfortunately, Hibernate schema
creation does not specify this clause, so it would need to be added
manually. Tragically, making constraint changes on PostgreSQL
tables requires an exclusive table lock, which is simply not viable
for active production systems.
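For completeness, here is a rough sketch of what that manual change
would look like; the constraint name and primary key column are
assumptions and would need to be checked against the actual
Hibernate-generated schema (e.g. \d ticketgrantingticket in psql)
before running anything:

  -- Replace the self-referencing FK on ticketgrantingticket with one
  -- that cascades deletes from a parent TGT to its child PGTs.
  -- "fk_tgt_parent" and the "id" PK column are assumed names.
  alter table ticketgrantingticket
    drop constraint fk_tgt_parent;
  alter table ticketgrantingticket
    add constraint fk_tgt_parent
      foreign key (ticketgrantingticket_id)
      references ticketgrantingticket (id)
      on delete cascade;

With that in place, a single delete of expired parents would take
whole proxy chains with it, but as noted above the ALTER TABLE itself
needs an exclusive lock, so it is only realistic in a maintenance
window.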
It's worth briefly discussing what caused the large number of expired
tickets at the root of this incident. PostgreSQL implements BLOBs via
a custom data type called a large object (lo), where columns of the
SQL LOB type are simply references to the lo objects (they contain an
integer handle to the lo). Since they are references, you can get into
two situations:
- Orphaned large objects (the vacuumlo tool and triggers alleviate
this situation; see the sketch after this list)
- References to large objects that no longer exist
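As an aside, the first situation is what the PostgreSQL contrib lo
module is meant to address: after installing it, a trigger along these
lines keeps orphaned large objects from accumulating. The column name
here is an assumption about the Hibernate-generated CAS schema:

  -- Assumption: services_granted_access_to is the lo-backed (oid)
  -- column holding the serialized ticket state; substitute the real
  -- column name. Requires the contrib lo module (CREATE EXTENSION lo
  -- on 9.1+, or the lo.sql contrib script on older releases).
  create trigger t_tgt_lo_manage
    before update or delete on ticketgrantingticket
    for each row execute procedure lo_manage(services_granted_access_to);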
For some unknown reason, large objects are getting removed while
records still exist that reference them. Any attempt to load a
non-existent lo causes a SQLException on the Java side. These
exceptions tank the entire RegistryCleaner#clean() cycle, and
apparently they were happening often and early enough that cleanup was
effectively not happening. Logjam ensued.
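In case it helps anyone diagnose the same failure, a query along these
lines will surface rows whose lo reference is dangling; the blob and
primary key column names are again assumptions, and
pg_largeobject_metadata requires 9.0+ (on older releases use
select distinct loid from pg_largeobject instead):

  -- Find TGT rows pointing at a large object that no longer exists.
  -- "id" and "services_granted_access_to" are assumed column names.
  select id
    from ticketgrantingticket
   where services_granted_access_to is not null
     and services_granted_access_to not in
         (select oid from pg_largeobject_metadata);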
I have spent significant development time on JpaTicketRegistry and
related components, and on tuning our production CAS servers on two
different database platforms (Oracle and PostgreSQL). So I'm invested
in the approach, but I believe this recent incident is the last straw.
There are fundamental problems with JpaTicketRegistry and it will
take a fairly broad redesign of the TicketRegistry API to resolve them
adequately. I believe the use of the factory pattern that Scott has
explored in the feature-cas4api branch is at least on the right track,
but those are big changes that we simply can't wait for. Sure we
could fix some of the problems now and work around others, but I'm
coming to see that a database is not the best storage back end for our
needs. (If you're using the "Remember Me" feature, it starts to look
a lot more attractive. We don't, and the very durability of
database-backed tickets is a liability.)
M