Hi all,

Our company currently runs RT for customer support interactions as well as a central email abuse reporting system for customer IP blocks. Recently we setup a feedback system with a large hosted mail provider and we saw the level of incoming abuse/spam reports increase to 10s of thousands a day. I have been trying to identify the source of an issue that essentially boils down to this: When our RT queues are 'large' (over 100K tickets) the UI struggles to complete operations or consumes all system resources.

To mitigate the issues, we have been using rt-shredder to cull out the excess, but I have a backed up DB to test with. What I have found is that on a particular type of search, the returned DB data set is so large the apache process handling the request consumes almost all available memory on the RT host, leading to swapping and/or a nasty segfault.

Our setup involves three hosts:

1 dedicated Gentoo based DB host running MySQL 5.0 with innodb based tables. 2G ram and 1 64bit quad core xeon running under VMWare vSphere 4. 2 load balanced Gentoo based apache servers running RT3.8.2 with the same proc/cpu specs as the DB host.

The magic search that overloads apache works as follows:

1) Click on our large queue from RT at a glance Quick Search. The queue in question contains 184744 new, 7 open and 7731 stalled tickets in my dumped database.
2) Click on any ticket on any of the returned pages.

Apache then consumes so much memory that we have to kill the process at best or restart the server at worst. In a browser, the ticket page often fails to load or may be partially completed before the host resources are exhausted.

The MySQL query also reveals that the last operation in this state returns a large chunk of data, and often pops up in the slow query log with an average execution time of 15 seconds. My first thought was that we had an issue with out database. However several days of testing indicated that this problem was directly related to apache/mod_perl having to drink from a firehose.

Here's the entry that always logs to the slow query log:

# Time: 100419 16:58:07
# u...@host: rtadmin[rtadmin] @ rt-test[172.20.0.99]
# Query_time: 15  Lock_time: 0  Rows_sent: 184751  Rows_examined: 377948
use rt3;
SELECT main.* FROM Tickets main WHERE (main.Status != 'deleted') AND (main.Queue = '11' AND ( main.Status = 'new' OR main.Status = 'open' ) ) AND (main.EffectiveId = main.id) AND (main.Type = 'ticket') ORDER BY main.id ASC;

Interestingly, searching for the specific ticket via the main page search box brings up the typed in ticket quickly and without incident. Another tidbit, is this appears to involve some level of caching. If I follow the above steps, then kill the process and finally select another ticket NOT in the large queue (one off my own top 15 tickets) then the same behavior is observed. Also, I see queries in the MySQL query log that include data related to the previous search. I have performed a battery of tests stopping daemons, clearing mason cache, clearing browser cache and the like to figure all this out.

The one detail about our setup that I suspect plays a part here is that a previous admin wrote a series of email handling scripts that always re-writes the sender address before handing the email off to rt-mailgate. We suspected at one point that part of the issue was related to the query that looks up other tickets created by the sender. An 'explain' in MySQL did show that the volume of data was forcing an on disk temp table and filesort, but I haven't directly correlated a slow DB operation to the consumption of memory on the apache side. That email handling apparatus is currently being replaced, but does having the same 'Creator' on 100K + tickets sound like a really bad thing, or is this normal for large shops?

Can I get some feedback on how our system compares to others using RT? How many tickets do you collect in a day? What rough system specs are you running on? Is this normal for large volumes of tickets? Is the only answer ever more RAM?

Also, I updated the test rig to 3.8.7 from 3.8.2 today including all DB upgrade operations with no change.

Any assistance would be very much appreciated. Our current game plan is to build an archiving system to keep the queue numbers down, but at some point large queues may be the norm for us. Thanks.

-Nick Geron

Discover RT's hidden secrets with RT Essentials from O'Reilly Media.
Buy a copy at http://rtbook.bestpractical.com

Reply via email to