Hi
A small report "from the trenches" after we've had a visit from someone doing a directory-attack on one of our servers, and the effect this had. I have a few MRTG graphs to show, but they are located at an image-host (no nasties - just a direct link to the images): * last 2 hours showing the effect after "fixing" the problem at ~11:55 - http://billedhost.dk/filer/1221728072cpu_load_last_2_hours.png <http://billedhost.dk/filer/1221728072cpu_load_last_2_hours.png> * last 24 hours showing the effect of the problem - http://www.billedhost.dk/filer/1221728228cpu_load_last_day.png <http://www.billedhost.dk/filer/1221728228cpu_load_last_day.png> * last week showing what the normal load usually is (with a few spikes here and there) and what the directory-attack did - http://www.billedhost.dk/filer/1221728247cpu_load_last_week.png <http://www.billedhost.dk/filer/1221728247cpu_load_last_week.png> The history: At around 17:49 Sunday evening someone began a directory-attack on one of our servers. The attack was logged like this in Apache's httpd-access.log: aadi 24.249.18.233 - - [14/Sep/2008:17:49:33 +0200] "GET /~aadi/ HTTP/1.1" 404 204 "-" "-" "-" "-" "-" "-" aaliyah 24.249.18.233 - - [14/Sep/2008:17:49:33 +0200] "GET /~aaliyah/ HTTP/1.1" 404 207 "-" "-" "-" "-" "-" "-" aaralyn 24.249.18.233 - - [14/Sep/2008:17:49:34 +0200] "GET /~aaralyn/ HTTP/1.1" 404 207 "-" "-" "-" "-" "-" "-" aaron 24.249.18.233 - - [14/Sep/2008:17:49:34 +0200] "GET /~aaron/ HTTP/1.1" 404 205 "-" "-" "-" "-" "-" "-" abba 24.249.18.233 - - [14/Sep/2008:17:49:34 +0200] "GET /~abba/ HTTP/1.1" 404 204 "-" "-" "-" "-" "-" "-" abbie 24.249.18.233 - - [14/Sep/2008:17:49:34 +0200] "GET /~abbie/ HTTP/1.1" 404 205 "-" "-" "-" "-" "-" "-" (To help understand the fields, this is the logformat we use: "%V %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\" \ \"%{Via}i\" \"%{Pragma}i\" \"%{X-Forwarded-For}i\" \ \"%{Cache-Control}i\"") In total there was ~14.500 such tries spread equally across 5 vhosts - each one running their own Resin-instance. On average the ~2900 hits pr. vhost is really nothing, and would have gone unnoticed if the CPU load hadn't begun to rise above normal on Monday morning (anything above 1 for an extended period is abnormal for this server). So Monday I began looking at what happened on the server. The individual Resin processes were normal - stacktraces of the running VM's showed nothing, and stopping/starting them did next to nothing (this is a production-server, so there was percious little I could try without disrupting services for several thousand users). The interesting thing was that, according to the OS (RHEL5.1 x64), it was Apache who was responsible for the high load - not Resin. Stopping the main Resin process did lower the load, but as soon as it was started again the load rose back to the abormal level - and since we hadn't deployed a new version for a few weeks (and the stacktraces showed nothing of interest) we ruled out our own code as the culprit. I began to look elsewhere.. ran a few rootkit-dectors against a know list of sha1sum's in an attempt to see if (and hopefully not!) there were anything rotten, but nothing. Then I began to wonder - even though Apache's /server-status showed a rather normal load and nothing extraordinary (except 180-190% CPU usage), it must be possible to see what the individual Apache processes were doing. I attached to a process with $ strace -p <pid> and began to look at the system calls, and I soon began to wonder about a lot of these calls: open("/tmp/resintmp-DmLgMw", O_RDWR|O_CREAT|O_EXCL, 0600) = 44 write(44, "H\0\16check-intervalS\0\0015H\0\6cookieS\0"..., 16379) = 16379 ... write(44, "last-updateS\0\n1073741823h\0\0c\0\0e\0"..., 1818) = 1818 close(44) = 0 rename("/tmp/resintmp-DmLgMw", "/tmp/localhost_6856") = 0 unlink("/tmp/resintmp-DmLgMw") = -1 ENOENT (No such file or directory) stat("/tmp/localhost_6856", {st_mode=S_IFREG|0600, st_size=198379, ...}) = 0 ... open("/tmp/resintmp-K8ylXW", O_RDWR|O_CREAT|O_EXCL, 0600) = 44 write(44, "H\0\16check-intervalS\0\0015H\0\6cookieS\0"..., 16379) = 16379 ... write(44, "last-updateS\0\n1073741823h\0\0c\0\0e\0"..., 1818) = 1818 close(44) = 0 rename("/tmp/resintmp-K8ylXW", "/tmp/localhost_6856") = 0 unlink("/tmp/resintmp-K8ylXW") = -1 ENOENT (No such file or directory) Then I remember seeing a post about "localhost_<srun port>" files on the mailinglist from Vlad Artamonov (03 Aug 2008) and Scott Ferguson's reply on the 4th. Oh dear - I had some large localhost_<srun port> files: -rw------- 1 apache apache 198379 Sep 16 11:22 /tmp/localhost_6856 -rw------- 1 apache apache 176417 Sep 16 11:22 /tmp/localhost_6862 -rw------- 1 apache apache 766 Sep 16 11:26 /tmp/localhost_6873 -rw------- 1 apache apache 152038 Sep 16 11:20 /tmp/localhost_6880 -rw------- 1 apache apache 139985 Sep 16 09:25 /tmp/localhost_6893 -rw------- 1 apache apache 140689 Sep 16 11:21 /tmp/localhost_6897 (The smaller one (localhost_6873) is a site that's only available from specific IP's in the firewall and was never attacked, so I took that as a "normal" size) Looking at the contents of them (via less and strings) I could see a lot what looked like leftover garbage from the directory attack we experienced Sunday. I then stopped Apache, removed the files and restarted Apache, and as can be seen on the "last 2 hours" graph this immediately lowered the load (I removed the files around 11:55), and my problem vanished! This left me wondering. - Was there anything I could or should have done earlier to find the error (except to trace an Apache process as I did)? - Was Resin's mod_caucho (from the pro version 3.0.24) behaviour as expected - ie. it kept a really large cache-file updated on every request and persisted over restartes of both Apache and Resin? - Can Resin detect when performance issues arrise due to the large size and possible do something? - Can I somehow configure how often this file is updated - the documentation on the <dependency-check-interval> tag Scott mentioned doesn't mention the effect on the localhost_<srun port> files? - Shouldn't these files be reset or removed when a VM is shut down or started to ensure optimal performance? - Is the default behaviour different in Resin 3.1.x? - Is there a bug somewhere or somehow? - Did I do the "right thing" to remove the files, or should I have done something else entirely? Regards, Jens Dueholm Christensen Rambøll Survey IT
_______________________________________________ resin-interest mailing list resin-interest@caucho.com http://maillist.caucho.com/mailman/listinfo/resin-interest