Hi

 

A small report "from the trenches" after someone ran a directory attack against one of our servers, and the effect it had.

 

I have a few MRTG graphs to show, but they are located at an image-host (no 
nasties - just a direct link to the images):

*          last 2 hours, showing the effect after "fixing" the problem at ~11:55 - http://billedhost.dk/filer/1221728072cpu_load_last_2_hours.png

*          last 24 hours, showing the effect of the problem - http://www.billedhost.dk/filer/1221728228cpu_load_last_day.png

*          last week, showing what the normal load usually is (with a few spikes here and there) and what the directory attack did - http://www.billedhost.dk/filer/1221728247cpu_load_last_week.png

 

The history:

 

At around 17:49 Sunday evening someone began a directory attack on one of our 
servers. The attack was logged like this in Apache's httpd-access.log:

 

aadi 24.249.18.233 - - [14/Sep/2008:17:49:33 +0200] "GET /~aadi/ HTTP/1.1" 404 204 "-" "-" "-" "-" "-" "-"
aaliyah 24.249.18.233 - - [14/Sep/2008:17:49:33 +0200] "GET /~aaliyah/ HTTP/1.1" 404 207 "-" "-" "-" "-" "-" "-"
aaralyn 24.249.18.233 - - [14/Sep/2008:17:49:34 +0200] "GET /~aaralyn/ HTTP/1.1" 404 207 "-" "-" "-" "-" "-" "-"
aaron 24.249.18.233 - - [14/Sep/2008:17:49:34 +0200] "GET /~aaron/ HTTP/1.1" 404 205 "-" "-" "-" "-" "-" "-"
abba 24.249.18.233 - - [14/Sep/2008:17:49:34 +0200] "GET /~abba/ HTTP/1.1" 404 204 "-" "-" "-" "-" "-" "-"
abbie 24.249.18.233 - - [14/Sep/2008:17:49:34 +0200] "GET /~abbie/ HTTP/1.1" 404 205 "-" "-" "-" "-" "-" "-"

 

(To help understand the fields, this is the log format we use: "%V %h %l %u %t 
\"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\" \"%{Via}i\" \"%{Pragma}i\" 
\"%{X-Forwarded-For}i\" \"%{Cache-Control}i\"")

 

In total there were ~14,500 such attempts, spread evenly across 5 vhosts - each 
one running its own Resin instance. On average the ~2,900 hits per vhost are 
really nothing, and they would have gone unnoticed if the CPU load hadn't begun 
to rise above normal on Monday morning (anything above 1 for an extended period 
is abnormal for this server).
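For anyone curious how such a tally can be pulled from the log, here is a quick sketch. The filename and the two sample entries are just stand-ins for the real log; the first field is the vhost (%V), per the log format above:

```shell
#!/bin/sh
# Two sample entries standing in for the real httpd-access.log
cat > access.log <<'EOF'
aadi 24.249.18.233 - - [14/Sep/2008:17:49:33 +0200] "GET /~aadi/ HTTP/1.1" 404 204 "-" "-" "-" "-" "-" "-"
aaliyah 24.249.18.233 - - [14/Sep/2008:17:49:33 +0200] "GET /~aaliyah/ HTTP/1.1" 404 207 "-" "-" "-" "-" "-" "-"
EOF
# Total attempts from the attacking IP
grep -c '24\.249\.18\.233 .*"GET /~' access.log
# Attempts per vhost (first field of each entry)
awk '/"GET \/~/ { n[$1]++ } END { for (v in n) print v, n[v] }' access.log
```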

 

So on Monday I began looking at what had happened on the server. The individual 
Resin processes were normal - stacktraces of the running VMs showed nothing, and 
stopping/starting them did next to nothing (this is a production server, so 
there was precious little I could try without disrupting services for several 
thousand users).

 

The interesting thing was that, according to the OS (RHEL 5.1 x64), it was 
Apache that was responsible for the high load - not Resin. Stopping the main 
Resin process did lower the load, but as soon as it was started again the load 
rose back to the abnormal level - and since we hadn't deployed a new version for 
a few weeks (and the stacktraces showed nothing of interest), we ruled out our 
own code as the culprit.

 

I began to look elsewhere... I ran a few rootkit detectors against a known list 
of sha1sums to see whether (hopefully not!) anything was rotten, but found 
nothing.

 

Then I began to wonder - even though Apache's /server-status showed a rather 
normal load and nothing extraordinary (except 180-190% CPU usage), it must be 
possible to see what the individual Apache processes were doing.

 

I attached to a process with $ strace -p <pid> and began to look at the system 
calls, and I soon began to wonder about a lot of these calls:

 

open("/tmp/resintmp-DmLgMw", O_RDWR|O_CREAT|O_EXCL, 0600) = 44
write(44, "H\0\16check-intervalS\0\0015H\0\6cookieS\0"..., 16379) = 16379
...
write(44, "last-updateS\0\n1073741823h\0\0c\0\0e\0"..., 1818) = 1818
close(44)                               = 0
rename("/tmp/resintmp-DmLgMw", "/tmp/localhost_6856") = 0
unlink("/tmp/resintmp-DmLgMw")          = -1 ENOENT (No such file or directory)
stat("/tmp/localhost_6856", {st_mode=S_IFREG|0600, st_size=198379, ...}) = 0
...
open("/tmp/resintmp-K8ylXW", O_RDWR|O_CREAT|O_EXCL, 0600) = 44
write(44, "H\0\16check-intervalS\0\0015H\0\6cookieS\0"..., 16379) = 16379
...
write(44, "last-updateS\0\n1073741823h\0\0c\0\0e\0"..., 1818) = 1818
close(44)                               = 0
rename("/tmp/resintmp-K8ylXW", "/tmp/localhost_6856") = 0
unlink("/tmp/resintmp-K8ylXW")          = -1 ENOENT (No such file or directory)

 

Then I remembered seeing a post about "localhost_<srun port>" files on the 
mailing list from Vlad Artamonov (03 Aug 2008) and Scott Ferguson's reply on the 
4th.

 

Oh dear - I had some large localhost_<srun port> files:

 

-rw------- 1 apache apache 198379 Sep 16 11:22 /tmp/localhost_6856
-rw------- 1 apache apache 176417 Sep 16 11:22 /tmp/localhost_6862
-rw------- 1 apache apache    766 Sep 16 11:26 /tmp/localhost_6873
-rw------- 1 apache apache 152038 Sep 16 11:20 /tmp/localhost_6880
-rw------- 1 apache apache 139985 Sep 16 09:25 /tmp/localhost_6893
-rw------- 1 apache apache 140689 Sep 16 11:21 /tmp/localhost_6897

 

(The smaller one (localhost_6873) belongs to a site that's only reachable from 
specific IPs allowed in the firewall and was never attacked, so I took its size 
as "normal".)
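A quick way to spot such bloated files would have been something like the following. The 50 KB threshold is my own guess based on the sizes above, and a scratch directory stands in for /tmp so the sketch is self-contained:

```shell
#!/bin/sh
# Scratch dir standing in for /tmp
DIR=$(mktemp -d)
dd if=/dev/zero of="$DIR/localhost_6856" bs=1k count=200 2>/dev/null  # "bloated"
printf 'small' > "$DIR/localhost_6873"                                # "normal"
# List mod_caucho cache files over ~50 KB - only the 200 KB file matches
find "$DIR" -maxdepth 1 -name 'localhost_*' -size +50k
```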

 

Looking at their contents (via less and strings) I could see a lot of what 
looked like leftover garbage from the directory attack we experienced Sunday.

 

I then stopped Apache, removed the files and restarted Apache, and as can be 
seen on the "last 2 hours" graph this immediately lowered the load (I removed 
the files around 11:55) - my problem vanished!
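For the record, the fix amounted to three steps. Sketched here against a scratch directory so it runs standalone; on the real server the path is /tmp, and Apache was stopped and started via its init script:

```shell
#!/bin/sh
TMP=$(mktemp -d)                                  # stand-in for /tmp
touch "$TMP/localhost_6856" "$TMP/localhost_6862"
# service httpd stop                              # step 1 on the real box
rm -f "$TMP"/localhost_*                          # step 2: drop the cache files
# service httpd start                             # step 3
ls -A "$TMP" | wc -l                              # prints 0
```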

 

This left me wondering. 

 

- Was there anything I could or should have done earlier to find the error 
(except to trace an Apache process as I did)?

- Was Resin's mod_caucho (from the pro version 3.0.24) behaving as expected - 
i.e. keeping a really large cache file updated on every request, with the file 
persisting across restarts of both Apache and Resin?

- Can Resin detect when performance issues arise due to the large size and 
possibly do something about it?

- Can I somehow configure how often this file is updated? The documentation on 
the <dependency-check-interval> tag Scott mentioned doesn't mention any effect 
on the localhost_<srun port> files.

- Shouldn't these files be reset or removed when a VM is shut down or started, 
to ensure optimal performance?

- Is the default behaviour different in Resin 3.1.x?

- Is there a bug somewhere? 

- Did I do the "right thing" to remove the files, or should I have done 
something else entirely?

 

Regards,

Jens Dueholm Christensen 
Rambøll Survey IT

_______________________________________________
resin-interest mailing list
resin-interest@caucho.com
http://maillist.caucho.com/mailman/listinfo/resin-interest
