I posted this on the users list, but was advised to post it to dev as well, since it seemed relevant to developers. Hope that's ok...

I am using Apache 2.2.9 on Linux AMD64, built from source. One server
runs two builds of Apache: a lightweight front-end caching reverse
proxy using mod_disk_cache, and a heavyweight mod_perl back end. I use
caching to relieve load on the server when many people request the same
page at once. The website is dynamic and contains millions of page
permutations, so the cache has a tendency to get fairly large unless it
is pruned. I have been trying to use htcacheclean to keep it under
control, and have run into some issues, which I will outline below.
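
For context, the front end is essentially a caching reverse proxy in
front of the mod_perl build, along these lines (the backend address
here is a placeholder, not my exact config):

ProxyPass        / http://127.0.0.1:8080/
ProxyPassReverse / http://127.0.0.1:8080/
CacheEnable disk /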

First, I found that htcacheclean was not able to keep up with pruning
the cache; the cache just kept growing. I initially ran htcacheclean in
daemon mode, thus:

htcacheclean -i -t -n -d60 -p/var/cache/www -l1000M

CacheDirLevels was 3 and CacheDirLength 1.
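
In config terms (with CacheRoot matching the -p path above):

CacheRoot       /var/cache/www
CacheDirLevels  3
CacheDirLength  1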

The cache would just keep getting bigger, to multiple GB. Additionally,
even doing a du on the cache could take hours to complete.

I also noticed that iowait would spike when I tried running htcacheclean
in non-daemon mode. It would not keep up at all using the -n ("nice")
option; when I took that off, the iowait would go through the roof and
the process would take hours to complete. This was on a quad core AMD64
server with 4 x 10k SCSI drives in hardware RAID0.
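
For reference, the non-daemon runs were along these lines (same path
and limit as above):

htcacheclean -n -t -p/var/cache/www -l1000M   (with "nice" - could not keep up)
htcacheclean -t -p/var/cache/www -l1000M      (without -n - iowait went through the roof)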

Upon investigation, I discovered that the cache was a lot deeper than I
expected. In addition to the three levels specified by CacheDirLevels,
there were additional levels of subdirectories beneath ".vary" subdirs:
for each .header file there was a .vary subdir, with three more levels
of directories below that. Simply traversing this tree with du could
take a long time - sometimes hours, depending on how long the server
had been running without a cache clear.
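
To make the shape concrete, a single cached variant ends up looking
something like this (hash names invented for illustration):

/var/cache/www/A/7/q/A7qSomeHash.header
/var/cache/www/A/7/q/A7qSomeHash.vary/R/2/k/R2kOtherHash.header
/var/cache/www/A/7/q/A7qSomeHash.vary/R/2/k/R2kOtherHash.data

With CacheDirLength 1 each level can (if I understand the key encoding
correctly) hold up to 64 subdirectories, so every extra level
multiplies the number of directories a full traversal may have to visit
by up to 64. Even just counting the directories with

find /var/cache/www -type d | wc -l

gives a sense of how much work a full du has to do.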

I discovered that the .vary subdirs were caused by my configuration,
which was introducing a Vary HTTP header. This came from two sources.
The first was mod_deflate, which I found out about from this helpful page:

http://www.digitalsanctuary.com/tech-blog/general/apache-mod_deflate-and-mod_cache-issues.html

So I disabled mod_deflate, since it seemed to be producing a huge number
of cache entries for each file - a different one for every browser,
because mod_deflate adds a Vary: Accept-Encoding header and each
distinct Accept-Encoding value gets its own cache entry. But after
disabling mod_deflate, the .vary subdirs were still there. I also
had this line in my config:

Header add Vary "Cookie"

This is necessary because users on my site set options for how the site
is displayed. When I tried disabling this cookie Vary header, the number
of directories went down substantially, to the expected three levels.
The cache structure was much simpler, and it seemed that htcacheclean
could keep up with it. However, the site was broken, since the same page
would be cached only once regardless of each user's options. So someone
who had "no ads" or "no pics" set would request a page that someone else
had recently requested (with different options) and get that other
person's version. Not good. So I had to switch the Vary header for
cookies back on, so that pages would be differentiated in the cache
based on the cookie. But now I was back to square one - six effective
levels of subdirectory, which htcacheclean could not keep up with.

After some thought, I ended up changing CacheDirLevels to 2, to try to
reduce the depth of the tree. Now I had fewer subdirs, but more files in
each one.
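
In config terms the change was simply:

CacheDirLevels  2
CacheDirLength  1

(My understanding is that entries written under the old three-level
layout are no longer found at the new paths, so the cache needs
clearing when this is changed.)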

Also, the size of the cache according to du always seems to be much
higher than the limit given to htcacheclean. I lowered the limit to
100M, but the cache is still regularly up at 180MB or 200MB. This seems
counter-intuitive: htcacheclean doesn't appear to be taking the true
size of the cache into account (i.e. including all the subdirs, which
also take up space and presumably are what cause the discrepancy).
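
Part of the gap is presumably filesystem overhead: du counts the
directories themselves and rounds every small file up to a full block,
whereas (as far as I can tell) the -l limit only counts the bytes
stored in the .header and .data files. Comparing the two views shows
how big that overhead is:

du -sh /var/cache/www
du -sh --apparent-size /var/cache/www

(the second form, on GNU du, sums file sizes rather than allocated
blocks).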

I also noticed something else: when htcacheclean cleaned out the .vary
subdirs, it seemed to leave the corresponding .header files behind.
These would accumulate, causing iowait to gradually increase,
presumably due to the growing size of the directories. I would rotate
(clear) the cache manually at midnight. The behavior I would then see
(via the munin monitoring tool) was that iowait would remain at zero
for about 12 hours, and would then gradually become visible as the
number of leftover .header files grew.

So I wrote a perl script which walks the cache looking for .header
files and, for each one found, checks whether a corresponding .vary
subdir exists; if not, the .header file is deleted. I then run another
script to prune empty subdirectories. Currently I run this combination
every 10 minutes: first a non-daemon invocation of htcacheclean, then
the header-pruning script, then the empty-subdirs script. This seems to
keep the cache small, and iowait is no longer noticeable, since the
"junk" .header files are now disposed of regularly.
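
For the curious, the header-pruning part is essentially this (a
simplified sketch rather than my exact script):

#!/usr/bin/perl
# Sketch of the .header cleanup described above. It only touches
# top-level entry headers, never the variant headers inside the .vary
# trees, and it assumes every live entry has a .vary subdir (true here,
# since every response carries a Vary header).
use strict;
use warnings;
use File::Find;

my $cache_root = '/var/cache/www';   # same path as htcacheclean -p

find(sub {
    # Never descend into .vary trees; the .header files in there
    # belong to live variants and must be left alone.
    if (-d $_ && /\.vary$/) {
        $File::Find::prune = 1;
        return;
    }

    return unless -f $_ && /\.header$/;

    # Entry X.header should have its variants under X.vary; if the
    # .vary subdir is gone, htcacheclean has already removed the
    # variants and this header is junk.
    (my $vary_dir = $File::Find::name) =~ s/\.header$/.vary/;
    unless (-d $vary_dir) {
        unlink $File::Find::name
            or warn "unlink $File::Find::name: $!\n";
    }
}, $cache_root);

The empty-directory pass afterwards is just a depth-first removal of
directories that have ended up empty; GNU find can do much the same
with something like
"find /var/cache/www -mindepth 1 -depth -type d -empty -delete".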

However, I'm not sure why I need to run this kind of hacked-up bespoke
cache management, when htcacheclean should surely be capable of doing
the job itself.

All of this brings up a few questions:

1. Why does mod_disk_cache generate six levels of subdirectory when
CacheDirLevels is clearly set to 3? I realize what it's trying to do
(each page might have many variants, and those variants must be
differentiated by subdir), but the additional levels cause an
exponential increase in the number of directories that must be
traversed. It seems absurd that this causes trouble for a relatively
well-specced server. Since starting this investigation, I have moved to
a completely new server, a quad-core 2.33GHz Xeon with 8 x 10k Raptor
SATA drives in hardware RAID10. The performance is excellent, but when
I tried using mod_disk_cache with CacheDirLevels at 3 and the cookie
Vary header on, it still could not keep up with pruning. Even simply
traversing this kind of structure with du is clearly not scalable.
Could we not have the three main levels of directory, but a separate
setting for the number of subdir levels below the .vary dirs? Usually
there is just one file at the leaf of a .vary subtree, so three
additional levels seems like overkill. We should be able to tune the
subdir levels to keep the depth of the cache as small as makes sense.

2. Why does htcacheclean not keep the cache at the stated size limit? If
you say -l100M and then do a du and it says 200M, then that is
counter-intuitive, and actually wrong in real terms. It gets worse with
the larger caches - when I had 3 levels and cookie Vary headers on, the
limit for htcacheclean was 1000M, but the cache would grow to 3GB and up.

3. Why are .header files left behind by htcacheclean when it has
deleted the .vary subdirectory? Is this something like a memory leak,
but with files? If the cached content (.data) file has gone away, why
bother keeping the .header file around? It clogs up the cache directory
and makes traversing the tree more work. If it's kept for 304 "not
modified" responses then I can understand that, but then why do these
files still pile up even after the related page would clearly have
expired anyway? Surely it would be better to just delete them when the
.vary subdir is deleted. In any case, I didn't notice .header files
being left over when the Vary header was disabled, so I think this
might be a straightforward "leak" when using Vary.

4. Will I cause any problems for Apache by deleting the leftover
.header files myself (the ones which have no corresponding .vary
subdir)? Could Apache or htcacheclean run into issues if this is done
while they are running? If the files are junk then I can't see it being
a problem, but it's currently unclear whether they are actually used
or not.

I wasn't sure at first whether to send this to the dev list, since it
seems more directed at the developers than at other users. But the list
guidelines say that "Configuration and support questions should be
addressed to a user support group", and this seemed to fall into that
category, so I posted it to the users list first.

Thanks for any insights or feedback.

Neil
