A day or so after downloading Polipo, I decided to try
polipo-trimcache. I had 36MB of cache, so I tried a dry-run down to
30MB. It took it all the way down to 16MB!
Well, this seemed a little weird, but fortunately I had run a du on the
cache shortly before, so I recognized the last file deleted. I had
browsed for awhile, accumulating about 4MB of small files in the cache,
and then saw a flash video. The .flv file was 16MB, so to get the cache
below 30M, the .flv had to go. But in theory, the rest of the files
could have stayed.
So, I wrote a new version of polipo_trimcache that fixes this. It first
finds all the files that would usually be deleted, but just stores them
in a list. It then goes back and adds any files that appear to fit back
on the end of cobjs. Finally, it puts the list to delete on the end of
cobjs, and runs what is basically J. P.'s original deletion code, just
in a different order. In the pathological case above, this deleted the
.flv file and nothing else. In most cases, it doesn't do a whole lot,
but it usually gets to the desired megabyte size to within the printed
precision!
Also, I happen to be running this on Windows (with Cygwin, hence du),
and Python doesn't know that the NTFS block size is 4K. So I added an
option to force a block size. Then Python 3.0 came out, so I made the
program compatible with versions 2.5-3.0 (and it probably still works
with 2.4 too.)
Ken
#!/usr/bin/env python
# -*- Mode: Python; tab-width: 8; py-indent-offset: 8; indent-tabs-mode: nil -*-
#
# polipo_trimcache version 0.2
# Written by J.P. Larocque, <[EMAIL PROTECTED]>, OpenPGP 0xF61D2E61
# This software is in the public domain.
#
# Polipo is a small, caching HTTP proxy with nifty features like
# client and server IPv6 support, pipelining, and the caching of
# partial objects. See: http://www.pps.jussieu.fr/~jch/software/polipo/
#
# Polipo doesn't have any built-in facility for keeping its cache
# directory under a given fixed size. This script implements that: it
# trims your cache directory to a target size by removing the files with
# oldest access times until the cache size is smaller than or equal to
# the target size.
#
# This script requires Python 2.2 or greater.
#
# Some notes about polipo_trimcache's operation:
# * Send Polipo SIGUSR1 prior to and SIGUSR2 after running this
# script, as if you were using Polipo's own cache-purging
# facility. This is explained in the manual:
# http://www.pps.jussieu.fr/~jch/software/polipo/manual/Purging.html
# * If you run Polipo non-root (as you should), you should also run
# this script as the same user. It is conceivable that, given a
# carefully-crafted cache directory, polipo_trimcache could mangle
# with things outside the cache, either because of bugs in the
# script or in Python. Restricting this script to execution by
# the same or another isolated user will protect you from problems
# a cracker could cause if Polipo were cracked. (No current
# security issues with any of these pieces of software are known
# by the author at the time of writing.)
# * Hidden files and files other than regular files and directories
# are skipped. This means symlinks are not honored, with the
# exception of the root cache directory. This shouldn't matter
# for normal use.
# * On systems that report it, polipo_trimcache uses st_blocks
# from stat(). If st_blocks is 0 for a given file/directory, it
# uses st_size instead. You can override and give a block
# size with the --force-blocksize option.
# * polipo_trimcache tries to match the TARGET_SIZE as closely as
# possible. This means that some small files older than
# "Latest file removed" may remain in the cache.
# * polipo_trimcache removes directories when it removes the last
# file from them (except for the root cache directory).
#
# Bugs: (both of which are caused by the last point above)
# * Empty directories aren't removed if they didn't contain any
# files to begin with.
# * Dry-run mode never pretends to remove directories, since it
# never removes files, enabling the condition for directories to
# be removed.
#
# The latest version of this program can be found at the URL:
# http://ely.ath.cx/~piranha/software/polipo_trimcache/
# This program has been signed with my public key. Its fingerprint is:
# 5612 10A8 4986 2D85 A995 252B 4C02 5E02 F61D 2E61
# The detached signature can be found at the URL:
#
http://ely.ath.cx/~piranha/software/polipo_trimcache/polipo_trimcache-0.2.py.asc
#
# Changelog:
# version 0.3 by Ken Brazier (2008/Nov/29)
# Improved size targeting, added force-blocksize.
# Also made Python 3.0 compatible.
# version 0.2 (2004/Aug/10)
# Cleaned up for initial release.
# version 0.1 (2004/Jun/27)
# Initially written.
#
help = \
"""Usage: polipo_trimcache [-fnpvVDh?] [-b blocksize] POLIPO_CACHE_DIR
TARGET_SIZE
-f, --force
Force questionable behavior. Currently this means
continuing even when the TARGET_SIZE is less than 2KB.
-n, --no-op, --dry-run
Don't make any changes to the cache contents. Print
the files that would be deleted instead.
-p, --precise-expiry
Determines last-access time of cache files by opening
them and reading their X-Polipo-Access header. The
default behavior is to use the last-modification time
of the file, which should be faster. Analogous to the
Polipo configuration option "preciseExpiry".
-v, --verbose
Verbose operation. Prints every stage of the operation,
and reports hidden or special files.
-b, --force-blocksize
Force a given blocksize. Python can't determine the
block size on all filesystems; if you can, do so here.
If you don't understand this option, ignore it. :)
-V, --version
Print version number, and exit.
-D, --debug
Debugging mode. Report each directory on cache
transversal, and each unlink/rmdir operation.
-h, -?, --help
Print usage information.
POLIPO_CACHE_DIR specifies the root of the Polipo cache. TARGET_SIZE
specifies the size polipo_trimcache trims down to, expressed in bytes.
(You may specify megabytes or gigabytes by appending "M" or "G" to the
size argument. The quantifiers [KMGTPEZY] are supported (future-
proof!).)
Example:
polipo_trimcache /var/local/cache/polipo 256M
Deletes enough files from said cache directory to make it use under
256 megabytes.
"""
version = "0.3"
last_modified = "2008/December/04"
import os, os.path, sys, re, stat, time, getopt, math
me = os.path.basename (sys.argv[0])
# These are defaults, not hard-coded settings.
force_blocksize = 1
force_mode = 0
no_op_mode = 0
precise_expiry_mode = 0
verbose_mode = 0
debug_mode = 0
def main (args):
global force_blocksize, force_mode, no_op_mode, precise_expiry_mode,
verbose_mode, debug_mode
try:
options, extra = getopt.getopt (args, "b:fnpvVDh?",
("force-blocksize=","force", "no-op", "noop", "dry-run", "precise-expiry",
"verbose",
"help", "version", "debug"))
except getopt.GetoptError:
usage ()
return 1
for option, argument in options:
if option in ("-b", "--force-blocksize"):
force_blocksize = hr2num(argument)
elif option in ("-f", "--force"):
force_mode = 1
elif option in ("-n", "--no-op", "--noop", "--dry-run"):
no_op_mode = 1
elif option in ("-p", "--precise-expiry"):
precise_expiry_mode = 1
elif option in ("-v", "--verbose"):
verbose_mode = 1
elif option in ("-h", "-?", "--help"):
sys.stdout.write (help)
return 0
elif option in ("-V", "--version"):
print ("%s version %s (%s)" % (me, version,
last_modified))
return 0
elif option in ("-D", "--debug"):
debug_mode = 1
# To force assumption of a given block size in bytes. Default of 1 to
disable.
if len (extra) != 2:
usage ()
return 1
cache_dir = extra[0]
target_size = hr2num (extra[1])
verbose_msg ("Transversing cache...")
cobjs, cdirs = transverse_cache (cache_dir)
verbose_msg ("Sorting and counting...")
cobjs.sort (key=lambda o1: o1.last_access,reverse=True)
total_size = long(0)
for cfile in cobjs + cdirs:
total_size += cfile.size
verbose_msg ("Cache is currently %sB (%d files, %d dirs)." % (num2hr
(total_size), len (cobjs), len (cdirs)))
if(total_size <= target_size):
verbose_msg("Cache is small enough. Exiting.")
sys.exit(0)
verbose_msg ("Trimming...")
files_removed = dirs_removed = 0
latest_trimmed = None
del_list = []
total_est_size = total_size
while cobjs and (total_est_size > target_size):
# Find the oldest-used cache object, and mark it for deletion.
victim = cobjs.pop()
del_list.insert(0, victim)
total_est_size -= victim.size
# The directory check is useless in no-op mode; they wouldn't
be deleted, and it would look for more files to delete.
if not no_op_mode:
# Now delete entry in parent directory.
victim = victim.parent
victim.children -= 1
# We assume there's no files we skipped over (hidden,
special), for size purposes.
if (victim.children == 0): total_est_size -= victim.size
# This is roughly what polipo_trimcache v0.2 would have done.
verbose_msg("Cache estimate 1: %sB (%d files, unknown dirs)" %
(num2hr(total_est_size), len(cobjs)))
# Now see if any of those files were small enough that we can add them
back.
# We traverse del_list, which is in the same order as cobjs, but
looking to save the newest files, instead of delete the oldest.
# This is, of course, a Bin Packing Problem, so a perfect solution is
Hard. But this is close.
for i in xrange(len(del_list)):
if(total_est_size >= target_size): break
if(total_est_size + del_list[i].size <= target_size):
# The last deleted object (most recently used) gets
pushed on the stack first, so it's least likely to be deleted again.
cobjs.append(del_list[i])
total_est_size += del_list[i].size
victim = del_list[i].parent
del_list[i] = None
if not no_op_mode:
# Now restore entry in parent directory.
victim.children += 1
# We assume there's no files we skipped over
(hidden, special), for size purposes.
if (victim.children == 1): total_est_size +=
victim.size
# This is The Plan; but directories could change it.
verbose_msg("Cache estimate 2: %sB (%d files, unknown dirs)" %
(num2hr(total_est_size), len(cobjs)))
# Need to clean up the directory listings, so they'll be back to normal
for the real delete.
if not no_op_mode:
for victim in del_list:
# Any that are None were already done.
if(victim == None): continue
victim = victim.parent
# Now restore entry in parent directory.
victim.children += 1
# Now, the deletion sequence will be:
# 1. Files slated for deletion (oldest first, but almost certainly all.)
# 2. Files we thought we could restore (oldest first, but probably few
to none.)
# 3. Other files (few to none.)
cobjs += del_list
while cobjs and (total_size > target_size):
# Delete the oldest-used cache object, and possibly its parent
directory.
victim = cobjs.pop ()
# Skip deleted entries from del_list.
if(victim == None): continue
if no_op_mode:
print (victim.path)
else:
debug_msg ("Removing file: %s" % victim.path)
os.unlink (victim.path)
total_size -= victim.size
files_removed += 1
if(latest_trimmed == None or victim.last_access >
latest_trimmed): latest_trimmed = victim.last_access
# Now delete parent and grandparent directories, if empty.
victim = victim.parent
while victim:
victim.children -= 1
# We add the check for files with listdir in case
there's any files we skipped over (hidden, special).
if (victim.children == 0) and (not os.listdir
(victim.path)):
if no_op_mode:
debug_msg ("Removing directory [NOOP]:
%s" % victim.path)
else:
debug_msg ("Removing directory: %s" %
victim.path)
os.rmdir (victim.path)
total_size -= victim.size
cdirs.remove (victim)
dirs_removed += 1
# Parent directory is now the next candidate
for removal.
victim = victim.parent
else:
break
verbose_msg ("Cache trimmed (-%d files, -%d dirs) to %sB (%d files, %d
dirs)." % \
(files_removed, dirs_removed, num2hr (total_size), len
(cobjs), len (cdirs)))
if latest_trimmed is not None:
verbose_msg ("Latest file removed was %s old" % secs2thr (int
(time.time ()) - latest_trimmed))
def warn (msg):
sys.stderr.write ("%s: %s\n" % (me, msg))
def usage ():
warn ("usage: %s [-fnpvVDh?] POLIPO_CACHE_DIR TARGET_SIZE" % me)
def verbose_msg (msg):
if verbose_mode: warn (msg)
def debug_msg (msg):
if debug_mode: warn (msg)
class CacheFile:
"""Superclass for any kind of (non-special) file found in the Polipo
cache tree."""
pass
class CacheDir (CacheFile):
"""Class for directories only."""
def __init__ (self, parent = None):
self.parent = parent
self.children = None
self.size = None
self.path = None
def __repr__ (self):
# Still waiting on that ternary operator. OH WAIT, those
"decrease readability".
if self.children is None: children = "?"
else: children = str (self.children)
if self.size is None: size = "?"
else: size = str (self.size)
if self.path is None: path = "?"
else: path = "\"%s\"" % self.path
return "<CacheDir instance, path=%s, children=%s, size=%s,
has_parent=%d>" % \
(path, children, size, self.parent != None)
class CacheObject (CacheFile):
"""Class for regular files (cached HTTP objects) only."""
def __init__ (self, parent = None):
self.parent = parent
self.last_access = None
self.size = None
self.path = None
def __repr__ (self):
if self.last_access is None: last_access = "?"
else: last_access = time.strftime ("[%Y/%m/%d %H:%M:%S]",
time.localtime (self.last_access))
if self.size is None: size = "?"
else: size = str (self.size)
if self.path is None: path = "?"
else: path = "\"%s\"" % self.path
return "<CacheObject instance, path=%s, last_access=%s,
size=%s, has_parent=%d>" % \
(path, last_access, size, self.parent != None)
def transverse_cache (dir, parent = None):
"""Recursively transverses a given directory. Returns a tuple of 0) a
list of all found files as CacheObject instances, and 1) a list of all found
directories as CacheDir instances (except for the given root directory). Note
that the return values are flat lists and not some sort of tree structure. If
parent is given, each file in the given directory will be assigned its value as
the parent directory, and the children attribute of the parent will be assigned
the number of children found in the given directory."""
debug_msg ("Entering directory: %s" % dir)
cobjs = []
cdirs = []
children = 0
for file in os.listdir (dir):
file_full = os.path.join (dir, file)
if file.startswith ("."):
verbose_msg ("Skipping hidden file: %s" % file_full)
continue
file_stat = os.lstat (file_full)
if stat.S_ISDIR (file_stat.st_mode):
cfile = CacheDir (parent)
cdirs.append (cfile)
sub_cobjs, sub_cdirs = transverse_cache (file_full,
cfile)
cobjs += sub_cobjs
cdirs += sub_cdirs
del (sub_cobjs, sub_cdirs)
elif stat.S_ISREG (file_stat.st_mode):
cfile = CacheObject (parent)
cobjs.append (cfile)
if precise_expiry_mode:
cfile.last_access = get_precise_access
(file_full)
# If there is no precise access-time
information, we use something "very old":
if cfile.last_access == None:
cfile.last_access = 0
else:
cfile.last_access = file_stat.st_mtime
else:
verbose_msg ("Skipping special file: %s" % file_full)
continue
try:
cfile.size = file_stat.st_blocks * 512
# Directories seem to have an st_blocks value of 0.
if cfile.size == 0:
cfile.size = file_stat.st_size
except AttributeError:
cfile.size = file_stat.st_size
# Force a given blocksize; default 1 byte.
cfile.size =
force_blocksize*((cfile.size+force_blocksize-1)//force_blocksize)
cfile.path = file_full
children += 1
if parent:
parent.children = children
return cobjs, cdirs
def get_precise_access (file_name):
"""Open a Polipo cache file and return its last access time, determined
from X-Polipo-Access."""
access = None
file = open (file_name)
# For HTTP header line:
file.readline ()
line_num = 1
while 1:
line_num += 1
line = file.readline ()
if line == "\r\n": break
try:
header, value = line.split (": ", 1)
if header.lower () == "x-polipo-access":
access = int (rfc822.mktime_tz
(rfc822.parsedate_tz (value)))
except ValueError:
whine ("Reading \"%s\":%d" % (file_name, line_num))
break
file.close ()
return access
def whine (msg):
"""Warns the given message and information on the current exception
being handled."""
exc_class, exc_instance = sys.exc_info ()[:2]
exc_args = exc_instance.args
warn ("%s: %s: %s" % (msg, exc_class, exc_args))
hr2num_re = re.compile (r'^([0-9]+(?:\.[0-9]+)?) *([KMGTPEZY]?)$', re.I)
try:
long
except NameError:
long = int
try:
xrange
except NameError:
xrange = range
quants_dict = {
'K': long(1024) ** 1,
'M': long(1024) ** 2,
'G': long(1024) ** 3,
'T': long(1024) ** 4,
# It's uh, future-proof...
'P': long(1024) ** 5,
'E': long(1024) ** 6,
'Z': long(1024) ** 7,
'Y': long(1024) ** 8,
}
quants_tups = quants_dict.items ()
quants_tups = sorted (quants_tups,key=lambda t1: t1[1])
quants_tups_dec = quants_tups
quants_tups_dec.reverse ()
tquants_dict = {
'm': 60,
'h': 60 * 60,
'd': 60 * 60 * 24,
'w': 60 * 60 * 24 * 7,
}
tquants_tups = tquants_dict.items ()
tquants_tups = sorted (tquants_tups,key=lambda t1: t1[1])
tquants_tups_dec = tquants_tups
tquants_tups_dec.reverse ()
def hr2num (hr):
"""Given a string with an integral or floating-pointing number and an
optional quantifier, return a long of the expressed amount."""
match = hr2num_re.match (hr)
if match:
amount = float (match.group (1))
quant = match.group (2)
if quant:
amount *= quants_dict[quant.upper ()]
return long (math.ceil (amount))
else:
raise ValueError("bad human-readable expression
(\"NN.N[KMGTPEZY]\")")
def num2hr (num):
"""Given an integer, return a string with the same amount expressed
with a quantifier."""
num = long (num)
for quant_abbr, quant_amount in quants_tups_dec:
if num >= quant_amount:
return "%.2f%s" % (float (num) / quant_amount,
quant_abbr)
return str (num)
def secs2thr (seconds):
output = ""
for quant_abbr, quant_amount in tquants_tups_dec:
if seconds >= quant_amount:
q_amount = int (seconds) // quant_amount
seconds -= q_amount * quant_amount
output += "%d%s" % (q_amount, quant_abbr)
if seconds == math.floor (seconds):
output += "%ds" % seconds
else:
output += "%.1fs" % seconds
return output
if __name__ == "__main__":
sys.exit (main (sys.argv[1:]))
------------------------------------------------------------------------------
SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
The future of the web can't happen without you. Join us at MIX09 to help
pave the way to the Next Web now. Learn more and register at
http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/
_______________________________________________
Polipo-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/polipo-users