On Sun, Apr 25, 2010 at 01:18:25PM +0200, Angelo Arrifano wrote:
> Hello developers developers and developers,
>
> Ever wondered how much crap is left in your X-years old Gentoo box?
>
> I just developed a python utility to efficiently find orphaned files in
> the system. By orphaned files I mean the files that are present on
> system directories and don't belong to any installed package.
>
> The package builds a virtual filesystem (cache) on the RAM using python
> hash tables. Then it uses the cache to find the ownership of files
> inside user-specified dirs.
>
> Building the cache takes less than 10 seconds here in a system with 1366
> installed packages.
>
> This is not intended to be a finished program yet, I'm looking forward
> for your constructive commentaries.

You're going to want to do realpathing here... also you'll need to
handle symlinks, and spaces are allowed in paths. I'd personally suggest
using one of the PM APIs for this.
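
To illustrate the realpath point, here is a rough sketch (not taken from any PM; `normalize` is a made-up helper name). It assumes you want symlinked parent directories resolved, while leaving the final component alone so that a symlink a package owns is still compared under its own name:

```python
import os

def normalize(path):
    """Resolve symlinks in the parent directories, but keep the last
    component as-is so an owned symlink still matches its own entry."""
    parent, name = os.path.split(os.path.abspath(path))
    return os.path.join(os.path.realpath(parent), name)
```

Applying the same normalization to both the recorded paths and the scanned paths means a file reached through a symlinked directory compares equal to the same file reached through the real directory. Since the path is handled as a whole string and never split on whitespace, spaces in it are harmless.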
Part of the reason I advise poking at the PM APIs is that they cover up
some of the nastier details with contents, and others with parsing; simple
example,
python -c "
import sys
from pkgcore.config import load_config
from pkgcore.fs import contents, livefs

# everything the installed packages (the vdb) claim to own
owned = contents.contentsSet()
for pkg in load_config().get_default('domain').named_repos['vdb']:
    owned.update(pkg.contents)

# whatever is on disk under the given path but not owned by any package
stream = (x for x in livefs.iter_scan(sys.argv[1]) if x not in owned)
print('\n'.join(map(str, sorted(stream))))
" desired-path
Note also that's a *very* quick write-up. I'd personally look at
serializing the sorted lists to disk for both streams (what contents
says is on disk vs. what actually is on disk), and then walking the
lists in lockstep; that way you can keep the memory usage down.
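
A rough sketch of the lockstep walk, assuming both inputs are already sorted with the same ordering and are plain comparable values such as path strings. The name `orphans` is made up for illustration and is not part of any PM API:

```python
def orphans(on_disk, owned):
    """Yield entries of the sorted iterable on_disk that do not appear
    in the sorted iterable owned, holding one item of each in memory."""
    owned = iter(owned)
    sentinel = object()          # marks exhaustion of the owned stream
    current = next(owned, sentinel)
    for path in on_disk:
        # advance the owned stream until it catches up with path
        while current is not sentinel and current < path:
            current = next(owned, sentinel)
        if current is sentinel or current != path:
            yield path

print(list(orphans(["/a", "/b", "/c", "/d"], ["/a", "/c", "/e"])))
# -> ['/b', '/d']
```

Since both arguments can be any iterables, they could just as well be two open files read line by line, which is what keeps the memory usage flat.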
~harring
