On Sun, Aug 15, 2004 at 08:31:43AM +0930, Paul A. Hoadley wrote:
> Hello,
> I'm in the process of cleaning a Maildir full of spam.  It has
> somewhere in the vicinity of 400K files in it.  I started running
> this yesterday:
> find . -atime +1 -exec mv {} /home/paulh/tmp/spam/sne/ \;
> It's been running for well over 12 hours.  It certainly is
> working---the spams are slowly moving to their new home---but it is
> taking a long time.  It's a very modest system, running 4.8-R on a
> P2-350.  I assume this is all overhead for spawning a shell and
> running mv 400K times.

I wouldn't make that assumption. The overhead for starting new
processes is probably only a relatively small part of the time.

You seem to have missed the fact that operations on very large
directories (which a directory with 400K files in it certainly
qualifies as) simply are slow.
A directory is essentially just a list of the names of all the files in
it and their i-nodes.  To find a given file in a directory (e.g. in
order to create, delete or rename it) the system needs to do a linear
search through all the entries in the directory.  For directories
containing large numbers of files this can take considerable time.

If you have the UFS_DIRHASH kernel option enabled (which I believe is
the default since 4.5-R) then the system will keep a bunch of hash tables
in memory to avoid having to search through the whole directory every
time.  There is however an upper limit to how much memory will be used
for such hashtables (2MB by default) and if this limit is exceeded
(which it probably is in your case) things will slow down again.
The effect of UFS_DIRHASH is that instead of directory operations
starting to slow down after a few thousand files in the same directory,
you can have a few tens of thousands of files before operations become
noticeably slower.
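If memory allows, that 2MB cap can be raised.  A hedged sketch, assuming
the sysctl is named vfs.ufs.dirhash_maxmem on your release (check the
output of `sysctl vfs.ufs` first):

```shell
# Assumed sysctl name -- verify with `sysctl vfs.ufs` before relying on it.
sysctl vfs.ufs.dirhash_maxmem          # show the current limit (bytes)
sysctl vfs.ufs.dirhash_maxmem=8388608  # allow up to 8MB of hash tables
```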

I am quite certain that if those 400K files had been divided into 40
directories, each with 10K files in it, things would have been much
faster.

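As a minimal sketch of that idea (file names and layout made up for
illustration; FreeBSD would use `md5 -qs "$f"` where GNU md5sum is shown
here), each message can be filed into a subdirectory keyed on a couple of
hex digits of a hash of its name, so no single directory ever grows huge:

```shell
# Hedged sketch: bucket a message into one of 256 subdirectories by the
# first two hex digits of an MD5 of its name.
f="1092519103.12345.example"                 # made-up Maildir-style name
touch "$f"                                   # stand-in for a real message
d=$(printf '%s' "$f" | md5sum | cut -c1-2)   # FreeBSD: d=$(md5 -qs "$f" | cut -c1-2)
mkdir -p "spam/$d"
mv "$f" "spam/$d/"
```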
>  Is there a better way to move all files based
> on some characteristic of their date stamp?  Maybe separating the find
> and the move, piping it through xargs?  It's mostly done now, but I
> will know better for next time.

Reducing the number of processes spawned will certainly help somewhat,
but the better fix is not to have so many files in a single directory in
the first place - that is just asking for trouble.
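For the xargs variant, a hedged sketch that invokes mv once per batch of
names instead of once per file (the stand-in directories are made up for
illustration; GNU mv's -t flag is used here, while FreeBSD's xargs can
place the batch before the target with -J, e.g.
`find . -atime +1 -print0 | xargs -0 -J % mv % /home/paulh/tmp/spam/sne/`):

```shell
# Batch the moves: one mv invocation for many files, rather than one
# fork/exec per file as with find -exec ... \;
src=$(mktemp -d); dst=$(mktemp -d)               # stand-in directories
for i in 1 2 3 4 5; do touch "$src/spam$i"; done
# -print0 / -0 keep odd file names safe; xargs passes all names to one mv.
find "$src" -type f -print0 | xargs -0 mv -t "$dst"
ls "$dst"
```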

<Insert your favourite quote here.>
Erik Trulsson