Joel Divekar wrote:
Hi All,
We have a Windows-based file server with thousands of
user accounts, and each user has thousands of files in
his home directory. Most of these files are duplicates
or modified/updated versions of existing files. They
are .doc, .xls, or .ppt files shared by groups or
departments.
Because of this the server is holding a terabyte of
data, most of it redundant, and our sysadmin has a
tough time maintaining storage space.
So I thought of writing a small program to locate
similar or duplicate files stored on the file server
and delete them with the user's help. The program
should run very fast, and I don't know where to
start.
Can anybody point me to some links on how to start?
From there I will take it up. I would also like to
know whether there is a long-term solution for this
problem. I am comfortable with Linux and shell
programming.
Please advise. Thanks a lot.
Regards
Joel
Mumbai, India
9821421965
File::Find is one possibility, except that it seems to behave badly when
files are being modified while the tree is being walked. My experience
of 'badly' is duplicated results. Nothing fatal, but something to be
aware of.
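
A minimal File::Find sketch for collecting the candidate files (the
starting directory and extension filter here are assumptions; adjust
them to your tree):

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Assumed share root -- point this at the top of the home directories.
my $top = '/srv/home';

my @candidates;
find(
    sub {
        return unless -f $_;                  # plain files only
        return unless /\.(doc|xls|ppt)$/i;    # office documents only
        push @candidates, $File::Find::name;  # remember the full path
    },
    $top,
);

print scalar(@candidates), " files to examine\n";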
So you want to build a hash of FullPath => MD5 digest, then invert it
into a second hash of digest => [files]. If a digest has more than one
filename associated with it, those files are duplicates, and at that
point you probably want more stat information (mtime) to decide which
copies to purge. A rough sketch follows.
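
Something like this, using Digest::MD5 (untested; @candidates would
come from the File::Find pass above):

use strict;
use warnings;
use Digest::MD5;

my %digest_of;    # full path  => hex MD5 digest
my %files_with;   # hex digest => [ list of paths ]

for my $path (@candidates) {
    open my $fh, '<:raw', $path or do { warn "skip $path: $!"; next };
    $digest_of{$path} = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    push @{ $files_with{ $digest_of{$path} } }, $path;
}

# Any digest with two or more paths is a duplicate set; sort by mtime
# so the newest copy is kept and the rest are offered for deletion.
for my $md5 (keys %files_with) {
    my @dups = @{ $files_with{$md5} };
    next unless @dups > 1;
    my @by_age = sort { (stat $b)[9] <=> (stat $a)[9] } @dups;  # newest first
    print "keep  $by_age[0]\n";
    print "purge $_\n" for @by_age[ 1 .. $#by_age ];
}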
This could probably be done in RAM if you are under 10^6 files.
Even if you can't hold the entire tree, you could at least do it in
chunks, for example only looking at files within a given size range
until you pare things down a little, as in the sketch below.
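
One way to do the chunking is to group by file size first and only MD5
the files whose size collides with another file's; a sketch:

use strict;
use warnings;

my %by_size;
for my $path (@candidates) {
    push @{ $by_size{ -s $path } }, $path;    # -s gives the size in bytes
}

# Only size groups with more than one member can contain duplicates,
# so only those files need to be hashed -- this keeps the working set small.
my @worth_hashing = map  { @{ $by_size{$_} } }
                    grep { @{ $by_size{$_} } > 1 }
                    keys %by_size;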