I have a simple Windows batch file doing most of what I want (the data's
on a Windows PC) but I want to learn more about Bash scripting so I want
to try the same thing in Linux (and then add a few bits to it).
The basic idea is that I have a lot (1000s) of .zip files that I need to
extract, then recompress in one of a number of ways depending on the
content. (At present they all get recompressed as Zip files, just the
file extensions vary). I need to retain file date/time information from
the original .zip in the process. The content that I'm looking for is
just the presence of certain files (e.g. a mimetype file in the root of
the .zip file usually means that the file is actually an OpenOffice .odt
file, so I rename it accordingly).
At present, in Windows, I use 7-Zip's commandline tool to extract the
files and recompress them, but it is losing the date/time information
from directories so that's causing me a problem. Note that the .zip
files have a lot of rubbish data tagged on the end of them, and
Info-ZIP's unzip doesn't seem to like them as a result. That's the reason
I need to recompress them all (to lose the garbage), not just rename them.
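For the trailing-garbage problem specifically, Info-ZIP's zip has a "fix archive" mode that may be worth a try before full extraction. This is only a sketch (file names are placeholders), and -FF may prompt interactively about whether the archive is single-disk:

```shell
# -F rebuilds the archive using the central directory;
# -FF scans the whole file for entry signatures instead,
# which can salvage archives with junk appended to them.
in="recovered.zip"
zip -FF "$in" --out "${in%.zip}-fixed.zip"

# If the repair worked, the fixed copy should list cleanly:
unzip -l "${in%.zip}-fixed.zip"
```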
Background: The files result from a data recovery attempt from a corrupt
NTFS partition, so they've been located based on their file signatures.
The resulting files are all 10M in size as the simple data recovery
algorithm can't detect the file sizes. In many cases this means the real
.zip file is probably only 1% of the total file with the rest being
garbage. It also means that file types that "look like" zip files get
recovered as .zip files, such as OpenOffice docs, Firefox plugins, etc.
After the files are extracted I can often determine the correct filetype
to use after recompression.
One other side effect of this recovery process is that a lot of the
files actually contain duplicate data, so once the garbage has been
removed I can check for duplicate files and dispose of all the copies.
The .zip files store the file dates and times, though, so if the
originals are lost in the process I end up with .zip files which only
differ in the dates, but that's enough that a hash of the .zip file used
to detect duplicates will find them to be different. A potential
solution to this is to just "touch" all the directories to a specific
date, which is better than the status quo when it comes to detecting
duplicates, but still not as good as retaining the original timestamps.
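One way around the timestamps-only differences is to hash the extracted contents rather than the rebuilt .zip itself, so two copies that differ only in dates fingerprint the same. A sketch, assuming GNU find and sha256sum; the function name and directory paths are made up for the example:

```shell
#!/usr/bin/env bash
# Sketch: fingerprint a directory tree by file contents only.
# Timestamps (and archive metadata) never enter the hash.
set -euo pipefail

tree_hash() {
    # Hash every file, sort so the order is stable, then hash the list.
    (cd "$1" && find . -type f -exec sha256sum {} + | sort \
        | sha256sum | cut -d' ' -f1)
}

# Example: compare two extracted copies (paths are placeholders).
if [ "$(tree_hash copy1)" = "$(tree_hash copy2)" ]; then
    echo "duplicate content"
fi
```

The "touch everything to a fixed date" fallback is also a one-liner, run over the extracted tree before re-zipping, e.g. `find extracted/ -exec touch -t 200001010000 {} +`.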
Mark
_______________________________________________
Peterboro mailing list
[email protected]
https://mailman.lug.org.uk/mailman/listinfo/peterboro