I have a simple Windows batch file doing most of what I want (the data's on a Windows PC) but I want to learn more about Bash scripting so I want to try the same thing in Linux (and then add a few bits to it).

The basic idea is that I have thousands of .zip files that I need to extract and then recompress in one of several ways depending on their content. (At present they all get recompressed as zip files; just the file extensions vary.) I need to retain the file date/time information from the original .zip in the process. The content I'm looking for is just the presence of certain files (e.g. a mimetype file in the root of the .zip usually means the file is actually an OpenOffice .odt document, so I rename it accordingly).
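In case it helps frame the question, here's a rough sketch of the "rename by content" step in Bash, assuming Info-ZIP's unzip is available and working on a clean (already recompressed) archive. The function name and the list of mimetypes are just placeholders for illustration:

```shell
# Sketch: pick a better extension for a clean .zip by peeking at the
# "mimetype" member that OpenDocument-style archives keep in their root.
classify_zip() {
    local f=$1 mime
    # unzip -p prints a single member to stdout without extracting to disk;
    # errors (e.g. no such member) are discarded so $mime is just empty
    mime=$(unzip -p "$f" mimetype 2>/dev/null)
    case "$mime" in
        application/vnd.oasis.opendocument.text)        echo odt ;;
        application/vnd.oasis.opendocument.spreadsheet) echo ods ;;
        application/epub+zip)                           echo epub ;;
        *)                                              echo zip ;;
    esac
}
```

Used as something like `ext=$(classify_zip recovered.zip); mv recovered.zip "recovered.$ext"`. Other signature files (e.g. install.rdf for old Firefox add-ons) could be tested the same way with `unzip -l`.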

At present, in Windows, I use 7-Zip's command-line tool to extract the files and recompress them, but it loses the date/time information on directories, and that's causing me a problem. Note that the .zip files have a lot of rubbish data tagged on the end of them, and Info-ZIP's unzip doesn't seem to like them as a result. That's the reason I need to recompress them all (to lose the garbage) rather than just rename them.

Background: The files result from a data recovery attempt from a corrupt NTFS partition, so they've been located based on their file signatures. The resulting files are all 10M in size as the simple data recovery algorithm can't detect the file sizes. In many cases this means the real .zip file is probably only 1% of the total file with the rest being garbage. It also means that file types that "look like" zip files get recovered as .zip files, such as OpenOffice docs, Firefox plugins, etc. After the files are extracted I can often determine the correct filetype to use after recompression.

One other side effect of this recovery process is that a lot of the files actually contain duplicate data, so once the garbage has been removed I can check for duplicate files and dispose of the copies. The .zip files store the file dates and times, though, so if the original timestamps are lost in the process I end up with .zip files which only differ in the dates, but that's enough that a hash of the .zip file used to detect duplicates will find them to be different. A potential solution is to just "touch" all the directories to a specific date, which is better than the status quo when it comes to detecting duplicates, but still not as good as retaining the original timestamps.

Mark

_______________________________________________
Peterboro mailing list
[email protected]
https://mailman.lug.org.uk/mailman/listinfo/peterboro
