I understand if you're looking for a more permanent solution, but if you're in a pinch and just want to do a cursory check before you're able to complete the script for the VP, I would recommend DoubleKiller. It lets you pick the directory or directories to scan and filter on wildcards (*.iso, *.zip, etc.).
http://www.bigbangenterprises.de/en/doublekiller/

Sorry I'm not any help on the PS side :-)

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Kurt Buff
Sent: Thursday, July 30, 2015 3:09 PM
To: [email protected]
Subject: Re: [powershell] Need some pointers on an exercise I've set for myself

File store approaches 3 TB now - just about 290 GB free on a 3.1 TB partition.

The concern is that I've noticed a fair number of ISO files (and potentially a lot of other files including zip and other archives, and mpegs, etc.) that seem to be duplicates of each other.

I want to generate a report for the VP of engineering, and let him know how bad the situation is - I'm going to guess there's close to 1 TB of redundancy currently.

Yes, this will consume hours of time, but I can launch it over a weekend and take a look on the Monday following.

I like your idea of restartability, though - it's worth looking at as a secondary goal.

Kurt

On Thu, Jul 30, 2015 at 2:19 PM, James Button <[email protected]> wrote:
> Not experienced enough in powershell to suggest code, BUT I would
> advise that you make the process run as a restartable facility such
> that the process can be interrupted (if not by escape or ctrl+c, then
> by task killing) and then, when restarted, will continue processing a
> list of files from the one after the last one for which a result was
> recorded.
>
> Working on the basis that you have a 1 TB file store, and are working
> towards a 3 or 6 TB filestore, even assuming your filestore connection
> runs at 8 Gb/sec, as in 60 GB per minute, that's surely going to be an
> hour's full-time use of the interface, and I'd really expect the
> hashing process to take getting on for a day elapsed if the system is
> running - spinning media on a more common interface connection, rather
> than a solid state store on the fastest possible multi-channel
> interface.
>
> You may also need to consider the system overhead in assembling the
> list of files - sheer volume of the MFT to be processed. I know from
> a fair amount of the restructuring work I used to do for clients on a
> 4 GB memory system with caddy'd drives - such as renaming files that
> filled a 1 TB drive, for access as 'home drives' - before you had all
> the maintenance goodies in the admin facilities.
>
> (Having taken a complete list of files, stuck them in Excel, sorted
> them there, and generated a set of rename commands.)
>
> It took more time processing the MFT entries to "rename" the files in
> situ than it did to copy them to another drive with the new names,
> simply because of the thrashing on the MFT blocks in the OS-allocated
> disk read cache.
>
> JimB
>
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Kurt Buff
> Sent: Thursday, July 30, 2015 8:45 PM
> To: [email protected]
> Subject: [powershell] Need some pointers on an exercise I've set for
> myself
>
> I'm putting together what should be a simple little script, and failing.
>
> I am ultimately looking to run this against a directory, then sort the
> output on the hash field and then parse for duplicates. There are two
> conditions that concern me: 1) there are over 3 million files in the
> target directory, and 2) many of the files are quite large, over 1 GB.
>
> I'm more concerned about the effects of the script on memory than on
> processor - the data is fairly static, and I intend to run it once a
> month or even less, but I did choose MD5 as the hash algorithm for
> speed, rather than accept the default of SHA256.
>
> This is pretty simple stuff, I'm sure, but I'm using this as a
> learning exercise more than anything, as there are duplicate file
> finders out in the world already.
>
> There are several problems with what I have put together so far, which
> is this:
>
>    Get-ChildItem c:\stuff -Recurse | select length, fullname |
>    export-csv -NoTypeInformation c:\temp\files.csv
>    Import-CSV C:\temp\files.csv | ForEach-Object { (get-filehash
>    -algorithm md5 $_.FullName) }; Length | Sort hash
>
> Using Length (or $_.Length) anywhere in the foreach statement gives an
> error, or gives weird output.
>
> Sample output when not using Length, and therefore getting reasonable
> output (extra spaces and hyphen delimiters elided):
>
>    Algorithm Hash                             Path
>    MD5       592BE1AD0ED83C36D5E68CA7A014A510 C:\stuff\Tools\SomeFile.DOC
>
> What I'd like to see instead:
>
>    Hash                             Length Path
>    592BE1AD0ED83C36D5E68CA7A014A510 79872  C:\stuff\Tools\SomeFile.DOC
>
> If anyone can offer some instruction, I'd appreciate it.
>
> Kurt
>
>
> ================================================
> Did you know you can also post and find answers on PowerShell in the forums?
> http://www.myitforum.com/forums/default.asp?catApp=1

Confidentiality Notice: This is a transmission from Community Hospital of the Monterey Peninsula. This message and any attached documents may be confidential and contain information protected by state and federal medical privacy statutes. They are intended only for the use of the addressee. If you are not the intended recipient, any disclosure, copying, or distribution of this information is strictly prohibited. If you received this transmission in error, please accept our apologies and notify the sender. Thank you.
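
For the question at the bottom of the thread (getting Hash, Length, and Path on one row), here is a minimal sketch rather than a definitive answer. It assumes PowerShell 4.0 or later, where Get-FileHash is available. The function name Get-HashReport and the output file c:\temp\hashes.csv are placeholders made up for illustration; nothing in the thread uses them. The idea is to hold the file object in a variable inside ForEach-Object, so that its Length is still reachable after Get-FileHash has returned its own object, and to build each output row explicitly.

```powershell
# Sketch only: Get-HashReport is a hypothetical helper name, not from the thread.
function Get-HashReport {
    param(
        [Parameter(Mandatory = $true)]
        [string]$Root
    )

    # -File skips directories, which have no content to hash.
    Get-ChildItem -LiteralPath $Root -Recurse -File | ForEach-Object {
        # Keep the file object; $_ is the only place Length is available.
        $file = $_
        $hash = Get-FileHash -LiteralPath $file.FullName -Algorithm MD5

        # Emit one row per file with exactly the three wanted columns.
        [pscustomobject]@{
            Hash   = $hash.Hash
            Length = $file.Length
            Path   = $file.FullName
        }
    }
}

# Example use (paths are assumptions; adjust before running):
#
#   Get-HashReport -Root 'c:\stuff' |
#       Export-Csv -Path 'c:\temp\hashes.csv' -NoTypeInformation
#
#   Import-Csv -Path 'c:\temp\hashes.csv' |
#       Group-Object -Property Hash |
#       Where-Object { $_.Count -gt 1 } |
#       ForEach-Object { $_.Group } |
#       Sort-Object -Property Hash, Path
```

The hashing step streams one file at a time through the pipeline, so it should not need to hold the whole file list in memory. The Group-Object step does collect every row before it emits anything, so with around 3 million rows it is the part most likely to use a noticeable amount of memory. Files that cannot be read will produce an error from Get-FileHash and a row with an empty Hash, which is worth filtering out before grouping.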
