I'm not experienced enough in PowerShell to suggest code, but I would advise making the process restartable: it should be possible to interrupt it (if not with Escape or Ctrl+C, then by killing the task) and, when restarted, have it continue through the list of files from the one after the last file for which a result was recorded. Working on the basis that you have a 1TB file store and are heading towards 3TB or 6TB: even assuming your file store connection runs at 8Gb/sec, which is about 60GB per minute, reading 3TB to 6TB is surely in the region of an hour or two of full-time use of the interface. I'd really expect the hashing to take getting on for a day elapsed if the system is running spinning media on a more common interface connection, rather than a solid-state store on the fastest possible multi-channel interface.
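A minimal sketch of such a restartable run, offered as an illustration rather than tested advice. It assumes the file list has already been exported to a CSV with FullName and Length columns (as in the original message below), and that exactly one result row is written per list row, so the number of result rows is the resume point. The function name Invoke-ResumableHash and its parameter names are hypothetical.

```powershell
# Sketch only: Invoke-ResumableHash, -ListCsv and -ResultCsv are hypothetical names.
function Invoke-ResumableHash {
    param(
        [Parameter(Mandatory)] [string] $ListCsv,    # file list with FullName and Length columns
        [Parameter(Mandatory)] [string] $ResultCsv   # results file, appended to one row at a time
    )

    # One result row is written per list row, so the number of rows
    # already in the results file tells us where to pick up again.
    $doneCount = 0
    if (Test-Path -LiteralPath $ResultCsv) {
        $doneCount = (Import-Csv -LiteralPath $ResultCsv | Measure-Object).Count
    }

    Import-Csv -LiteralPath $ListCsv |
        Select-Object -Skip $doneCount |
        ForEach-Object {
            # Unreadable or vanished files still get a row (with an empty hash)
            # so that the row count stays in step with the list.
            $h = Get-FileHash -LiteralPath $_.FullName -Algorithm MD5 -ErrorAction SilentlyContinue
            $hashText = if ($h) { $h.Hash } else { '' }

            # Append and close the file for every row, so an interrupted run
            # loses at most the file being hashed at that moment.
            [pscustomobject]@{
                Hash   = $hashText
                Length = $_.Length
                Path   = $_.FullName
            } | Export-Csv -LiteralPath $ResultCsv -NoTypeInformation -Append
        }
}
```

Opening and closing the results file for every row costs a little time, but that should be small next to hashing large files. Two caveats: the list and results files should live outside the directory being scanned, and a kill that lands in the middle of a write could leave a partial last row, which this sketch does not try to repair.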
You may also need to consider the system overhead of assembling the list of files, given the sheer volume of MFT entries to be processed. I know this from a fair amount of the restructuring work I used to do for clients on a 4GB memory system with caddied drives, such as renaming the files that filled a 1TB drive for access as 'home drives', before the admin facilities had all the maintenance goodies. (I took a complete list of the files, put it in Excel, sorted it there, and generated a set of rename commands.) It took more time to process the MFT entries to "rename" the files in situ than it did to copy them to another drive with the new names, simply because of the thrashing of the MFT blocks in the OS-allocated disk read cache.

JimB

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Kurt Buff
Sent: Thursday, July 30, 2015 8:45 PM
To: [email protected]
Subject: [powershell] Need some pointers on an exercise I've set for myself

I'm putting together what should be a simple little script, and failing. I am ultimately looking to run this against a directory, then sort the output on the hash field and then parse for duplicates.

There are two conditions that concern me: 1) there are over 3 million files in the target directory, and 2) many of the files are quite large, over 1GB. I'm more concerned about the effects of the script on memory than on processor. The data is fairly static, and I intend to run it once a month or even less, but I did choose MD5 as the hash algorithm for speed, rather than accept the default of SHA256.

This is pretty simple stuff, I'm sure, but I'm using this as a learning exercise more than anything, as there are duplicate file finders out in the world already.
There are several problems with what I have put together so far, which is this:

    Get-ChildItem c:\stuff -Recurse | select length, fullname | export-csv -NoTypeInformation c:\temp\files.csv
    Import-CSV C:\temp\files.csv | ForEach-Object { (get-filehash -algorithm md5 $_.FullName) }; Length | Sort hash

Using Length (or $_.Length) anywhere in the foreach statement gives an error, or gives weird output.

Sample output when not using Length, and therefore getting reasonable output (extra spaces and hyphen delimiters elided):

    Algorithm  Hash                              Path
    MD5        592BE1AD0ED83C36D5E68CA7A014A510  C:\stuff\Tools\SomeFile.DOC

What I'd like to see instead:

    Hash                              Length  Path
    592BE1AD0ED83C36D5E68CA7A014A510  79872   C:\stuff\Tools\SomeFile.DOC

If anyone can offer some instruction, I'd appreciate it.

Kurt

================================================
Did you know you can also post and find answers on PowerShell in the forums?
http://www.myitforum.com/forums/default.asp?catApp=1
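For the Hash / Length / Path output asked for in the original message, one possible approach is sketched here. The objects returned by Get-FileHash carry only Algorithm, Hash and Path, so the length has to be taken from the file object and added with a calculated property. The helper name Get-HashWithLength is hypothetical, and this is a sketch under those assumptions rather than a definitive answer.

```powershell
# Sketch only: Get-HashWithLength is a hypothetical helper name.
function Get-HashWithLength {
    param(
        [Parameter(Mandatory)] [string] $Root   # directory to scan recursively
    )

    Get-ChildItem -LiteralPath $Root -Recurse -File | ForEach-Object {
        # Keep a reference to the file object: inside the calculated
        # property below, $_ is the Get-FileHash result, not the file.
        $file = $_

        Get-FileHash -LiteralPath $file.FullName -Algorithm MD5 |
            Select-Object Hash,
                          @{ Name = 'Length'; Expression = { $file.Length } },
                          Path
    }
}

# Possible usage (paths taken from the original message):
#   Get-HashWithLength -Root 'C:\stuff' |
#       Sort-Object Hash |
#       Export-Csv -NoTypeInformation C:\temp\hashes.csv
#
# Duplicates are then the hash values that occur more than once:
#   Import-Csv C:\temp\hashes.csv | Group-Object Hash | Where-Object { $_.Count -gt 1 }
```

Get-FileHash reads each file as a stream, so large files should not be loaded into memory whole. Sort-Object and Group-Object do hold every result row in memory at once, which is worth keeping in mind with several million files.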
