I'm not experienced enough in PowerShell to offer tested code, but I would 
advise making the process restartable: it should be possible to interrupt it 
(if not with Escape or Ctrl+C, then by killing the task) and, when restarted, 
have it continue through the list of files from the one after the last file 
for which a result was recorded.
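
As a rough illustration only, and not something I have run against a large 
file store, the restartable idea might be sketched like this. The function 
name Invoke-ResumableHash and its parameters are made up for the example. It 
assumes PowerShell 4 or later (for Get-FileHash) and that the result file 
lives outside the directory being hashed. It also writes Hash, Length and Path 
in one row, which is the output Kurt asked for.

```powershell
function Invoke-ResumableHash {
    param(
        [Parameter(Mandatory)] [string] $Root,        # directory to hash
        [Parameter(Mandatory)] [string] $ResultFile   # CSV of results so far
    )

    # Paths recorded by an earlier, possibly interrupted, run.
    # Windows paths are case-insensitive, so compare them that way.
    $done = New-Object 'System.Collections.Generic.HashSet[string]' `
        -ArgumentList ([System.StringComparer]::OrdinalIgnoreCase)
    if (Test-Path -LiteralPath $ResultFile) {
        Import-Csv -LiteralPath $ResultFile |
            ForEach-Object { [void]$done.Add($_.Path) }
    }

    Get-ChildItem -LiteralPath $Root -Recurse -File |
        Where-Object { -not $done.Contains($_.FullName) } |
        ForEach-Object {
            $hash = Get-FileHash -LiteralPath $_.FullName -Algorithm MD5
            # Append each row as soon as it exists, so an interruption
            # loses at most the file being hashed at that moment.
            [pscustomobject]@{
                Hash   = $hash.Hash
                Length = $_.Length
                Path   = $_.FullName
            } | Export-Csv -LiteralPath $ResultFile -NoTypeInformation -Append
        }
}

# Example call (paths are placeholders):
# Invoke-ResumableHash -Root 'C:\stuff' -ResultFile 'C:\temp\hashes.csv'
```

Two caveats. The set of already-recorded paths is held in memory, which for 
3m files is not trivial. And a task kill in the middle of a write could leave 
a partial last line in the CSV that would need trimming before a restart.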
 
Working on the basis that you have a 1TB file store and are heading towards a 
3TB or 6TB file store: even assuming your connection to it runs at 8Gb/sec, 
i.e. 60GB per minute, reading 3TB takes about 50 minutes and 6TB about 100 
minutes of full-time use of the interface. I'd really expect the hashing 
process to take getting on for a day elapsed if the system is running spinning 
media on a more common interface, rather than a solid-state store on the 
fastest possible multi-channel interface.
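
The arithmetic behind that estimate can be checked with a few lines. This is 
only a back-of-envelope sketch: Get-ReadMinutes is a made-up helper name, the 
units are decimal (1TB = 1000GB), and the 9GB per minute figure for a spinning 
disk (roughly 150MB/sec sustained) is my assumption, not a measurement.

```powershell
function Get-ReadMinutes {
    param(
        [double] $TB,           # size of the file store in TB
        [double] $GBPerMinute   # sustained read rate in GB per minute
    )
    # Minutes needed to read the whole store once at the given rate.
    $TB * 1000 / $GBPerMinute
}

# 8Gb/sec is 1GB/sec, i.e. 60GB per minute, ignoring protocol overhead.
foreach ($tb in 1, 3, 6) {
    '{0}TB at 60GB/min: about {1:N0} minutes' -f $tb, (Get-ReadMinutes $tb 60)
}

# Assumed spinning-disk rate of 9GB per minute (about 150MB/sec).
'{0}TB at 9GB/min: about {1:N1} hours' -f 6, ((Get-ReadMinutes 6 9) / 60)
```

At the assumed spinning-disk rate, a single pass over 6TB comes out at roughly 
11 hours of pure reading, before any per-file overhead.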

You may also need to consider the system overhead of assembling the list of 
files, given the sheer volume of MFT entries to be processed. I know this from 
a fair amount of the restructuring work I used to do for clients on a 4GB 
memory system with caddy'd drives, such as renaming the files that filled a 
1TB drive for access as 'home drives', before the admin facilities had all the 
maintenance goodies.

(I had taken a complete list of files, put it in Excel, sorted it there, and 
generated a set of rename commands.)

It took more time to process the MFT entries to "rename" the files in situ 
than it did to copy them to another drive with the new names, simply because 
of the thrashing on the MFT blocks in the OS-allocated disk read cache.

JimB


-----Original Message-----
From: [email protected] [mailto:[email protected]] On 
Behalf Of Kurt Buff
Sent: Thursday, July 30, 2015 8:45 PM
To: [email protected]
Subject: [powershell] Need some pointers on an exercise I've set for myself

I'm putting together what should be a simple little script, and failing.

I am ultimately looking to run this against a directory, then sort the
output on the hash field and then parse for duplicates. There are two
conditions that concern me: 1) there are over 3m files in the target
directory, and 2) many of the files are quite large, over 1g.

I'm more concerned about the effects of the script on memory than on
processor - the data is fairly static, and I intend to run it once a
month or even less, but I did choose MD5 as the hash algorithm for
speed, rather than accept the default of SHA256.

This is pretty simple stuff, I'm sure, but I'm using this as a
learning exercise more than anything, as there are duplicate file
finders out in the world already.

There are several problems with what I have put together so far, which
is this:

     Get-ChildItem c:\stuff -Recurse | select length, fullname | export-csv -NoTypeInformation c:\temp\files.csv
     Import-CSV C:\temp\files.csv | ForEach-Object { (get-filehash -algorithm md5 $_.FullName) }; Length | Sort hash
Using Length (or $_.Length) anywhere in the foreach statement gives an
error, or gives weird output.

Sample Output when not using Length, and therefore getting reasonable
output (extra spaces and hyphen delimiters elided):
     Algorithm    Hash                               Path
     MD5          592BE1AD0ED83C36D5E68CA7A014A510   C:\stuff\Tools\SomeFile.DOC

What I'd like to see instead
     Hash                               Length   Path
     592BE1AD0ED83C36D5E68CA7A014A510   79872    C:\stuff\Tools\SomeFile.DOC

If anyone can offer some instruction, I'd appreciate it.

Kurt


================================================
Did you know you can also post and find answers on PowerShell in the forums?
http://www.myitforum.com/forums/default.asp?catApp=1
