That's an interesting thought. Generate the list of files, sort by size, then hash only those whose sizes match each other.
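A minimal sketch of that size-first approach, assuming PowerShell 4 or later (for Get-FileHash). The function name Find-DuplicateFile and its -Root parameter are made up for illustration. Note that Group-Object collects every file entry before it emits anything, so with millions of files this trades memory for the saved hashing time.

```powershell
# Hypothetical helper: hashes only files whose size is shared with another file.
function Find-DuplicateFile {
    param([Parameter(Mandatory)] [string] $Root)

    Get-ChildItem -LiteralPath $Root -Recurse -File |
        Group-Object -Property Length |
        Where-Object { $_.Count -gt 1 } |      # a unique size cannot have a duplicate
        ForEach-Object { $_.Group } |
        ForEach-Object {
            [pscustomobject]@{
                Hash   = (Get-FileHash -Algorithm MD5 -LiteralPath $_.FullName).Hash
                Length = $_.Length
                Path   = $_.FullName
            }
        } |
        Group-Object -Property Hash |
        Where-Object { $_.Count -gt 1 } |      # same size and same hash: likely duplicates
        ForEach-Object { $_.Group } |
        Sort-Object -Property Hash, Path
}
```

Usage would be something like `Find-DuplicateFile -Root C:\stuff | Export-Csv -NoTypeInformation C:\temp\dupes.csv`, using the paths from the original question.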
Excellent idea. That should speed things up quite a bit by cutting the processor time spent on hashing.

Thanks!

Kurt

On Thu, Jul 30, 2015 at 6:12 PM, Weber, Mark A <[email protected]> wrote:
> If your (only) goal is to find duplicate files, you might want to add logic
> to only hash files that are the same size - m
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Kurt Buff
> Sent: Thursday, July 30, 2015 6:29 PM
> To: [email protected]
> Subject: Re: [powershell] Need some pointers on an exercise I've set for
> myself
>
> Outstanding.
>
> I worked really hard at getting option 3 to work, and with your help that
> works for me now.
>
> But, can you tell me if any of these approaches will put significantly less
> pressure on memory than the other two?
>
> I think I can break this down a bit further, though. I'm thinking of
> exporting the data with the hashes to another file, then using that file to
> do the sorting and detect the duplicates.
>
> Kurt
>
> On Thu, Jul 30, 2015 at 3:33 PM, Bailey, Doug <[email protected]> wrote:
>> In the sample code that you provided, you're only outputting the result of
>> Get-FileHash to the pipeline but it sounds like you want to add it to what's
>> in the Import-CSV objects. Here are 3 methods of doing that:
>>
>> Get-ChildItem . -Recurse | select length, fullname | export-csv -NoTypeInformation $env:TEMP\files.csv
>>
>> # Method 1
>> # Add-Member with -PassThru
>> Import-CSV $env:TEMP\files.csv | ForEach-Object {
>>     $_ | Add-Member -MemberType NoteProperty -Name Hash -Value (get-filehash -algorithm md5 $_.FullName).Hash -PassThru
>> } | Sort hash
>>
>>
>> # Method 2
>> # Create a PSCustomObject with hash table
>> Import-CSV $env:TEMP\files.csv | ForEach-Object {
>>     [pscustomObject] @{
>>         Hash = (get-filehash -algorithm md5 $_.FullName).Hash
>>         Length = $_.length
>>         Path = $_.FullName
>>     }
>> } | Sort hash
>>
>> # Method 3
>> # Select-Object with Name/Expression hash table as a Property parameter
>> Import-CSV $env:TEMP\files.csv | Select-Object -Property @{Name="Hash";Expression={(get-filehash -algorithm md5 $_.FullName).Hash}},Length,FullName | Sort hash
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Kurt Buff
>> Sent: Thursday, July 30, 2015 5:09 PM
>> To: [email protected]
>> Subject: Re: [powershell] Need some pointers on an exercise I've set
>> for myself
>>
>> File store approaches 3tb now - just about 290gb free on a 3.1tb partition.
>>
>> The concern is that I've noticed a fair number of ISO files (and potentially
>> a lot of other files including zip and other archives, and mpegs, etc.) that
>> seem to be duplicates of each other.
>>
>> I want to generate a report for the VP of engineering, and let him know how
>> bad the situation is - I'm going to guess there's close to 1tb of redundancy
>> currently.
>>
>> Yes, this will consume hours of time, but I can launch it over a weekend and
>> take a look on the Monday following.
>>
>> I like your idea of restartability, though - it's worth looking at as a
>> secondary goal.
>>
>> Kurt
>>
>> On Thu, Jul 30, 2015 at 2:19 PM, James Button <[email protected]>
>> wrote:
>>> Not experienced enough in powershell to suggest code, BUT I would
>>> advise that you make the process run as a restartable facility, such that
>>> the process can be interrupted (if not by escape or ctrl+c, then by task
>>> killing) and then, when restarted, will continue processing a list of files
>>> from the one after the last one for which a result was recorded.
>>>
>>> Working on the basis that you have a 1TB file store, and are working
>>> towards a 3 or 6TB filestore, even assuming your filestore connection runs
>>> at 8Gb/sec, as in 60GB per minute, that's surely going to be an hour's
>>> full-time use of the interface, and I'd really expect the hashing process to
>>> take getting on for a day elapsed if the system is running spinning media
>>> on a more common interface connection, rather than a solid state store on
>>> the fastest possible multi-channel interface.
>>>
>>> You may also need to consider the system overhead in assembling the
>>> list of files - the sheer volume of the MFT to be processed. I know this
>>> from a fair amount of the restructuring work I used to do for clients on a
>>> 4GB memory system with caddy'd drives, such as renaming files that filled a
>>> 1TB drive, for access as 'home drives', before you had all the maintenance
>>> goodies in the admin facilities.
>>>
>>> (Having taken a complete list of files, stuck them in Excel, sorted
>>> them there, and generated a set of rename commands.)
>>>
>>> It took more time processing the MFT entries to "rename" the files in situ
>>> than it did to copy them to another drive with the new names,
>>> simply because of the thrashing on the MFT blocks in the OS allocated disk
>>> read cache.
>>>
>>> JimB
>>>
>>>
>>> -----Original Message-----
>>> From: [email protected]
>>> [mailto:[email protected]] On Behalf Of Kurt Buff
>>> Sent: Thursday, July 30, 2015 8:45 PM
>>> To: [email protected]
>>> Subject: [powershell] Need some pointers on an exercise I've set for
>>> myself
>>>
>>> I'm putting together what should be a simple little script, and failing.
>>>
>>> I am ultimately looking to run this against a directory, then sort
>>> the output on the hash field and then parse for duplicates. There are
>>> two conditions that concern me: 1) there are over 3m files in the
>>> target directory, and 2) many of the files are quite large, over 1g.
>>>
>>> I'm more concerned about the effects of the script on memory than on
>>> processor - the data is fairly static, and I intend to run it once a
>>> month or even less, but I did choose MD5 as the hash algorithm for
>>> speed, rather than accept the default of SHA256.
>>>
>>> This is pretty simple stuff, I'm sure, but I'm using this as a
>>> learning exercise more than anything, as there are duplicate file
>>> finders out in the world already.
>>>
>>> There are several problems with what I have put together so far,
>>> which is this:
>>>
>>> Get-ChildItem c:\stuff -Recurse | select length, fullname |
>>> export-csv -NoTypeInformation c:\temp\files.csv
>>> Import-CSV C:\temp\files.csv | ForEach-Object { (get-filehash -algorithm md5 $_.FullName) }; Length | Sort hash
>>>
>>> Using Length (or $_.Length) anywhere in the foreach statement gives
>>> an error, or gives weird output.
>>>
>>> Sample output when not using Length, and therefore getting reasonable
>>> output (extra spaces and hyphen delimiters elided):
>>> Algorithm  Hash                              Path
>>> MD5        592BE1AD0ED83C36D5E68CA7A014A510  C:\stuff\Tools\SomeFile.DOC
>>>
>>> What I'd like to see instead:
>>> Hash                              Length  Path
>>> 592BE1AD0ED83C36D5E68CA7A014A510  79872   C:\stuff\Tools\SomeFile.DOC
>>>
>>> If anyone can offer some instruction, I'd appreciate it.
>>>
>>> Kurt
>>>
>>>
>>> ================================================
>>> Did you know you can also post and find answers on PowerShell in the forums?
>>> http://www.myitforum.com/forums/default.asp?catApp=1
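The restartable run suggested earlier in the thread could look roughly like the sketch below. It assumes a file list CSV with Length and FullName columns (as produced by the Get-ChildItem | export-csv step in the thread) and appends each result to a second CSV as soon as it is computed, so an interrupted run can skip whatever is already recorded. The function name Add-FileHashToLog and both parameter names are hypothetical.

```powershell
# Hypothetical helper: hash the files in $ListCsv, appending results to $HashCsv.
function Add-FileHashToLog {
    param(
        [Parameter(Mandatory)] [string] $ListCsv,   # input: Length, FullName columns
        [Parameter(Mandatory)] [string] $HashCsv    # output: Hash, Length, Path columns
    )

    # Paths already recorded by an earlier, interrupted run.
    # PowerShell hashtable keys are case-insensitive, which suits Windows paths.
    $done = @{}
    if (Test-Path -LiteralPath $HashCsv) {
        Import-Csv -LiteralPath $HashCsv | ForEach-Object { $done[$_.Path] = $true }
    }

    Import-Csv -LiteralPath $ListCsv |
        Where-Object { -not $done.ContainsKey($_.FullName) } |
        ForEach-Object {
            # Append one row per file so each result is on disk before the next hash starts.
            [pscustomobject]@{
                Hash   = (Get-FileHash -Algorithm MD5 -LiteralPath $_.FullName).Hash
                Length = $_.Length
                Path   = $_.FullName
            } | Export-Csv -LiteralPath $HashCsv -NoTypeInformation -Append
        }
}
```

Because only one file record flows through the pipeline at a time, the main memory cost is the table of already-hashed paths; the sorting and duplicate detection can then be done later against the results file.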

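For the redundancy report mentioned in the thread, a rough sketch follows. It assumes a CSV that already holds Hash, Length, and Path columns (the output shape asked for in the original question) and estimates how many bytes would be freed if only one copy of each duplicated file were kept. The function name Measure-DuplicateSpace and its parameter are hypothetical.

```powershell
# Hypothetical helper: summarize duplicate groups from a CSV of Hash, Length, Path.
function Measure-DuplicateSpace {
    param([Parameter(Mandatory)] [string] $HashCsv)

    Import-Csv -LiteralPath $HashCsv |
        Group-Object -Property Hash |
        Where-Object { $_.Count -gt 1 } |
        ForEach-Object {
            # Import-Csv yields strings, so convert the size before doing arithmetic.
            $size = [long]$_.Group[0].Length
            [pscustomobject]@{
                Hash           = $_.Name
                Copies         = $_.Count
                Length         = $size
                RedundantBytes = $size * ($_.Count - 1)   # space beyond the first copy
            }
        } |
        Sort-Object -Property RedundantBytes -Descending
}
```

A grand total can then be taken with `(Measure-DuplicateSpace -HashCsv C:\temp\hashes.csv | Measure-Object -Property RedundantBytes -Sum).Sum`, where the CSV path is only an example.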