That's an interesting thought.

Generate the list of files, sort by size, then hash only those
whose sizes match another file's.

Excellent idea. That should speed things up quite a bit by eliminating
processor time for hashing.
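
Here's a rough sketch of how that might look. It assumes PowerShell 4.0
or later (for Get-FileHash), and the function name
Find-DuplicateCandidate is just a placeholder I made up, so treat it as
a starting point rather than a finished script:

```powershell
# Hypothetical helper (the name is illustrative only).
# Hashes only files whose size matches at least one other file,
# then returns the files whose hashes also match.
function Find-DuplicateCandidate {
    param(
        [Parameter(Mandatory)]
        [string] $Path
    )

    Get-ChildItem -LiteralPath $Path -Recurse -File |
        Group-Object -Property Length |
        Where-Object { $_.Count -gt 1 } |      # a unique size cannot have a duplicate
        ForEach-Object { $_.Group } |          # unroll each group back into files
        ForEach-Object {
            [pscustomobject]@{
                Hash   = (Get-FileHash -Algorithm MD5 -LiteralPath $_.FullName).Hash
                Length = $_.Length
                Path   = $_.FullName
            }
        } |
        Group-Object -Property Hash |
        Where-Object { $_.Count -gt 1 } |      # keep only hashes seen more than once
        ForEach-Object { $_.Group } |
        Sort-Object -Property Hash, Path
}

# Example call against the directory from my original post:
# Find-DuplicateCandidate -Path C:\stuff | Export-Csv -NoTypeInformation C:\temp\dupes.csv
```

One caveat: Group-Object collects the whole file list before it emits
anything, so this probably doesn't help with memory, only with the
hashing time.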

Thanks!

Kurt



On Thu, Jul 30, 2015 at 6:12 PM, Weber, Mark A <[email protected]> wrote:
> If your (only) goal is to find duplicate files, you might want to add logic 
> to only hash files that are the same size - m
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] 
> On Behalf Of Kurt Buff
> Sent: Thursday, July 30, 2015 6:29 PM
> To: [email protected]
> Subject: Re: [powershell] Need some pointers on an exercise I've set for 
> myself
>
> Outstanding.
>
> I worked really hard at getting option 3 to work, and with your help that 
> works for me now.
>
> But, can you tell me if any of these approaches will put significantly less 
> pressure on memory than the other two?
>
> I think I can break this down a bit further, though. I'm thinking of 
> exporting the data with the hashes to another file, then using that file to 
> do the sorting and detect the duplicates.
>
> Kurt
>
> On Thu, Jul 30, 2015 at 3:33 PM, Bailey, Doug <[email protected]> wrote:
>> In the sample code that you provided, you're only outputting the result of 
>> Get-FileHash to the pipeline but it sounds like you want to add it to what's 
>> in the Import-CSV objects. Here are 3 methods of doing that:
>>
>> Get-ChildItem . -Recurse | select length, fullname |
>>     export-csv -NoTypeInformation $env:TEMP\files.csv
>>
>> # Method 1
>> # Add-Member with -PassThru
>> Import-CSV $env:TEMP\files.csv | ForEach-Object {
>>     $hash = (get-filehash -algorithm md5 $_.FullName).Hash
>>     $_ | Add-Member -MemberType NoteProperty -Name Hash -Value $hash -PassThru
>> } | Sort hash
>>
>>
>> # Method 2
>> # Create a PSCustomObject with hash table
>> Import-CSV $env:TEMP\files.csv | ForEach-Object {
>>     [pscustomObject] @{
>>         Hash = (get-filehash -algorithm md5 $_.FullName).Hash
>>         Length = $_.length
>>         Path = $_.FullName
>>     }
>> }  | Sort hash
>>
>> # Method 3
>> # Select-Object with Name/Expression hash table as a Property parameter
>> Import-CSV $env:TEMP\files.csv | Select-Object -Property @{
>>     Name = "Hash"
>>     Expression = { (get-filehash -algorithm md5 $_.FullName).Hash }
>> }, Length, FullName | Sort hash
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Kurt Buff
>> Sent: Thursday, July 30, 2015 5:09 PM
>> To: [email protected]
>> Subject: Re: [powershell] Need some pointers on an exercise I've set
>> for myself
>>
>> File store approaches 3tb now - just about 290gb free on a 3.1tb partition.
>>
>> The concern is that I've noticed a fair number of ISO files (and potentially 
>> a lot of other files including zip and other archives, and mpegs, etc.) that 
>> seem to be duplicates of each other.
>>
>> I want to generate a report for the VP of engineering, and let him know how 
>> bad the situation is - I'm going to guess there's close to 1tb of redundancy 
>> currently.
>>
>> Yes, this will consume hours of time, but I can launch it over a weekend and 
>> take a look on the Monday following.
>>
>> I like your idea of restartability, though - it's worth looking at as a 
>> secondary goal.
>>
>> Kurt
>>
>> On Thu, Jul 30, 2015 at 2:19 PM, James Button <[email protected]> 
>> wrote:
>>> Not experienced enough in powershell to suggest code, BUT I would
>>> advise that you make the process run as a restartable facility such that 
>>> the process can be interrupted ( if not by escape or ctrl+c, then by task 
>>> killing) and then, when restarted will continue processing a list of files 
>>> from the one after the last one for which a result was recorded.
>>>
>>> Working on the basis that you have a 1TB file store, and are working 
>>> towards a 3, or 6TB filestore, even assuming your filestore connection runs 
>>> at 8Gb/sec, as in 60GB per minute, that's surely going to be an hour's full 
>>> time use of the interface, and I'd really expect the hashing process to 
>>> take getting on for a day elapsed if the system is running - spinning media 
>>> on a more common interface connection, rather than a solid state store on 
>>> the fastest possible multi-channel interface.
>>>
>>> You may also need to consider the system overhead in assembling the
>>> list of files - sheer volume of the MFT to be processed, I know from
>>> a fair amount of the restructuring work I used to do for clients on a 4GB 
>>> memory system  with caddy'd drives - Such as renaming files that filled a 
>>> 1TB drive, for access as 'home drives' - before you had all the maintenance 
>>> goodies in the admin facilities.
>>>
>>> (Having taken a complete list of files, stuck them in Excel, sorted
>>> them there, and generated a set of rename commands.)
>>>
>>> It took more time processing the MFT entries to "rename" the files in situ 
>>> - than it did to copy them to another drive with the new names.
>>> Simply because of the thrashing on the MFT blocks in the OS allocated disk 
>>> read cache.
>>>
>>> JimB
>>>
>>>
>>> -----Original Message-----
>>> From: [email protected]
>>> [mailto:[email protected]] On Behalf Of Kurt Buff
>>> Sent: Thursday, July 30, 2015 8:45 PM
>>> To: [email protected]
>>> Subject: [powershell] Need some pointers on an exercise I've set for
>>> myself
>>>
>>> I'm putting together what should be a simple little script, and failing.
>>>
>>> I am ultimately looking to run this against a directory, then sort
>>> the output on the hash field and then parse for duplicates. There are
>>> two conditions that concern me: 1) there are over 3m files in the
>>> target directory, and 2) many of the files are quite large, over 1g.
>>>
>>> I'm more concerned about the effects of the script on memory than on
>>> processor - the data is fairly static, and I intend to run it once a
>>> month or even less, but I did choose MD5 as the hash algorithm for
>>> speed, rather than accept the default of SHA256.
>>>
>>> This is pretty simple stuff, I'm sure, but I'm using this as a
>>> learning exercise more than anything, as there are duplicate file
>>> finders out in the world already.
>>>
>>> There are several problems with what I have put together so far,
>>> which is this:
>>>
>>>      Get-ChildItem c:\stuff -Recurse | select length, fullname |
>>>          export-csv -NoTypeInformation c:\temp\files.csv
>>>      Import-CSV C:\temp\files.csv |
>>>          ForEach-Object { (get-filehash -algorithm md5 $_.FullName) }; Length | Sort hash
>>>
>>> Using Length (or $_.Length) anywhere in the foreach statement gives
>>> an error, or gives weird output.
>>>
>>> Sample Output when not using Length, and therefore getting reasonable
>>> output (extra spaces and hyphen delimiters elided):
>>>      Algorithm  Hash                              Path
>>>      MD5        592BE1AD0ED83C36D5E68CA7A014A510  C:\stuff\Tools\SomeFile.DOC
>>>
>>> What I'd like to see instead
>>>      Hash                              Length  Path
>>>      592BE1AD0ED83C36D5E68CA7A014A510  79872   C:\stuff\Tools\SomeFile.DOC
>>>
>>> If anyone can offer some instruction, I'd appreciate it.
>>>
>>> Kurt
>>>
>>>
>>> ================================================
>>> Did you know you can also post and find answers on PowerShell in the forums?
>>> http://www.myitforum.com/forums/default.asp?catApp=1
>>>
>>>
>>>
>>>
>>
>>
>
>

