RE: [powershell] Re: Need some pointers on an exercise I've set for myself

Crawford Scott Mon, 03 Aug 2015 23:55:43 -0700

And as another optimization, only hash files that have another file of the same 
size. There's no point in hashing a bunch of files that have no chance of a 
duplicate.
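
A minimal sketch of that size pre-filter, assuming PowerShell 4.0 or later (for Get-FileHash). The function name Get-SizeCollision is hypothetical and does not come from this thread.

```powershell
# Hypothetical helper: return only the files whose size is shared with at
# least one other file under $Path. A file with a unique size cannot have a
# duplicate, so it never needs to be hashed.
function Get-SizeCollision {
    param([Parameter(Mandatory)][string]$Path)

    Get-ChildItem -LiteralPath $Path -File -Recurse |
        Group-Object -Property Length |
        Where-Object { $_.Count -gt 1 } |
        ForEach-Object { $_.Group }
}

# Usage (S:\ and the CSV path are the ones used elsewhere in this thread):
# Get-SizeCollision -Path 'S:\' |
#     Get-FileHash -Algorithm MD5 |
#     Export-Csv -NoTypeInformation c:\temp\filehash.csv
```

Group-Object holds every file entry in memory until the listing is complete, so on a tree with millions of files this trades some memory for far fewer hash computations.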

Sent from my Windows Phone
________________________________
From: Michael B. Smith<mailto:[email protected]>
Sent: 8/3/2015 11:21 PM
To: [email protected]<mailto:[email protected]>
Subject: RE: [powershell] Re: Need some pointers on an exercise I've set for 
myself

[int] is a bad plan. You should be using [int64].

You didn't sort based on hash, you sorted based on Length ( Sort-Object 
@{Expression={$_.Length -as [int]}} ). Because you truncated Length to a 32 bit 
signed integer, of course you had files that didn't behave as expected. And I'm 
sure they are all over 2 GB in size.
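
A small sketch of the [int] versus [int64] difference, using the Length from the sample row in this thread plus two made-up values. When the -as operator cannot convert a value it returns $null rather than raising an error, and as far as is known Sort-Object places $null ahead of real numbers in ascending order, which would be consistent with a handful of large files landing at the top of the sorted CSV.

```powershell
# Import-Csv delivers every column as a string, so Length starts out as text.
$iso = '3182604288'            # size of the ISO in the sample row

$asInt   = $iso -as [int]      # $null: exceeds [int]::MaxValue (2147483647)
$asInt64 = $iso -as [int64]    # 3182604288

# Sorting on the [int64] value puts the largest file last.
# '79872' and '1024' are illustrative values only.
$sorted = '3182604288', '79872', '1024' |
    Sort-Object @{Expression={$_ -as [int64]}}
```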

Your algorithm is far too complex. You should use a hashtable instead and check 
for a particular Key's presence. Should reduce your logic to about 10 lines.
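
One way the hashtable approach could look, as a sketch only. It assumes rows with the MD5, Length and FullName columns from the CSV in this thread; the function name Find-DuplicateRow and the DupeSet column are hypothetical.

```powershell
# Hypothetical helper: emit only the rows whose MD5 occurs more than once,
# each tagged with a DupeSet number shared by all rows with the same hash.
function Find-DuplicateRow {
    param([Parameter(Mandatory)][object[]]$Row)

    $seen = @{}   # key: MD5 hash, value: list of rows carrying that hash
    foreach ($r in $Row) {
        if (-not $seen.ContainsKey($r.MD5)) {
            $seen[$r.MD5] = New-Object System.Collections.ArrayList
        }
        [void]$seen[$r.MD5].Add($r)
    }

    # Hashtable key order is unspecified, so set numbers are arbitrary labels.
    $set = 0
    foreach ($key in $seen.Keys) {
        if ($seen[$key].Count -gt 1) {
            $set++
            foreach ($r in $seen[$key]) {
                [pscustomobject]@{
                    DupeSet  = $set
                    MD5      = $r.MD5
                    Length   = $r.Length
                    FullName = $r.FullName
                }
            }
        }
    }
}

# Usage (the output file name is an assumption):
# Find-DuplicateRow -Row (Import-Csv c:\temp\filehash.csv) |
#     Export-Csv -NoTypeInformation c:\temp\dupes.csv
```

No sorting is needed with this shape, because the hashtable lookup finds matching hashes regardless of row order.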

-----Original Message-----
From: [email protected] [mailto:[email protected]] On 
Behalf Of Kurt Buff
Sent: Monday, August 3, 2015 8:26 PM
To: [email protected]
Subject: [powershell] Re: Need some pointers on an exercise I've set for myself

Replying to myself, since that seems the reasonable thing to do here.

I've tested the following against a smaller directory that I know has some 
duplicates, and am getting progress. Here is what I have so far (work with the 
line wraps!):

Get-ChildItem S:\ -File -Recurse | select fullname, length | Export-CSV 
-NoTypeInformation c:\temp\files.csv

Import-CSV c:\temp\files.csv | Select-Object -Property 
@{Name="MD5";Expression={(Get-FileHash -algorithm md5 
$_.FullName).Hash}},Length,FullName | Export-CSV -NoTypeInformation 
c:\temp\filehash.csv

Import-CSV C:\temp\checker\fileMD5.csv | Sort-Object @{Expression={$_.Length 
-as [int]}} | Export-CSV -NoTypeInformation c:\temp\checker\FileMD5Sorted.csv

The above generates a file of 315286 lines (not including header) - of course, 
that's the number of files in the directory tree. I get output that looks like 
this (work with the line wraps again):

"MD5","Length","FullName"
"6467C3875955DF4514395F0AFCAAA62A","3182604288","S:\Infrastructure\Microsoft\OSes\Win7EntSP1_64bit\SW_DVD5_SA_Win_Ent_7w_SP1_64BIT_English_-2_MLF_X17-58882.ISO"

I noticed two oddities, however:

o- zero-length files generate a hash, and of course the hash is the same for 
all of them. I probably should have expected that, but it surprised me.

o- I find a handful of files (22 of them) at the top of the csv file after 
sorting that don't seem to obey the sorting on the hash that the other files 
followed. It's very strange. They're not duplicates of any other files; their 
hashes and file sizes are out of sort order from all of the rest, AFAICT. I'm 
not sure what to make of that.
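
On the first oddity: a hash depends only on the bytes it is fed, so every empty file produces the MD5 of zero bytes. A small sketch that computes that value directly with the .NET MD5 class instead of Get-FileHash:

```powershell
# MD5 of an empty byte array, via System.Security.Cryptography.MD5.
$md5   = [System.Security.Cryptography.MD5]::Create()
$bytes = $md5.ComputeHash([byte[]]@())
$md5.Dispose()

# BitConverter renders the digest as "D4-1D-..."; strip the hyphens.
$emptyHash = [System.BitConverter]::ToString($bytes) -replace '-', ''
$emptyHash   # D41D8CD98F00B204E9800998ECF8427E
```

If empty files should not count as duplicates of each other, one option is to drop rows with a Length of 0 before hashing or comparing.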

But, ignoring those two things, I'd like to proceed a bit further:

o- Writing to another file only those lines that are duplicate files, which I 
can do by selecting the lines that have matching hashes (and possibly 
also matching sizes)

o- Possibly adding another column, which would contain an integer that would 
increment for each set of matched files, which would probably lead to...

o- Among other things, calculating the amount of duplicated space (sum of n-1 
file sizes for each set of dupes), identifying duplicate directories that can 
be eliminated in toto, etc.
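
The duplicated-space figure (n-1 file sizes per set) could be computed along these lines. This is a sketch that assumes rows with the MD5 and Length columns used in this thread; the function name Measure-DuplicateSpace is hypothetical.

```powershell
# Hypothetical helper: total bytes that would be freed by keeping one copy
# from each set of files that share an MD5 hash.
function Measure-DuplicateSpace {
    param([Parameter(Mandatory)][object[]]$Row)

    [int64]$wasted = 0
    foreach ($g in ($Row | Group-Object -Property MD5)) {
        if ($g.Count -gt 1) {
            # Files in a set are assumed to be the same size, so n-1 copies
            # are redundant. [int64] avoids the 2 GB ceiling of [int].
            $wasted += ($g.Count - 1) * [int64]$g.Group[0].Length
        }
    }
    $wasted
}

# Usage:
# Measure-DuplicateSpace -Row (Import-Csv c:\temp\filehash.csv)
```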

But, I'm stymied on the execution of the logic. I'm such an inexperienced 
programmer that I'm flailing on the first of these steps. I believe I need to 
make a stepwise comparison of the MD5 column, which I think would look 
something like this:

     $dupe = 1
     read infile.line1 into variable1
     read infile.line2 into variable2
     if {
          variable1.MD5 -eq variable2.MD5
          prefix variable1 with dupe counter
          write variable1 to the new csv file
          while not eof
               set variable1 to the contents of variable2
               read line next into variable2
               compare variable1.MD5 to variable2.MD5
          if match
               prefix variable1 with $dupe
               append variable1 as new line of new csv file
          else
               increment dupe counter
     endwhile }
    else {
          while not eof
               set variable1 to the contents of variable2
               read line next into variable2
               compare variable1.MD5 to variable2.MD5
          if match
               prefix variable1 with $dupe
               append variable1 as new line of new csv file
          else
               increment dupe counter
     endwhile

I realize I could be way off base on the algorithm here, but that's what I've 
been able to dream up.

Anyone care to critique and offer syntax suggestions - my googlefu is about 
exhausted.

Kurt

On Thu, Jul 30, 2015 at 12:45 PM, Kurt Buff <[email protected]> wrote:
> I'm putting together what should be a simple little script, and failing.
>
> I am ultimately looking to run this against a directory, then sort the
> output on the hash field and then parse for duplicates. There are two
> conditions that concern me: 1) there are over 3m files in the target
> directory, and 2) many of the files are quite large, over 1g.
>
> I'm more concerned about the effects of the script on memory than on
> processor - the data is fairly static, and I intend to run it once a
> month or even less, but I did choose MD5 as the hash algorithm for
> speed, rather than accept the default of SHA256.
>
> This is pretty simple stuff, I'm sure, but I'm using this as a
> learning exercise more than anything, as there are duplicate file
> finders out in the world already.
>
> There are several problems with what I have put together so far, which
> is this:
>
>      Get-ChildItem c:\stuff -Recurse | select length, fullname |
> export-csv -NoTypeInformation c:\temp\files.csv
>      Import-CSV C:\temp\files.csv | ForEach-Object { (get-filehash
> -algorithm md5 $_.FullName) }; Length | Sort hash
>
> Using Length (or $_.Length) anywhere in the foreach statement gives an
> error, or gives weird output.
>
> Sample Output when not using Length, and therefore getting reasonable
> output (extra spaces and hyphen delimiters elided):
>      Algorithm   Hash                                Path
>      MD5         592BE1AD0ED83C36D5E68CA7A014A510    C:\stuff\Tools\SomeFile.DOC
>
> What I'd like to see instead
>      Hash                                Length   Path
>      592BE1AD0ED83C36D5E68CA7A014A510    79872    C:\stuff\Tools\SomeFile.DOC
>
> If anyone can offer some instruction, I'd appreciate it.
>
> Kurt

================================================
Did you know you can also post and find answers on PowerShell in the forums?
http://www.myitforum.com/forums/default.asp?catApp=1
