On Mon, Aug 3, 2015 at 9:20 PM, Michael B. Smith <[email protected]> wrote:
> [int] is a bad plan. You should be using [int64].
Makes sense. Thanks.
> You didn't sort based on hash, you sorted based on Length ( Sort-Object
> @{Expression={$_.Length -as [int]}} ). Because you truncated Length to a
> 32-bit signed integer, of course you had files that didn't behave as expected.
> And I'm sure they are all over 2 GB in size.
Argh. You are correct. I have fixed that.
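
For the archives, here is a minimal sketch of what I believe was going wrong and what the corrected sort looks like. The $rows sample data is made up for illustration (only the ISO's size comes from my real output), and my understanding is that -as hands back $null when a conversion fails, which would explain the odd files landing at the top of the sort:

```powershell
# Hypothetical sample rows standing in for Import-Csv output.
# Import-Csv returns every column as a string, so Length is a string here too.
$rows = @(
    [pscustomobject]@{ MD5 = 'A'; Length = '3182604288'; FullName = 'big.iso' }
    [pscustomobject]@{ MD5 = 'B'; Length = '79872';      FullName = 'small.doc' }
    [pscustomobject]@{ MD5 = 'C'; Length = '0';          FullName = 'empty.txt' }
)

# [int] is 32-bit signed, so its maximum is 2147483647 (just under 2 GB).
# -as does not throw on a failed conversion; it yields $null instead.
$asInt   = '3182604288' -as [int]
$asInt64 = '3182604288' -as [int64]

# Sorting on [int64] keeps files over 2 GB in true numeric order.
$sorted = $rows | Sort-Object @{ Expression = { $_.Length -as [int64] } }
$sorted | Format-Table MD5, Length, FullName
```

With [int64] the large ISO sorts last, as it should, instead of dropping out of order.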
> Your algorithm is far too complex. You should use a hashtable instead and
> check for a particular Key's presence. Should reduce your logic to about 10
> lines.
OK - but will that bring the file server to its knees because of memory
usage? Remember, ultimately I want to run this against a partition
that has 3+ million files on it.
Regardless, I will look that up, and see what I can figure out before
I come back here.
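
My first reading of the hashtable idea, as a rough sketch only: key a hashtable on the MD5 value and collect the rows that share each key. The $rows sample data, the DupeSet column name, and the $seen/$wasted variable names are all my own inventions for illustration, and I have not measured what this does to memory with 3+ million rows.

```powershell
# Hypothetical sample rows; in practice these would come from Import-Csv
# on the CSV that holds the MD5, Length and FullName columns.
$rows = @(
    [pscustomobject]@{ MD5 = 'AAA'; Length = '100'; FullName = 'S:\a\one.txt' }
    [pscustomobject]@{ MD5 = 'BBB'; Length = '200'; FullName = 'S:\a\two.txt' }
    [pscustomobject]@{ MD5 = 'AAA'; Length = '100'; FullName = 'S:\b\one-copy.txt' }
)

# Build the lookup: hash value -> list of rows that have that hash.
$seen = @{}
foreach ($row in $rows) {
    if (-not $seen.ContainsKey($row.MD5)) {
        $seen[$row.MD5] = New-Object System.Collections.ArrayList
    }
    [void]$seen[$row.MD5].Add($row)
}

# Keep only hashes that occur more than once, number each set of
# duplicates, and total the space taken by the n-1 extra copies.
$set    = 0
$wasted = [int64]0
$dupes  = foreach ($key in $seen.Keys) {
    $group = $seen[$key]
    if ($group.Count -gt 1) {
        $set++
        $wasted += ($group.Count - 1) * [int64]$group[0].Length
        foreach ($row in $group) {
            [pscustomobject]@{
                DupeSet  = $set
                MD5      = $row.MD5
                Length   = [int64]$row.Length
                FullName = $row.FullName
            }
        }
    }
}

$dupes | Format-Table DupeSet, MD5, Length, FullName
"Duplicated bytes: $wasted"
```

If that is roughly right, $dupes could go straight to Export-CSV, and no sorting is needed at all.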
Kurt
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Kurt Buff
> Sent: Monday, August 3, 2015 8:26 PM
> To: [email protected]
> Subject: [powershell] Re: Need some pointers on an exercise I've set for
> myself
>
> Replying to myself, since that seems the reasonable thing to do here.
>
> I've tested the following against a smaller directory that I know has some
> duplicates, and am making progress. Here is what I have so far (work with
> the line wraps!):
>
> Get-ChildItem S:\ -File -Recurse | select fullname, length | Export-CSV
> -NoTypeInformation c:\temp\files.csv
>
> Import-CSV c:\temp\files.csv | Select-Object -Property
> @{Name="MD5";Expression={(Get-FileHash -Algorithm MD5
> $_.FullName).Hash}},Length,FullName | Export-CSV -NoTypeInformation
> c:\temp\filehash.csv
>
> Import-CSV C:\temp\checker\fileMD5.csv | Sort-Object @{Expression={$_.Length
> -as [int]}} | Export-CSV -NoTypeInformation c:\temp\checker\FileMD5Sorted.csv
>
> The above generates a file of 315286 lines (not including header) - of
> course, that's the number of files in the directory tree. I get output that
> looks like this (work with the line wraps again):
>
> "MD5","Length","FullName"
> "6467C3875955DF4514395F0AFCAAA62A","3182604288","S:\Infrastructure\Microsoft\OSes\Win7EntSP1_64bit\SW_DVD5_SA_Win_Ent_7w_SP1_64BIT_English_-2_MLF_X17-58882.ISO"
>
> I noticed two oddities, however:
>
> o- zero-length files generate a hash, and of course the hash is the same for
> all of them. I probably should have expected that, but it surprised me.
>
> o- I find a handful of files (22 of them) at the top of the csv file after
> sorting that don't seem to obey the sorting on the hash that the other files
> followed. It's very strange. They're not duplicates of any other files; their
> hashes and file sizes are out of sort order from all of the rest, AFAICT. I'm
> not sure what to make of that.
>
> But, ignoring those two things, I'd like to proceed a bit further:
>
> o- Writing to another file only those lines that are duplicate files, which I
> can do by selecting the lines that have matching hashes (and
> possibly also matching sizes)
>
> o- Possibly adding another column, which would contain an integer that would
> increment for each set of matched files, which would probably lead to...
>
> o- Among other things, calculating the amount of duplicated space (sum of n-1
> file sizes for each set of dupes), identifying duplicate directories that can
> be eliminated in toto, etc.
>
> But, I'm stymied on the execution of the logic. I'm such an inexperienced
> programmer that I'm flailing on the first of these steps. I believe I need to
> make a stepwise comparison of the MD5 column, which I think would look
> something like this:
>
> $dupe = 1
> read infile.line1 into variable1
> read infile.line2 into variable2
> if variable1.MD5 -eq variable2.MD5 {
>     prefix variable1 with dupe counter
>     write variable1 to the new csv file
>     while not eof
>         set variable1 to the contents of variable2
>         read next line into variable2
>         compare variable1.MD5 to variable2.MD5
>         if match
>             prefix variable1 with $dupe
>             append variable1 as new line of new csv file
>         else
>             increment dupe counter
>     endwhile
> }
> else {
>     while not eof
>         set variable1 to the contents of variable2
>         read next line into variable2
>         compare variable1.MD5 to variable2.MD5
>         if match
>             prefix variable1 with $dupe
>             append variable1 as new line of new csv file
>         else
>             increment dupe counter
>     endwhile
> }
>
> I realize I could be way off base on the algorithm here, but that's what I've
> been able to dream up.
>
> Anyone care to critique and offer syntax suggestions - my googlefu is about
> exhausted.
>
> Kurt
>
> On Thu, Jul 30, 2015 at 12:45 PM, Kurt Buff <[email protected]> wrote:
>> I'm putting together what should be a simple little script, and failing.
>>
>> I am ultimately looking to run this against a directory, then sort the
>> output on the hash field and then parse for duplicates. There are two
>> conditions that concern me: 1) there are over 3m files in the target
>> directory, and 2) many of the files are quite large, over 1g.
>>
>> I'm more concerned about the effects of the script on memory than on
>> processor - the data is fairly static, and I intend to run it once a
>> month or even less, but I did choose MD5 as the hash algorithm for
>> speed, rather than accept the default of SHA256.
>>
>> This is pretty simple stuff, I'm sure, but I'm using this as a
>> learning exercise more than anything, as there are duplicate file
>> finders out in the world already.
>>
>> There are several problems with what I have put together so far, which
>> is this:
>>
>> Get-ChildItem c:\stuff -Recurse | select length, fullname |
>> export-csv -NoTypeInformation c:\temp\files.csv
>> Import-CSV C:\temp\files.csv | ForEach-Object { (get-filehash
>> -algorithm md5 $_.FullName) }; Length | Sort hash
>>
>> Using Length (or $_.Length) anywhere in the foreach statement gives an
>> error, or gives weird output.
>>
>> Sample Output when not using Length, and therefore getting reasonable
>> output (extra spaces and hyphen delimiters elided):
>> Algorithm  Hash                              Path
>> MD5        592BE1AD0ED83C36D5E68CA7A014A510  C:\stuff\Tools\SomeFile.DOC
>>
>> What I'd like to see instead:
>> Hash                              Length  Path
>> 592BE1AD0ED83C36D5E68CA7A014A510  79872   C:\stuff\Tools\SomeFile.DOC
>>
>> If anyone can offer some instruction, I'd appreciate it.
>>
>> Kurt
>
>
> ================================================
> Did you know you can also post and find answers on PowerShell in the forums?
> http://www.myitforum.com/forums/default.asp?catApp=1