RE: [powershell] Re: Need some pointers on an exercise I've set for myself

Michael B. Smith Wed, 05 Aug 2015 10:36:20 -0700

Where is that error coming from? When you generate the file list?

-----Original Message-----
From: [email protected] [mailto:[email protected]] On 
Behalf Of Kurt Buff
Sent: Wednesday, August 5, 2015 1:26 PM
To: [email protected]
Subject: Re: [powershell] Re: Need some pointers on an exercise I've set for 
myself


Many thanks for this - I believe I have understood your explanation, and I've 
updated my script, and it works.

Unfortunately, now I have to deal with path length problems. I'm getting lots 
of the following message:

Get-ChildItem : The specified path, file name, or both are too long.
The fully qualified file name must be less than 260 characters, and the 
directory name must be less than 248 characters.

I'm doing some research on this now, but I'm guessing this will cause problems 
throughout the script.

I've found this article, which seems relatively comprehensive, and am following 
up on the alternatives it provides.
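
A minimal sketch of one way to see how much of the tree is affected, assuming the goal is only to record the paths that Get-ChildItem could not enumerate. It does not make the long paths readable. The function name Get-FileListWithSkipped and its output shape are made up for illustration.

```powershell
# Hypothetical helper: enumerate files under $Root and collect enumeration
# errors (for example "path too long") instead of letting them scroll past.
function Get-FileListWithSkipped {
    param([Parameter(Mandatory)][string]$Root)

    $enumErrors = @()
    # SilentlyContinue keeps the walk going; -ErrorVariable gathers each error record.
    $files = @(Get-ChildItem -LiteralPath $Root -File -Recurse `
            -ErrorAction SilentlyContinue -ErrorVariable enumErrors)

    [pscustomobject]@{
        Files   = $files
        # TargetObject usually holds the path that failed (an assumption, not a guarantee).
        Skipped = @($enumErrors | ForEach-Object {
                [pscustomobject]@{ Path = $_.TargetObject; Reason = $_.Exception.Message }
            })
    }
}
```

Usage would be something like `$result = Get-FileListWithSkipped -Root 'S:\'`, after which `$result.Skipped` can be exported with Export-Csv to review the problem directories.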

Kurt

On Tue, Aug 4, 2015 at 7:07 PM, Michael B. Smith <[email protected]> wrote:
> You don't want to Group-Object with -AsHashTable. With that switch, 
> Group-Object emits a single hash table that contains every group, and that 
> one object is what gets passed to Where-Object. Its count is 2, since there 
> are 2 unique hashes in the input file, so it passes the -gt 1 test and 
> everything comes through.
>
> This gives you the grouping you want:
>
>         Import-Csv C:\temp\FileHashSortedOnHash.csv |
>             Group-Object -Property Hash |
>             Where-Object { $_.Count -gt 1 }
>
> To get the output you want, you do this:
>
>         Import-Csv C:\temp\FileHashSortedOnHash.csv |
>             Group-Object -Property Hash |
>             Where-Object { $_.Count -gt 1 } |
>             Select-Object -ExpandProperty Group |
>             Export-Csv -NoTypeInformation c:\temp\test.csv
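
A small sketch of the difference described above, using made-up in-memory rows instead of the CSV file. The hash values and paths are placeholders, not taken from the real data.

```powershell
# Placeholder rows: one singleton hash and one hash that appears twice.
$rows = @(
    [pscustomobject]@{ Hash = 'AAAA'; Length = 41342; FullName = 'S:\one.jpg' }
    [pscustomobject]@{ Hash = 'BBBB'; Length = 38752; FullName = 'S:\two.dll' }
    [pscustomobject]@{ Hash = 'BBBB'; Length = 38752; FullName = 'S:\three.dll' }
)

# With -AsHashTable the pipeline emits ONE object: a hashtable keyed by hash.
# Its Count is the number of distinct hashes, not the size of any group.
$asTable = $rows | Group-Object -Property Hash -AsHashTable
$asTable.GetType().Name
$asTable.Count

# Without the switch, one GroupInfo object is emitted per hash, each with its
# own Count, so Where-Object can filter out the singletons.
$dupes = $rows |
    Group-Object -Property Hash |
    Where-Object { $_.Count -gt 1 } |
    Select-Object -ExpandProperty Group
$dupes
```

Select-Object -ExpandProperty Group unwraps each surviving group back into the original row objects, which is why the result can go straight to Export-Csv.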
>
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Kurt Buff
> Sent: Tuesday, August 4, 2015 8:29 PM
> To: [email protected]
> Subject: Re: [powershell] Re: Need some pointers on an exercise I've 
> set for myself
>
> OK - once more replying to myself. Trying to work with a hashtable, and I 
> know I'm doing something really silly, but after looking through the web, and 
> my two powershell books, I'm not figuring it out.
>
> The datafile looks like this (obviously much truncated from the 
> 3mb/31k line file that contains the listing for the entire partition, 
> but notice that I have a singleton entry, and then 4 entries that are 
> the same size and have the same hash, though the file names are
> different):
>
> ----------Begin Data File----------
> "Hash","Length","Fullname"
> "004B863286B718C1FAC71D017C20BEE2","41342","S:\Infrastructure\Microsoft\Office\1997\CLIPART\PHOTOS\FAMILY\WALKING1.JPG"
> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\BusinessApps\sql_server_std_2008\ia64\Setup\sql_engine_core_shared_msi\Windows\Gac\itlr0xln.dll"
> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\Infrastructure\Microsoft\SQL\2008R2-Standard\ia64\Setup\sql_engine_core_shared_msi\PFiles\SqlServr\100\DTS\FEEnmer\qc5laxpt.dll"
> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\BusinessApps\sql_server_std_2008\ia64\Setup\sql_engine_core_shared_msi\PFiles\SqlServr\100\DTS\FEEnmer\qc5laxpt.dll"
> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\Infrastructure\Microsoft\SQL\2008R2-Standard\ia64\Setup\sql_engine_core_shared_msi\Windows\Gac\itlr0xln.dll"
> ----------End Data File---------
>
> If I run this command against it:
>
>         Import-Csv C:\temp\FileHashSortedOnHash.csv |
>             Select-Object Hash, FullName |
>             Group-Object -property Hash -AsHashTable |
>             Where-Object { $_.count -gt 1 } |
>             fl
>
> I get the following output:
>
> ----------Begin PS Output---------
> Name  : 004B863286B718C1FAC71D017C20BEE2
> Value : {@{Hash=004B863286B718C1FAC71D017C20BEE2;
>         FullName=S:\Infrastructure\Microsoft\Office\1997\CLIPART\PHOTOS\FAMILY\WALKING1.JPG}}
>
> Name  : 0038BE5708BB485D8CF13E290D65CE3C
> Value : {@{Hash=0038BE5708BB485D8CF13E290D65CE3C;
>         FullName=S:\BusinessApps\sql_server_std_2008\ia64\Setup\sql_engine_core_shared_msi\Windows\Gac\itlr0xln.dll},
>         @{Hash=0038BE5708BB485D8CF13E290D65CE3C;
>         FullName=S:\Infrastructure\Microsoft\SQL\2008R2-Standard\ia64\Setup\sql_engine_core_shared_msi\PFiles\SqlServr\100\DTS\FEEnmer\qc5laxpt.dll},
>         @{Hash=0038BE5708BB485D8CF13E290D65CE3C;
>         FullName=S:\BusinessApps\sql_server_std_2008\ia64\Setup\sql_engine_core_shared_msi\PFiles\SqlServr\100\DTS\FEEnmer\qc5laxpt.dll},
>         @{Hash=0038BE5708BB485D8CF13E290D65CE3C;
>         FullName=S:\Infrastructure\Microsoft\SQL\2008R2-Standard\ia64\Setup\sql_engine_core_shared_msi\Windows\Gac\itlr0xln.dll}}
> ----------End PS Output---------
>
> BTW - the output poses the question: If I'm looking for only items that have 
> a count of more than one, why am I getting the first name/value pair? I 
> believe that it's because I'm not doing something correctly, but I can't 
> figure out what.
>
> If I try to export as a CSV file instead of letting it Format-List to the 
> screen, I get this:
>
> ----------Begin CSV Output--------
> "IsReadOnly","IsFixedSize","IsSynchronized","Keys","Values","SyncRoot","Count"
> "False","False","False","System.Collections.Hashtable+KeyCollection","System.Collections.Hashtable+ValueCollection","System.Object","33"
> ----------End CSV Output----------
>
> What I'd like to see is another CSV file that looks like the one 
> following - notice that the only difference is that the singleton is
> gone:
>
> ----------Begin Data File----------
> "Hash","Length","Fullname"
> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\BusinessApps\sql_server_std_2008\ia64\Setup\sql_engine_core_shared_msi\Windows\Gac\itlr0xln.dll"
> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\Infrastructure\Microsoft\SQL\2008R2-Standard\ia64\Setup\sql_engine_core_shared_msi\PFiles\SqlServr\100\DTS\FEEnmer\qc5laxpt.dll"
> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\BusinessApps\sql_server_std_2008\ia64\Setup\sql_engine_core_shared_msi\PFiles\SqlServr\100\DTS\FEEnmer\qc5laxpt.dll"
> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\Infrastructure\Microsoft\SQL\2008R2-Standard\ia64\Setup\sql_engine_core_shared_msi\Windows\Gac\itlr0xln.dll"
> ----------End Data File---------
>
> Thanks again for any help.
>
> Kurt
>
> On Mon, Aug 3, 2015 at 5:25 PM, Kurt Buff <[email protected]> wrote:
>> Replying to myself, since that seems the reasonable thing to do here.
>>
>> I've tested the following against a smaller directory that I know has 
>> some duplicates, and am getting progress. Here is what I have so far 
>> (work with the line wraps!):
>>
>> Get-ChildItem S:\ -File -Recurse | select fullname, length | 
>> Export-CSV -NoTypeInformation c:\temp\files.csv
>>
>> Import-CSV c:\temp\files.csv | Select-Object -Property 
>> @{Name="MD5";Expression={(Get-Filehash -algorithm md5 
>> $_.FullName).Hash}},Length,FullName | Export-CSV -NoTypeInformation 
>> c:\temp\filehash.csv
>>
>> Import-CSV C:\temp\checker\fileMD5.csv | Sort-Object 
>> @{Expression={$_.Length -as [int]}} | Export-CSV -NoTypeInformation 
>> c:\temp\checker\FileMD5Sorted.csv
>>
>> The above generates a file of 315286 lines (not including header) - 
>> of course, that's the number of files in the directory tree. I get 
>> output that looks like this (work with the line wraps again):
>>
>> "MD5","Length","FullName"
>> "6467C3875955DF4514395F0AFCAAA62A","3182604288","S:\Infrastructure\Microsoft\OSes\Win7EntSP1_64bit\SW_DVD5_SA_Win_Ent_7w_SP1_64BIT_English_-2_MLF_X17-58882.ISO"
>>
>> I noticed two oddities, however:
>>
>> o- zero-length files generate a hash, and of course the hash is the 
>> same for all of them. I probably should have expected that, but it 
>> surprised me.
>>
>> o- I find a handful of files (22 of them) at the top of the csv file 
>> after sorting that don't seem to obey the sorting on the hash that 
>> the other files followed. It's very strange. They're not duplicates 
>> of any other files; their hashes and file sizes are out of sort order 
>> from all of the rest, AFAICT. I'm not sure what to make of that.
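
One possible explanation for the second oddity, offered as a guess rather than a diagnosis: the sort key in the command above is `$_.Length -as [int]`, and `-as` returns $null instead of throwing when a conversion fails. A length above 2147483647 (such as the 3182604288-byte ISO in the sample output) does not fit in a 32-bit int, so rows like that would get a $null key and tend to land ahead of everything else. A sketch with made-up rows, using [long] as the key type instead:

```powershell
# Made-up rows; Length is a string, as it would be after Import-Csv.
$rows = @(
    [pscustomobject]@{ Length = '3182604288'; FullName = 'big.iso' }
    [pscustomobject]@{ Length = '38752';      FullName = 'small.dll' }
    [pscustomobject]@{ Length = '41342';      FullName = 'photo.jpg' }
)

# -as yields $null when the value does not fit the target type.
$asInt  = '3182604288' -as [int]    # $null: larger than [int]::MaxValue
$asLong = '3182604288' -as [long]   # 3182604288

# Sorting on a 64-bit key keeps files over 2 GB in numeric order.
$sorted = $rows | Sort-Object @{ Expression = { $_.Length -as [long] } }
$sorted | Select-Object Length, FullName
```

If the count of files larger than 2 GB in the tree matches the number of out-of-order rows, that would support this explanation.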
>>
>> But, ignoring those two things, I'd like to proceed a bit further:
>>
>> o- Writing to another file only those lines that are duplicate files, 
>> which I can do by selecting the lines that have matching 
>> hashes (and possibly also matching sizes)
>>
>> o- Possibly adding another column, which would contain an integer 
>> that would increment for each set of matched files, which would 
>> probably lead to...
>>
>> o- Among other things, calculating the amount of duplicated space 
>> (sum of n-1 file sizes for each set of dupes), identifying duplicate 
>> directories that can be eliminated in toto, etc.
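
The first three goals in the list above (keeping only duplicates, numbering each set, and summing n-1 file sizes per set) can be sketched with Group-Object. This is one possible approach, not the only one; the rows, the DupeSet column name, and the variable names are made up, and it assumes files with the same hash have the same length.

```powershell
# Made-up rows in the shape of the hashed CSV (Length is text after Import-Csv).
$rows = @(
    [pscustomobject]@{ Hash = 'AAAA'; Length = '100'; FullName = 'a1' }
    [pscustomobject]@{ Hash = 'BBBB'; Length = '50';  FullName = 'b1' }
    [pscustomobject]@{ Hash = 'BBBB'; Length = '50';  FullName = 'b2' }
    [pscustomobject]@{ Hash = 'BBBB'; Length = '50';  FullName = 'b3' }
    [pscustomobject]@{ Hash = 'CCCC'; Length = '7';   FullName = 'c1' }
    [pscustomobject]@{ Hash = 'CCCC'; Length = '7';   FullName = 'c2' }
)

$setId = 0
[long]$wasted = 0
$groups = $rows | Group-Object -Property Hash | Where-Object { $_.Count -gt 1 }

$tagged = foreach ($group in $groups) {
    $setId++
    # Every copy beyond the first is duplicated space: (n - 1) * size.
    $wasted += ($group.Count - 1) * [long]$group.Group[0].Length
    foreach ($row in $group.Group) {
        [pscustomobject]@{
            DupeSet  = $setId
            Hash     = $row.Hash
            Length   = $row.Length
            FullName = $row.FullName
        }
    }
}

$tagged
"Duplicated bytes: $wasted"
```

$tagged can be piped to Export-Csv -NoTypeInformation to get the duplicate-only file with the extra set-number column.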
>>
>> But, I'm stymied on the execution of the logic. I'm such an 
>> inexperienced programmer that I'm flailing on the first of these 
>> steps. I believe I need to make a stepwise comparison of the MD5 
>> column, which I think would look something like this:
>>
>>      $dupe = 1
>>      read infile.line1 into variable1
>>      read infile.line2 into variable2
>>      if {
>>           variable1.MD5 -eq variable2.MD5
>>           prefix variable1 with dupe counter
>>           write variable1 to the new csv file
>>           while not eof
>>                set variable1 to the contents of variable2
>>                read line next into variable2
>>                compare variable1.MD5 to variable2.MD5
>>           if match
>>                prefix variable1 with $dupe
>>                append variable1 as new line of new csv file
>>           else
>>                increment dupe counter
>>      endwhile }
>>     else {
>>           while not eof
>>                set variable1 to the contents of variable2
>>                read line next into variable2
>>                compare variable1.MD5 to variable2.MD5
>>           if match
>>                prefix variable1 with $dupe
>>                append variable1 as new line of new csv file
>>           else
>>                increment dupe counter
>>      endwhile
>>
>> I realize I could be way off base on the algorithm here, but that's 
>> what I've been able to dream up.
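
For comparison, a hedged PowerShell sketch of the stepwise idea in the pseudocode above: walk rows that are already sorted on the MD5 column, compare each row to the previous one, and emit both with a set counter when they match. The function name Select-AdjacentDuplicate and the DupeSet column are hypothetical, and the MD5, Length and FullName column names are assumed from the CSV shown earlier.

```powershell
# Hypothetical helper: $Rows must already be sorted on the MD5 column.
function Select-AdjacentDuplicate {
    param([object[]]$Rows)

    $dupe = 0                   # counter for the current duplicate set
    $previous = $null           # row seen on the previous iteration
    $previousWritten = $false   # has $previous already been emitted?

    foreach ($row in $Rows) {
        if ($null -ne $previous -and $row.MD5 -eq $previous.MD5) {
            if (-not $previousWritten) {
                # First match of a new set: bump the counter, emit the held-back row.
                $dupe++
                [pscustomobject]@{ DupeSet = $dupe; MD5 = $previous.MD5
                    Length = $previous.Length; FullName = $previous.FullName }
            }
            [pscustomobject]@{ DupeSet = $dupe; MD5 = $row.MD5
                Length = $row.Length; FullName = $row.FullName }
            $previousWritten = $true
        }
        else {
            # Hash changed: the current row starts a new candidate set.
            $previousWritten = $false
        }
        $previous = $row
    }
}
```

A call might look like `Import-Csv c:\temp\filehash.csv | Sort-Object MD5 | ForEach-Object -Begin { $all = @() } -Process { $all += $_ } -End { Select-AdjacentDuplicate -Rows $all }`, though for large inputs collecting the rows into a variable first would be simpler.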
>>
>> Anyone care to critique and offer syntax suggestions - my googlefu is 
>> about exhausted.
>>
>> Kurt
>>
>> On Thu, Jul 30, 2015 at 12:45 PM, Kurt Buff <[email protected]> wrote:
>>> I'm putting together what should be a simple little script, and failing.
>>>
>>> I am ultimately looking to run this against a directory, then sort 
>>> the output on the hash field and then parse for duplicates. There 
>>> are two conditions that concern me: 1) there are over 3m files in 
>>> the target directory, and 2) many of the files are quite large, over 1g.
>>>
>>> I'm more concerned about the effects of the script on memory than on 
>>> processor - the data is fairly static, and I intend to run it once a 
>>> month or even less, but I did choose MD5 as the hash algorithm for 
>>> speed, rather than accept the default of SHA256.
>>>
>>> This is pretty simple stuff, I'm sure, but I'm using this as a 
>>> learning exercise more than anything, as there are duplicate file 
>>> finders out in the world already.
>>>
>>> There are several problems with what I have put together so far, 
>>> which is this:
>>>
>>>      Get-ChildItem c:\stuff -Recurse | select length, fullname | 
>>> export-csv -NoTypeInformation c:\temp\files.csv
>>>      Import-CSV C:\temp\files.csv | ForEach-Object { (get-filehash 
>>> -algorithm md5 $_.FullName) }; Length | Sort hash
>>>
>>> Using Length (or $_.Length) anywhere in the foreach statement gives 
>>> an error, or gives weird output.
>>>
>>> Sample Output when not using Length, and therefore getting 
>>> reasonable output (extra spaces and hyphen delimiters elided):
>>>      Algorithm   Hash                               Path
>>>      MD5         592BE1AD0ED83C36D5E68CA7A014A510   C:\stuff\Tools\SomeFile.DOC
>>>
>>> What I'd like to see instead:
>>>      Hash                               Length   Path
>>>      592BE1AD0ED83C36D5E68CA7A014A510   79872    C:\stuff\Tools\SomeFile.DOC
>>>
>>> If anyone can offer some instruction, I'd appreciate it.
>>>
>>> Kurt
>>
>>
>> ================================================
>> Did you know you can also post and find answers on PowerShell in the forums?
>> http://www.myitforum.com/forums/default.asp?catApp=1
>>
>
>
