Re: [powershell] Re: Need some pointers on an exercise I've set for myself

Kurt Buff Wed, 05 Aug 2015 11:22:12 -0700

Sigh. I fubared the first line of the script, when changing it to test
against a bigger directory.


This:
Get-ChildItem "g:\groups information technology" -File -Recurse |
select length, fullname | export-csv -NoTypeInformation
c:\temp\IT-files.csv

Should be this:
Get-ChildItem "g:\groups\information technology" -File -Recurse |
select length, fullname | export-csv -NoTypeInformation
c:\temp\IT-files.csv

However, I do know that the partition that I'm going to be running
this against 'for real' does contain some really deep trees, because I
run a batch file to catch them, and have nagged the owners of those
directories for years regarding the problems they are experiencing.
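
A rough sketch of how over-long entries could be flagged in a list of paths, using the two limits quoted in the error message further down the thread (260 characters for the full name, 248 for the directory). Test-PathTooLong is a hypothetical helper name, not a built-in cmdlet, and it only inspects strings; it does not touch the file system.

```powershell
# Hypothetical helper: returns $true when a path string reaches either limit
# named in the Get-ChildItem error (260 for the full name, 248 for the directory).
function Test-PathTooLong {
    param([Parameter(Mandatory = $true)][string]$Path)

    # The directory part is everything before the last backslash,
    # so its length equals the index of that backslash.
    $lastSlash = $Path.LastIndexOf('\')
    $dirLength = if ($lastSlash -gt 0) { $lastSlash } else { 0 }

    return ($Path.Length -ge 260) -or ($dirLength -ge 248)
}

# Made-up sample paths for illustration only.
$shortPath = 'C:\temp\file.txt'
$longPath  = 'C:\' + ('a' * 300) + '\file.txt'

$shortResult = Test-PathTooLong $shortPath
$longResult  = Test-PathTooLong $longPath
```

It could be applied to the FullName column of an existing listing, for example Import-Csv c:\temp\IT-files.csv | Where-Object { Test-PathTooLong $_.FullName }, though entries that Get-ChildItem failed to enumerate will not be in that file to begin with.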

Kurt

On Wed, Aug 5, 2015 at 10:33 AM, Michael B. Smith <[email protected]> wrote:
> Where is that error coming from? When you generate the file list?
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] 
> On Behalf Of Kurt Buff
> Sent: Wednesday, August 5, 2015 1:26 PM
> To: [email protected]
> Subject: Re: [powershell] Re: Need some pointers on an exercise I've set for 
> myself
>
> Many thanks for this - I believe I have understood your explanation, and I've 
> updated my script, and it works.
>
> Unfortunately, now I have to deal with path length problems. I'm getting lots 
> of the following message:
>
> Get-ChildItem : The specified path, file name, or both are too long.
> The fully qualified file name must be less than 260 characters, and the 
> directory name must be less than 248 characters.
>
> I'm doing some research on this now, but I'm guessing this will cause 
> problems throughout the script.
>
> I've found this article, which seems relatively comprehensive, and am 
> following up on the alternatives it provides.
>
> Kurt
>
> On Tue, Aug 4, 2015 at 7:07 PM, Michael B. Smith <[email protected]> 
> wrote:
>> You don't want to use Group-Object with -AsHashTable. With that switch, the 
>> object passed to Where-Object is a single hash table containing the groups 
>> for both hashes, and its count is 2 because there are 2 unique hashes in the 
>> input file.
>>
>> This gives you the grouping you want:
>>
>>         Import-Csv C:\temp\FileHashSortedOnHash.csv | Group-Object
>> -property Hash | Where-Object { $_.count -gt 1 }
>>
>> To get the output you want, you do this:
>>
>>         Import-Csv C:\temp\FileHashSortedOnHash.csv | Group-Object
>> -property Hash | Where-Object { $_.count -gt 1 } | Select -Expand
>> Group | Export-Csv -NoTypeInformation c:\temp\test.csv
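
The difference described above can be seen on a few in-memory rows. This is a minimal sketch, assuming PowerShell 3.0 or later for [pscustomobject]; the hashes and paths are shortened, made-up stand-ins for the CSV contents.

```powershell
# Made-up rows standing in for Import-Csv output (all values are strings).
$rows = @(
    [pscustomobject]@{ Hash = 'AAA'; Length = '10'; FullName = 'S:\a\one.jpg' }
    [pscustomobject]@{ Hash = 'BBB'; Length = '20'; FullName = 'S:\b\two.dll' }
    [pscustomobject]@{ Hash = 'BBB'; Length = '20'; FullName = 'S:\c\two.dll' }
)

# Without -AsHashTable, Group-Object emits one group object per hash,
# so Where-Object tests each group's Count separately.
$dupes = $rows |
    Group-Object -Property Hash |
    Where-Object { $_.Count -gt 1 } |
    Select-Object -ExpandProperty Group

# With -AsHashTable, a single hashtable comes down the pipeline.
# Its Count is the number of distinct hashes (2 here), not a group size.
$table = $rows | Group-Object -Property Hash -AsHashTable
```

Here $dupes holds only the two 'BBB' rows, while $table.Count is 2 simply because there are two distinct hashes, which is why the filter on the hashtable lets everything through.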
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Kurt Buff
>> Sent: Tuesday, August 4, 2015 8:29 PM
>> To: [email protected]
>> Subject: Re: [powershell] Re: Need some pointers on an exercise I've
>> set for myself
>>
>> OK - once more replying to myself. Trying to work with a hashtable, and I 
>> know I'm doing something really silly, but after looking through the web, 
>> and my two powershell books, I'm not figuring it out.
>>
>> The datafile looks like this (obviously much truncated from the
>> 3mb/31k line file that contains the listing for the entire partition,
>> but notice that I have a singleton entry, and then 4 entries that are
>> the same size and have the same hash, though the file names are
>> different):
>>
>> ----------Begin Data File----------
>> "Hash","Length","Fullname"
>> "004B863286B718C1FAC71D017C20BEE2","41342","S:\Infrastructure\Microsoft\Office\1997\CLIPART\PHOTOS\FAMILY\WALKING1.JPG"
>> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\BusinessApps\sql_server_std_2008\ia64\Setup\sql_engine_core_shared_msi\Windows\Gac\itlr0xln.dll"
>> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\Infrastructure\Microsoft\SQL\2008R2-Standard\ia64\Setup\sql_engine_core_shared_msi\PFiles\SqlServr\100\DTS\FEEnmer\qc5laxpt.dll"
>> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\BusinessApps\sql_server_std_2008\ia64\Setup\sql_engine_core_shared_msi\PFiles\SqlServr\100\DTS\FEEnmer\qc5laxpt.dll"
>> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\Infrastructure\Microsoft\SQL\2008R2-Standard\ia64\Setup\sql_engine_core_shared_msi\Windows\Gac\itlr0xln.dll"
>> ----------End Data File---------
>>
>> If I run this command against it:
>>
>> Import-Csv C:\temp\FileHashSortedOnHash.csv | Select-Object Hash, FullName | 
>> Group-Object -property Hash -AsHashTable | Where-Object { $_.count -gt 1 } | 
>> fl
>>
>> I get the following output:
>>
>> ----------Begin PS Output---------
>> Name  : 004B863286B718C1FAC71D017C20BEE2
>> Value : {@{Hash=004B863286B718C1FAC71D017C20BEE2;
>>         FullName=S:\Infrastructure\Microsoft\Office\1997\CLIPART\PHOTOS\FAMILY\WALKING1.JPG}}
>>
>> Name  : 0038BE5708BB485D8CF13E290D65CE3C
>> Value : {@{Hash=0038BE5708BB485D8CF13E290D65CE3C;
>>         FullName=S:\BusinessApps\sql_server_std_2008\ia64\Setup\sql_engine_core_shared_msi\Windows\Gac\itlr0xln.dll},
>>         @{Hash=0038BE5708BB485D8CF13E290D65CE3C;
>>         FullName=S:\Infrastructure\Microsoft\SQL\2008R2-Standard\ia64\Setup\sql_engine_core_shared_msi\PFiles\SqlServr\100\DTS\FEEnmer\qc5laxpt.dll},
>>         @{Hash=0038BE5708BB485D8CF13E290D65CE3C;
>>         FullName=S:\BusinessApps\sql_server_std_2008\ia64\Setup\sql_engine_core_shared_msi\PFiles\SqlServr\100\DTS\FEEnmer\qc5laxpt.dll},
>>         @{Hash=0038BE5708BB485D8CF13E290D65CE3C;
>>         FullName=S:\Infrastructure\Microsoft\SQL\2008R2-Standard\ia64\Setup\sql_engine_core_shared_msi\Windows\Gac\itlr0xln.dll}}
>> ----------End PS Output---------
>>
>> BTW - the output poses the question: If I'm looking for only items that have 
>> a count of more than one, why am I getting the first name/value pair? I 
>> believe that it's because I'm not doing something correctly, but can't 
>> figure that out.
>>
>> If I try to export as a CSV file instead of letting it Format-List to the 
>> screen, I get this:
>>
>> ----------Begin CSV Output--------
>> "IsReadOnly","IsFixedSize","IsSynchronized","Keys","Values","SyncRoot","Count"
>> "False","False","False","System.Collections.Hashtable+KeyCollection","System.Collections.Hashtable+ValueCollection","System.Object","33"
>> ----------End CSV Output----------
>>
>> What I'd like to see is another CSV file that looks like the one
>> following - notice that the only difference is that the singleton is
>> gone:
>>
>> ----------Begin Data File----------
>> "Hash","Length","Fullname"
>> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\BusinessApps\sql_server_std_2008\ia64\Setup\sql_engine_core_shared_msi\Windows\Gac\itlr0xln.dll"
>> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\Infrastructure\Microsoft\SQL\2008R2-Standard\ia64\Setup\sql_engine_core_shared_msi\PFiles\SqlServr\100\DTS\FEEnmer\qc5laxpt.dll"
>> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\BusinessApps\sql_server_std_2008\ia64\Setup\sql_engine_core_shared_msi\PFiles\SqlServr\100\DTS\FEEnmer\qc5laxpt.dll"
>> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\Infrastructure\Microsoft\SQL\2008R2-Standard\ia64\Setup\sql_engine_core_shared_msi\Windows\Gac\itlr0xln.dll"
>> ----------End Data File---------
>>
>> Thanks again for any help.
>>
>> Kurt
>>
>> On Mon, Aug 3, 2015 at 5:25 PM, Kurt Buff <[email protected]> wrote:
>>> Replying to myself, since that seems the reasonable thing to do here.
>>>
>>> I've tested the following against a smaller directory that I know has
>>> some duplicates, and am getting progress. Here is what I have so far
>>> (work with the line wraps!):
>>>
>>> Get-ChildItem S:\ -File -Recurse | select fullname, length |
>>> Export-CSV -NoTypeInformation c:\temp\files.csv
>>>
>>> Import-CSV c:\temp\files.csv | Select-Object -Property
>>> @{Name="MD5";Expression={(Get-Filehash -algorithm md5
>>> $_.FullName).Hash}},Length,FullName | Export-CSV -NoTypeInformation
>>> c:\temp\filehash.csv
>>>
>>> Import-CSV C:\temp\checker\fileMD5.csv | Sort-Object
>>> @{Expression={$_.Length -as [int]}} | Export-CSV -NoTypeInformation
>>> c:\temp\checker\FileMD5Sorted.csv
>>>
>>> The above generates a file of 315286 lines (not including header) -
>>> of course, that's the number of files in the directory tree. I get
>>> output that looks like this (work with the line wraps again):
>>>
>>> "MD5","Length","FullName"
>>> "6467C3875955DF4514395F0AFCAAA62A","3182604288","S:\Infrastructure\Microsoft\OSes\Win7EntSP1_64bit\SW_DVD5_SA_Win_Ent_7w_SP1_64BIT_English_-2_MLF_X17-58882.ISO"
>>>
>>> I noticed two oddities, however:
>>>
>>> o- zero-length files generate a hash, and of course the hash is the
>>> same for all of them. I probably should have expected that, but it
>>> surprised me.
>>>
>>> o- I find a handful of files (22 of them) at the top of the csv file
>>> after sorting that don't seem to obey the sorting on the hash that
>>> the other files followed. It's very strange. They're not duplicates
>>> of any other files; their hashes and file sizes are out of sort order
>>> from all of the rest, AFAICT. I'm not sure what to make of that.
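
Both oddities can be reproduced in isolation. The sketch below is one possible explanation only, not a confirmed diagnosis: every zero-length file hashes the same empty input, and a length above 2147483647 (such as the 3182604288-byte ISO shown above) cannot be converted with -as [int], so it yields $null as its sort key. The sample lengths are taken from this thread; the variable names are illustrative.

```powershell
# MD5 of zero bytes: every empty file hashes this same (empty) input.
$md5 = [System.Security.Cryptography.MD5]::Create()
$emptyHash = [System.BitConverter]::ToString($md5.ComputeHash([byte[]]@())) -replace '-', ''

# -as returns $null rather than throwing when the conversion fails,
# which happens for any value above [int]::MaxValue (2147483647).
$asInt  = '3182604288' -as [int]
$asLong = '3182604288' -as [long]

# Sorting on a 64-bit key keeps very large files in numeric order.
$lengths    = '3182604288', '41342', '38752'
$sortedLong = $lengths | Sort-Object { $_ -as [long] }
```

If the 22 stray files all turn out to be larger than 2 GB, replacing [int] with [long] in the Sort-Object expression would be the thing to try.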
>>>
>>> But, ignoring those two things, I'd like to proceed a bit further:
>>>
>>> o- Writing to another file only those lines that are duplicate files,
>>> which I can do by selecting the lines that have matching
>>> hashes (and possibly also matching sizes)
>>>
>>> o- Possibly adding another column, which would contain an integer
>>> that would increment for each set of matched files, which would
>>> probably lead to...
>>>
>>> o- Among other things, calculating the amount of duplicated space
>>> (sum of n-1 file sizes for each set of dupes), identifying duplicate
>>> directories that can be eliminated in toto, etc.
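
The set counter and the "sum of n-1 file sizes" idea from the list above can be sketched with Group-Object rather than a line-by-line comparison. This is only an illustration under stated assumptions: the rows are made up, the column names DupeSet, Copies and WastedBytes are hypothetical, and every file in a set is assumed to have the same length as the first one.

```powershell
# Made-up rows standing in for the hashed CSV (Import-Csv yields strings).
$rows = @(
    [pscustomobject]@{ Hash = 'AAA'; Length = '10';  FullName = 'S:\a\single.txt' }
    [pscustomobject]@{ Hash = 'BBB'; Length = '100'; FullName = 'S:\a\x.dll' }
    [pscustomobject]@{ Hash = 'BBB'; Length = '100'; FullName = 'S:\b\x.dll' }
    [pscustomobject]@{ Hash = 'BBB'; Length = '100'; FullName = 'S:\c\x.dll' }
    [pscustomobject]@{ Hash = 'CCC'; Length = '50';  FullName = 'S:\a\y.dll' }
    [pscustomobject]@{ Hash = 'CCC'; Length = '50';  FullName = 'S:\b\y.dll' }
)

# Keep only hashes that occur more than once.
$groups = $rows | Group-Object -Property Hash | Where-Object { $_.Count -gt 1 }

# Number each set and count every copy beyond the first as duplicated space.
$setNumber = 0
$report = foreach ($g in $groups) {
    $setNumber++
    $size = [long]$g.Group[0].Length
    [pscustomobject]@{
        DupeSet     = $setNumber
        Hash        = $g.Name
        Copies      = $g.Count
        WastedBytes = ($g.Count - 1) * $size
    }
}

$totalWasted = ($report | Measure-Object -Property WastedBytes -Sum).Sum
```

With these sample rows the 'BBB' set contributes 200 bytes and the 'CCC' set 50, so $totalWasted is 250.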
>>>
>>> But, I'm stymied on the execution of the logic. I'm such an
>>> inexperienced programmer that I'm flailing on the first of these
>>> steps. I believe I need to make a stepwise comparison of the MD5
>>> column, which I think would look something like this:
>>>
>>>      $dupe = 1
>>>      read infile.line1 into variable1
>>>      read infile.line2 into variable2
>>>      if {
>>>           variable1.MD5 -eq variable2.MD5
>>>           prefix variable1 with dupe counter
>>>           write variable1 to the new csv file
>>>           while not eof
>>>                set variable1 to the contents of variable2
>>>                read line next into variable2
>>>                compare variable1.MD5 to variable2.MD5
>>>           if match
>>>                prefix variable1 with $dupe
>>>                append variable1 as new line of new csv file
>>>           else
>>>                increment dupe counter
>>>      endwhile }
>>>     else {
>>>           while not eof
>>>                set variable1 to the contents of variable2
>>>                read line next into variable2
>>>                compare variable1.MD5 to variable2.MD5
>>>           if match
>>>                prefix variable1 with $dupe
>>>                append variable1 as new line of new csv file
>>>           else
>>>                increment dupe counter
>>>      endwhile
>>>
>>> I realize I could be way off base on the algorithm here, but that's
>>> what I've been able to dream up.
>>>
>>> Anyone care to critique and offer syntax suggestions - my googlefu is
>>> about exhausted.
>>>
>>> Kurt
>>>
>>> On Thu, Jul 30, 2015 at 12:45 PM, Kurt Buff <[email protected]> wrote:
>>>> I'm putting together what should be a simple little script, and failing.
>>>>
>>>> I am ultimately looking to run this against a directory, then sort
>>>> the output on the hash field and then parse for duplicates. There
>>>> are two conditions that concern me: 1) there are over 3m files in
>>>> the target directory, and 2) many of the files are quite large, over 1g.
>>>>
>>>> I'm more concerned about the effects of the script on memory than on
>>>> processor - the data is fairly static, and I intend to run it once a
>>>> month or even less, but I did choose MD5 as the hash algorithm for
>>>> speed, rather than accept the default of SHA256.
>>>>
>>>> This is pretty simple stuff, I'm sure, but I'm using this as a
>>>> learning exercise more than anything, as there are duplicate file
>>>> finders out in the world already.
>>>>
>>>> There are several problems with what I have put together so far,
>>>> which is this:
>>>>
>>>>      Get-ChildItem c:\stuff -Recurse | select length, fullname |
>>>> export-csv -NoTypeInformation c:\temp\files.csv
>>>>      Import-CSV C:\temp\files.csv | ForEach-Object { (get-filehash
>>>> -algorithm md5 $_.FullName) }; Length | Sort hash
>>>>
>>>> Using Length (or $_.Length) anywhere in the foreach statement gives
>>>> an error, or gives weird output.
>>>>
>>>> Sample Output when not using Length, and therefore getting
>>>> reasonable output (extra spaces and hyphen delimiters elided):
>>>>      Algorithm  Hash                              Path
>>>>      MD5        592BE1AD0ED83C36D5E68CA7A014A510  C:\stuff\Tools\SomeFile.DOC
>>>>
>>>> What I'd like to see instead:
>>>>      Hash                              Length  Path
>>>>      592BE1AD0ED83C36D5E68CA7A014A510  79872   C:\stuff\Tools\SomeFile.DOC
>>>>
>>>> If anyone can offer some instruction, I'd appreciate it.
>>>>
>>>> Kurt
>>>
>>>
>>> ================================================
>>> Did you know you can also post and find answers on PowerShell in the forums?
>>> http://www.myitforum.com/forums/default.asp?catApp=1
>>>