Sigh. I fubared the first line of the script, when changing it to test against a bigger directory.
This:

Get-ChildItem "g:\groups information technology" -File -Recurse |
    select length, fullname |
    export-csv -NoTypeInformation c:\temp\IT-files.csv

Should be this:

Get-ChildItem "g:\groups\information technology" -File -Recurse |
    select length, fullname |
    export-csv -NoTypeInformation c:\temp\IT-files.csv

However, I do know that the partition that I'm going to be running this
against 'for real' does contain some really deep trees, because I run a
batch file to catch them, and have nagged the owners of those directories
for years regarding the problems they are experiencing.

Kurt

On Wed, Aug 5, 2015 at 10:33 AM, Michael B. Smith <[email protected]> wrote:
> Where is that error coming from? When you generate the file list?
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Kurt Buff
> Sent: Wednesday, August 5, 2015 1:26 PM
> To: [email protected]
> Subject: Re: [powershell] Re: Need some pointers on an exercise I've set for
> myself
>
> Many thanks for this - I believe I have understood your explanation, and I've
> updated my script, and it works.
>
> Unfortunately, now I have to deal with path length problems. I'm getting lots
> of the following message:
>
> Get-ChildItem : The specified path, file name, or both are too long.
> The fully qualified file name must be less than 260 characters, and the
> directory name must be less than 248 characters.
>
> I'm doing some research on this now, but I'm guessing this will cause
> problems throughout the script.
>
> I've found this article, which seems relatively comprehensive, and am
> following up on the alternatives it provides.
>
> Kurt
>
> On Tue, Aug 4, 2015 at 7:07 PM, Michael B. Smith <[email protected]>
> wrote:
>> You don't want to Group-Object with -AsHashTable. That means the object
>> passed to Where-Object is a hash table which contains the results of both
>> hashes. Which has a count of 2 since there are 2 unique hashes in the input
>> file.
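
To illustrate the point about -AsHashTable, here is a minimal sketch that uses a few made-up rows instead of the real CSV (the $rows data, the hash values, and the paths are all hypothetical). It is only meant to show that the hash table travels down the pipeline as one object, so Where-Object ends up testing the table's own Count rather than the count of each group.

```powershell
# Hypothetical rows standing in for FileHashSortedOnHash.csv
$rows = @(
    [pscustomobject]@{ Hash = 'AAAA'; FullName = 'S:\one.jpg' }
    [pscustomobject]@{ Hash = 'BBBB'; FullName = 'S:\two.dll' }
    [pscustomobject]@{ Hash = 'BBBB'; FullName = 'S:\three.dll' }
)

# With -AsHashTable the result is ONE object: a hash table keyed on Hash.
$table = $rows | Group-Object -Property Hash -AsHashTable

# Where-Object sees the whole table once; $_.Count is the number of keys.
# The test is true, so the entire table (singleton included) passes through.
$passed = $table | Where-Object { $_.Count -gt 1 }

# Without -AsHashTable each group is its own object and Count is the
# number of files sharing that hash, so the singleton is filtered out.
$dupes = $rows |
    Group-Object -Property Hash |
    Where-Object { $_.Count -gt 1 } |
    Select-Object -ExpandProperty Group
```

Here $passed still contains the 'AAAA' key, while $dupes holds only the two 'BBBB' rows.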
>>
>> This gives you the grouping you want:
>>
>> Import-Csv C:\temp\FileHashSortedOnHash.csv |
>>     Group-Object -property Hash |
>>     Where-Object { $_.count -gt 1 }
>>
>> To get the output you want, you do this:
>>
>> Import-Csv C:\temp\FileHashSortedOnHash.csv |
>>     Group-Object -property Hash |
>>     Where-Object { $_.count -gt 1 } |
>>     Select -Expand Group |
>>     Export-Csv -NoTypeInformation c:\temp\test.csv
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Kurt Buff
>> Sent: Tuesday, August 4, 2015 8:29 PM
>> To: [email protected]
>> Subject: Re: [powershell] Re: Need some pointers on an exercise I've
>> set for myself
>>
>> OK - once more replying to myself. Trying to work with a hashtable, and I
>> know I'm doing something really silly, but after looking through the web,
>> and my two powershell books, I'm not figuring it out.
>>
>> The datafile looks like this (obviously much truncated from the
>> 3mb/31k line file that contains the listing for the entire partition,
>> but notice that I have a singleton entry, and then 4 entries that are
>> the same size and have the same hash, though the paths are
>> different):
>>
>> ----------Begin Data File----------
>> "Hash","Length","Fullname"
>> "004B863286B718C1FAC71D017C20BEE2","41342","S:\Infrastructure\Microsoft\Office\1997\CLIPART\PHOTOS\FAMILY\WALKING1.JPG"
>> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\BusinessApps\sql_server_std_2008\ia64\Setup\sql_engine_core_shared_msi\Windows\Gac\itlr0xln.dll"
>> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\Infrastructure\Microsoft\SQL\2008R2-Standard\ia64\Setup\sql_engine_core_shared_msi\PFiles\SqlServr\100\DTS\FEEnmer\qc5laxpt.dll"
>> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\BusinessApps\sql_server_std_2008\ia64\Setup\sql_engine_core_shared_msi\PFiles\SqlServr\100\DTS\FEEnmer\qc5laxpt.dll"
>> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\Infrastructure\Microsoft\SQL\2008R2-Standard\ia64\Setup\sql_engine_core_shared_msi\Windows\Gac\itlr0xln.dll"
>> ----------End Data File---------
>>
>> If I run this command against it:
>>
>> Import-Csv C:\temp\FileHashSortedOnHash.csv |
>>     Select-Object Hash, FullName |
>>     Group-Object -property Hash -AsHashTable |
>>     Where-Object { $_.count -gt 1 } |
>>     fl
>>
>> I get the following output:
>>
>> ----------Begin PS Output---------
>> Name  : 004B863286B718C1FAC71D017C20BEE2
>> Value : {@{Hash=004B863286B718C1FAC71D017C20BEE2;
>>         FullName=S:\Infrastructure\Microsoft\Office\1997\CLIPART\PHOTOS\FAMILY\WALKING1.JPG}}
>>
>> Name  : 0038BE5708BB485D8CF13E290D65CE3C
>> Value : {@{Hash=0038BE5708BB485D8CF13E290D65CE3C;
>>         FullName=S:\BusinessApps\sql_server_std_2008\ia64\Setup\sql_engine_core_shared_msi\Windows\Gac\itlr0xln.dll},
>>         @{Hash=0038BE5708BB485D8CF13E290D65CE3C;
>>         FullName=S:\Infrastructure\Microsoft\SQL\2008R2-Standard\ia64\Setup\sql_engine_core_shared_msi\PFiles\SqlServr\100\DTS\FEEnmer\qc5laxpt.dll},
>>         @{Hash=0038BE5708BB485D8CF13E290D65CE3C;
>>         FullName=S:\BusinessApps\sql_server_std_2008\ia64\Setup\sql_engine_core_shared_msi\PFiles\SqlServr\100\DTS\FEEnmer\qc5laxpt.dll},
>>         @{Hash=0038BE5708BB485D8CF13E290D65CE3C;
>>         FullName=S:\Infrastructure\Microsoft\SQL\2008R2-Standard\ia64\Setup\sql_engine_core_shared_msi\Windows\Gac\itlr0xln.dll}}
>> ----------End PS Output---------
>>
>> BTW - the output poses the question: If I'm looking for only items that have
>> a count of more than one, why am I getting the first name/value pair? I
>> believe that it's because I'm not doing something correctly, but can't
>> figure that out.
>>
>> If I try to export as a CSV file instead of letting it Format-List to the
>> screen, I get this:
>>
>> ----------Begin CSV Output--------
>> "IsReadOnly","IsFixedSize","IsSynchronized","Keys","Values","SyncRoot","Count"
>> "False","False","False","System.Collections.Hashtable+KeyCollection","System.Collections.Hashtable+ValueCollection","System.Object","33"
>> ----------End CSV Output----------
>>
>> What I'd like to see is another CSV file that looks like the one
>> following - notice that the only difference is that the singleton is
>> gone:
>>
>> ----------Begin Data File----------
>> "Hash","Length","Fullname"
>> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\BusinessApps\sql_server_std_2008\ia64\Setup\sql_engine_core_shared_msi\Windows\Gac\itlr0xln.dll"
>> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\Infrastructure\Microsoft\SQL\2008R2-Standard\ia64\Setup\sql_engine_core_shared_msi\PFiles\SqlServr\100\DTS\FEEnmer\qc5laxpt.dll"
>> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\BusinessApps\sql_server_std_2008\ia64\Setup\sql_engine_core_shared_msi\PFiles\SqlServr\100\DTS\FEEnmer\qc5laxpt.dll"
>> "0038BE5708BB485D8CF13E290D65CE3C","38752","S:\Infrastructure\Microsoft\SQL\2008R2-Standard\ia64\Setup\sql_engine_core_shared_msi\Windows\Gac\itlr0xln.dll"
>> ----------End Data File---------
>>
>> Thanks again for any help.
>>
>> Kurt
>>
>> On Mon, Aug 3, 2015 at 5:25 PM, Kurt Buff <[email protected]> wrote:
>>> Replying to myself, since that seems the reasonable thing to do here.
>>>
>>> I've tested the following against a smaller directory that I know has
>>> some duplicates, and am making progress.
>>> Here is what I have so far (work with the line wraps!):
>>>
>>> Get-ChildItem S:\ -File -Recurse |
>>>     select fullname, length |
>>>     Export-CSV -NoTypeInformation c:\temp\files.csv
>>>
>>> Import-CSV c:\temp\files.csv |
>>>     Select-Object -Property @{Name="MD5";Expression={(Get-Filehash -algorithm md5 $_.FullName).Hash}},Length,FullName |
>>>     Export-CSV -NoTypeInformation c:\temp\filehash.csv
>>>
>>> Import-CSV C:\temp\checker\fileMD5.csv |
>>>     Sort-Object @{Expression={$_.Length -as [int]}} |
>>>     Export-CSV -NoTypeInformation c:\temp\checker\FileMD5Sorted.csv
>>>
>>> The above generates a file of 315286 lines (not including header) -
>>> of course, that's the number of files in the directory tree. I get
>>> output that looks like this (work with the line wraps again):
>>>
>>> "MD5","Length","FullName"
>>> "6467C3875955DF4514395F0AFCAAA62A","3182604288","S:\Infrastructure\Microsoft\OSes\Win7EntSP1_64bit\SW_DVD5_SA_Win_Ent_7w_SP1_64BIT_English_-2_MLF_X17-58882.ISO"
>>>
>>> I noticed two oddities, however:
>>>
>>> o- zero-length files generate a hash, and of course the hash is the
>>> same for all of them. I probably should have expected that, but it
>>> surprised me.
>>>
>>> o- I find a handful of files (22 of them) at the top of the csv file
>>> after sorting that don't seem to obey the sorting on the hash that
>>> the other files followed. It's very strange. They're not duplicates
>>> of any other files; their hashes and file sizes are out of sort order
>>> from all of the rest, AFAICT. I'm not sure what to make of that.
>>>
>>> But, ignoring those two things, I'd like to proceed a bit further:
>>>
>>> o- Writing to another file only those lines that are duplicate files,
>>> which I can do by selecting the lines that have matching
>>> hashes (and possibly also matching sizes)
>>>
>>> o- Possibly adding another column, which would contain an integer
>>> that would increment for each set of matched files, which would
>>> probably lead to...
>>>
>>> o- Among other things, calculating the amount of duplicated space
>>> (sum of n-1 file sizes for each set of dupes), identifying duplicate
>>> directories that can be eliminated in toto, etc.
>>>
>>> But, I'm stymied on the execution of the logic. I'm such an
>>> inexperienced programmer that I'm flailing on the first of these
>>> steps. I believe I need to make a stepwise comparison of the MD5
>>> column, which I think would look something like this:
>>>
>>> $dupe = 1
>>> read infile.line1 into variable1
>>> read infile.line2 into variable2
>>> if {
>>>   variable1.MD5 -eq variable2.MD5
>>>   prefix variable1 with dupe counter
>>>   write variable1 to the new csv file
>>>   while not eof
>>>     set variable1 to the contents of variable2
>>>     read line next into variable2
>>>     compare variable1.MD5 to variable2.MD5
>>>     if match
>>>       prefix variable1 with $dupe
>>>       append variable1 as new line of new csv file
>>>     else
>>>       increment dupe counter
>>>   endwhile }
>>> else {
>>>   while not eof
>>>     set variable1 to the contents of variable2
>>>     read line next into variable2
>>>     compare variable1.MD5 to variable2.MD5
>>>     if match
>>>       prefix variable1 with $dupe
>>>       append variable1 as new line of new csv file
>>>     else
>>>       increment dupe counter
>>>   endwhile
>>>
>>> I realize I could be way off base on the algorithm here, but that's
>>> what I've been able to dream up.
>>>
>>> Anyone care to critique and offer syntax suggestions - my googlefu is
>>> about exhausted.
>>>
>>> Kurt
>>>
>>> On Thu, Jul 30, 2015 at 12:45 PM, Kurt Buff <[email protected]> wrote:
>>>> I'm putting together what should be a simple little script, and failing.
>>>>
>>>> I am ultimately looking to run this against a directory, then sort
>>>> the output on the hash field and then parse for duplicates. There
>>>> are two conditions that concern me: 1) there are over 3m files in
>>>> the target directory, and 2) many of the files are quite large, over 1g.
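
One way the "parse for duplicates" step might look, also covering the set counter and the duplicated-space sum (n-1 file sizes per set) mentioned in the Aug 3 message. This is a sketch under assumptions, not a tested solution for 3m files: the sample rows are made up, DupeSet is a hypothetical column name, and Group-Object collects every row in memory before it emits anything, which may matter at that scale.

```powershell
# Hypothetical rows in the same shape as the hashed CSV (Hash, Length, FullName).
# Import-Csv would deliver Length as a string, so the sample does too.
$rows = @(
    [pscustomobject]@{ Hash = 'AAAA'; Length = '100'; FullName = 'S:\a\one.bin' }
    [pscustomobject]@{ Hash = 'BBBB'; Length = '250'; FullName = 'S:\a\two.bin' }
    [pscustomobject]@{ Hash = 'BBBB'; Length = '250'; FullName = 'S:\b\two.bin' }
    [pscustomobject]@{ Hash = 'BBBB'; Length = '250'; FullName = 'S:\c\two.bin' }
)

$setNumber = 0
$wasted    = [long]0

# Keep only hashes that occur more than once, then number each set.
$groups = $rows | Group-Object -Property Hash | Where-Object { $_.Count -gt 1 }

$dupes = foreach ($group in $groups) {
    $setNumber++

    # All but one copy in a set is redundant: (n - 1) * size of one file.
    $wasted += ($group.Count - 1) * [long]$group.Group[0].Length

    foreach ($file in $group.Group) {
        [pscustomobject]@{
            DupeSet  = $setNumber
            Hash     = $file.Hash
            Length   = $file.Length
            FullName = $file.FullName
        }
    }
}
```

With these sample rows, $dupes holds the three 'BBBB' files in set 1 and $wasted is 500. Piping $dupes to Export-Csv -NoTypeInformation would write them out in the same style as the other steps.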
>>>>
>>>> I'm more concerned about the effects of the script on memory than on
>>>> processor - the data is fairly static, and I intend to run it once a
>>>> month or even less, but I did choose MD5 as the hash algorithm for
>>>> speed, rather than accept the default of SHA256.
>>>>
>>>> This is pretty simple stuff, I'm sure, but I'm using this as a
>>>> learning exercise more than anything, as there are duplicate file
>>>> finders out in the world already.
>>>>
>>>> There are several problems with what I have put together so far,
>>>> which is this:
>>>>
>>>> Get-ChildItem c:\stuff -Recurse | select length, fullname |
>>>> export-csv -NoTypeInformation c:\temp\files.csv
>>>> Import-CSV C:\temp\files.csv | ForEach-Object { (get-filehash
>>>> -algorithm md5 $_.FullName) }; Length | Sort hash
>>>>
>>>> Using Length (or $_.Length) anywhere in the foreach statement gives
>>>> an error, or gives weird output.
>>>>
>>>> Sample Output when not using Length, and therefore getting
>>>> reasonable output (extra spaces and hyphen delimiters elided):
>>>> Algorithm Hash                             Path
>>>> MD5       592BE1AD0ED83C36D5E68CA7A014A510 C:\stuff\Tools\SomeFile.DOC
>>>>
>>>> What I'd like to see instead
>>>> Hash                             Length Path
>>>> 592BE1AD0ED83C36D5E68CA7A014A510 79872  C:\stuff\Tools\SomeFile.DOC
>>>>
>>>> If anyone can offer some instruction, I'd appreciate it.
>>>>
>>>> Kurt
>>>
>>>
>>> ================================================
>>> Did you know you can also post and find answers on PowerShell in the forums?
>>> http://www.myitforum.com/forums/default.asp?catApp=1
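
A possible explanation for the handful of files that ignored the sort in the Aug 3 message, offered as a guess rather than a diagnosis: the sort key there is $_.Length -as [int], and -as returns $null when a conversion fails. The 3182604288-byte ISO in the sample output is larger than [int]::MaxValue (2147483647), so any file over 2 GB would get the same $null key and would not be ordered by size with the rest. A small sketch with made-up rows, using [long] as the key instead:

```powershell
# -as returns $null instead of throwing when a conversion fails.
$asInt  = '3182604288' -as [int]    # too large for a 32-bit integer
$asLong = '3182604288' -as [long]   # fits in a 64-bit integer

# Hypothetical rows; Import-Csv delivers Length as a string like this.
$rows = @(
    [pscustomobject]@{ Length = '41342' }
    [pscustomobject]@{ Length = '3182604288' }
    [pscustomobject]@{ Length = '38752' }
)

# Sorting on a 64-bit key keeps the large file in numeric order.
$sorted = $rows | Sort-Object @{ Expression = { $_.Length -as [long] } }

# Note: $sorted.Length would return the array's own length,
# so enumerate the rows to read each row's Length column.
$sorted | ForEach-Object { $_.Length }
```

If the 22 stray files all turn out to be larger than 2 GB, that would support this guess; if not, the cause lies elsewhere.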
