This isn't exactly the same run because I deleted some files, but it shouldn't really matter.

Switched to MD5:

--- a/dup.go
+++ b/dup.go
@@ -3,7 +3,7 @@ package main
-       "crypto/sha256"
+       "crypto/md5"

@@ -207,8 +207,8 @@ func main() {
+type Hash [16]byte // appropriate for MD5
+// type Hash [32]byte // appropriate for SHA-256

 func hashFile(p string, hash []byte, prefix int64) (count int64) {
@@ -221,8 +221,8 @@ func hashFile(p string, hash []byte, prefix int64) (count int64)
+       hasher := md5.New() // select MD5 in concert with "Hash" above
+       // hasher := sha256.New() // select SHA-256 in concert with "Hash" above


Checking only same-sized files is a huge speed up (82x fewer bytes checked):

2016/10/16 14:33:51      total:      566 files ( 100.00%),    1667774744 bytes ( 100.00%)
2016/10/16 14:33:51   examined:        9 files (   1.59%),      20271440 bytes (   1.22%) in 0.4209 seconds
2016/10/16 14:33:51 duplicates:        9 files (   1.59%),      20271440 bytes (   1.22%)

Checking the first 4KB of files and only hashing them when those bytes match is another cool optimization (on average, 768x fewer bytes checked in my case). Really nice, Michael.

With workers = 8:

RAID10: *0.05s user 0.04s system 37% cpu 0.231 total* (couldn't check memory usage, but it's probably negligible)
SSD:    *0.05s user 0.04s system 59% cpu 0.137 total*

Since SSDs and my filesystem are optimized for 4K random reads, it makes sense to use multiple threads/goroutines.

Optimal # of workers = 9 on RAID 10: *0.05s user 0.04s system 40% cpu 0.220 total*
On SSD, workers = 8-9:               *0.04s user 0.04s system 68% cpu 0.117 total*

Not so much when you're doing a full sequential read. Because I use the MD5 for other purposes, the entire file must be hashed, so sadly I can't use these optimizations.

On Sunday, October 16, 2016 at 1:26:24 PM UTC-4, Michael Jones wrote:
>
> Sri G,
>
>  
>
> How does this time compare to my “Dup” program? I can’t test for you…since 
> it is your filesystem…but I thought I had it going about as fast as 
> possible a few years ago when I wrote that one.
>
>  
>
> https://github.com/MichaelTJones/dup
>
>  
>
> Michael
>
>  
>
> *From: *<golan...@googlegroups.com> on behalf of Sri G <sriakhil...@gmail.com>
> *Date: *Saturday, October 15, 2016 at 6:46 PM
> *To: *golang-nuts <golan...@googlegroups.com>
> *Subject: *[go-nuts] Re: Duplicate File Checker Performance
>
>  
>
> Thanks. Made the go code similar to python using CopyBuffer with a block 
> size of 65536. 
>
>  
>
>     buf := make([]byte, 65536)
>
>     
>
>     if _, err := io.CopyBuffer(hash, file, buf); err != nil {
>
>         fmt.Println(err)
>
>     }
>
>  
>
> Didn't make much of a difference; it was slightly faster.
>
>  
>
> What got it to the same place was running ComputeHash in the same 
> goroutine as the Walk function vs. its own goroutine for each file.
>
>  
>
> +    ComputeHash(path, info, queue, wg)
>
> -    go ComputeHash(path, info, queue, wg)
>
>  
>
>  
>
> *2.88s user 0.98s system 23% cpu 16.086 total, memory usage ~ 7MB*
>
> Here's the before and after pprof webs:
>
>  
>
> BEFORE with 'go ComputeHash(...):
>
>  
>
>
> <https://lh3.googleusercontent.com/-aRKwq1P9_ec/WALXpkl_yxI/AAAAAAAADFY/WXn0PDcOw_Mk909yNp9Hh1tWUl0PlSVJACLcB/s1600/prof.cpu-with-go_compute_hash.png>
>
>  
>
>  
>
> AFTER with 'ComputeHash(...):
>
>  
>
>
> <https://lh3.googleusercontent.com/-8LnYMr_UhOg/WALXsbuxkjI/AAAAAAAADFc/EAk7vOvl2zMJZARfcz2JpgXmZc_3YfFKwCLcB/s1600/prof.cpu-no-go.png>
>
>  
>
>  
>
> Since disk reads are SO much slower, computing the hash for each file in 
> its own goroutine caused a huge slowdown.
>
>  
>
> btw this is on a RAID10, with SSD: 
>
>  
>
> Old code SSD:* 3.31s user 17.87s system 244% cpu 8.667 total*
>
>  
>
> New code SSD:* 2.88s user 0.84s system 69% cpu 5.369 total*
>
>  
>
> Shows you can throw hardware at a problem, BUT the old code locks up my 
> system momentarily.
>
>  
>
>
> On Saturday, October 15, 2016 at 3:27:38 PM UTC-4, Kevin Malachowski wrote:
>
> Sorry, I meant that calling Write on the hash type might be slower if it's 
> called more often.
>
> (I'm on mobile right now. When I get back to a keyboard I'll try to come 
> up with an example)
>
> -- 
> You received this message because you are subscribed to the Google Groups 
> "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to golang-nuts...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>
>
