Oh, I see. Well, if you must read and hash every byte of every file, then you really are mostly measuring device speed.

From: <golang-nuts@googlegroups.com> on behalf of Sri G <sriakhil.gogin...@gmail.com>
Date: Sunday, October 16, 2016 at 12:17 PM
To: golang-nuts <golang-nuts@googlegroups.com>
Cc: <sriakhil.gogin...@gmail.com>
Subject: Re: [go-nuts] Re: Duplicate File Checker Performance

This isn't exactly the same because I deleted some files, but it shouldn't really matter.

Switched to md5:

--- a/dup.go
+++ b/dup.go
@@ -3,7 +3,7 @@ package main
- "crypto/sha256"
+ "crypto/md5"
@@ -207,8 +207,8 @@ func main() {
+type Hash [16]byte // appropriate for MD5
+// type Hash [32]byte // appropriate for SHA-256
 func hashFile(p string, hash []byte, prefix int64) (count int64) {
@@ -221,8 +221,8 @@ func hashFile(p string, hash []byte, prefix int64) (count int64) {
+ hasher := md5.New() // select MD5 in concert with "Hash" above
+ // hasher := sha256.New() // select SHA-256 in concert with "Hash" above

Checking only same-sized files is a huge speedup (82x fewer bytes checked):

2016/10/16 14:33:51 total: 566 files ( 100.00%), 1667774744 bytes ( 100.00%)
2016/10/16 14:33:51 examined: 9 files ( 1.59%), 20271440 bytes ( 1.22%) in 0.4209 seconds
2016/10/16 14:33:51 duplicates: 9 files ( 1.59%), 20271440 bytes ( 1.22%)

Checking the first 4KB of files and only hashing if they are the same is another cool optimization (it checks on average 768x fewer bytes in my case). Really nice, Michael.

With workers = 8:

RAID10: 0.05s user 0.04s system 37% cpu 0.231 total (couldn't check memory usage, but it's probably negligible)
SSD: 0.05s user 0.04s system 59% cpu 0.137 total

Since SSDs and my filesystem are optimized for 4K random reads, it makes sense to use multiple threads/goroutines. Optimal number of workers:

RAID10, workers = 9: 0.05s user 0.04s system 40% cpu 0.220 total
SSD, workers = 8~9: 0.04s user 0.04s system 68% cpu 0.117 total

Not so much when you're doing a full sequential read. Because I use the md5 for other purposes, the entire file must be hashed, so sadly I can't use these optimizations.

On Sunday, October 16, 2016 at 1:26:24 PM UTC-4, Michael Jones wrote:

Sri G,

How does this time compare to my “Dup” program? I can’t test for you…since it is your filesystem…but I thought I had it going about as fast as possible a few years ago when I wrote that one.

https://github.com/MichaelTJones/dup

Michael

From: <golan...@googlegroups.com> on behalf of Sri G <sriakhil...@gmail.com>
Date: Saturday, October 15, 2016 at 6:46 PM
To: golang-nuts <golan...@googlegroups.com>
Subject: [go-nuts] Re: Duplicate File Checker Performance

Thanks. Made the Go code similar to the Python by using CopyBuffer with a block size of 65536:

buf := make([]byte, 65536)
if _, err := io.CopyBuffer(hash, file, buf); err != nil {
	fmt.Println(err)
}

It didn't make much of a difference; it was slightly faster. What got it to the same place was running ComputeHash in the same goroutine as the Walk function instead of in its own goroutine for each file:

+ ComputeHash(path, info, queue, wg)
- go ComputeHash(path, info, queue, wg)

2.88s user 0.98s system 23% cpu 16.086 total, memory usage ~ 7MB

Here are the before and after pprof webs:

BEFORE, with 'go ComputeHash(...)': [pprof web image not preserved]
AFTER, with 'ComputeHash(...)': [pprof web image not preserved]

Since disk reads are SOO much slower, computing the hash for each file in its own goroutine caused a huge slowdown.

By the way, this is on a RAID10. With SSD:

Old code SSD: 3.31s user 17.87s system 244% cpu 8.667 total
New code SSD: 2.88s user 0.84s system 69% cpu 5.369 total

Shows you can throw hardware at a problem, BUT the old code locks up my system momentarily.

On Saturday, October 15, 2016 at 3:27:38 PM UTC-4, Kevin Malachowski wrote:

Sorry, I meant that calling Write on the hash type might be slower if it's called more often.

(I'm on mobile right now. When I get back to a keyboard I'll try to come up with an example.)

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.