Yea :/ Appreciate you sharing your project and your code! I learned a lot of useful Go patterns (like referencing a fixed-size byte array as a slice) and how to re-use byte buffers, as in the Python version, to keep memory usage down.
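For anyone finding this thread later, the two patterns I mean look roughly like this. This is a minimal sketch of my own, not Michael's actual code; hashFile is just an illustrative name:

package main

import (
	"crypto/md5"
	"fmt"
	"io"
	"os"
)

// Hash is a fixed-size byte array; hash[:] gives a []byte view of it
// whenever a slice is needed, without a separate allocation.
type Hash [md5.Size]byte

// hashFile reuses buf across calls instead of allocating a fresh buffer per file.
func hashFile(path string, buf []byte) (Hash, error) {
	var sum Hash
	f, err := os.Open(path)
	if err != nil {
		return sum, err
	}
	defer f.Close()

	h := md5.New()
	if _, err := io.CopyBuffer(h, f, buf); err != nil {
		return sum, err
	}
	copy(sum[:], h.Sum(nil)) // fixed-size array written through its slice view
	return sum, nil
}

func main() {
	buf := make([]byte, 65536) // one buffer, reused for every file
	for _, p := range os.Args[1:] {
		sum, err := hashFile(p, buf)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%x  %s\n", sum, p)
	}
}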
On Sunday, October 16, 2016 at 4:32:34 PM UTC-4, Michael Jones wrote:
>
> Oh, I see. Well, if you must read and hash every byte of every file then
> you really are mostly measuring device speed.
>
> *From: *<golan...@googlegroups.com> on behalf of Sri G <sriakhil...@gmail.com>
> *Date: *Sunday, October 16, 2016 at 12:17 PM
> *To: *golang-nuts <golan...@googlegroups.com>
> *Cc: *<sriakhil...@gmail.com>
> *Subject: *Re: [go-nuts] Re: Duplicate File Checker Performance
>
> This isn't exactly the same because I deleted some files, but it shouldn't
> really matter.
>
> Switched to md5:
>
> --- a/dup.go
> +++ b/dup.go
> @@ -3,7 +3,7 @@ package main
> - "crypto/sha256"
> + "crypto/md5"
>
> @@ -207,8 +207,8 @@ func main() {
> +type Hash [16]byte // appropriate for MD5
> +// type Hash [32]byte // appropriate for SHA-256
>
> func hashFile(p string, hash []byte, prefix int64) (count int64) {
>
> @@ -221,8 +221,8 @@ func hashFile(p string, hash []byte, prefix int64) (count int64) {
> + hasher := md5.New() // select MD5 in concert with "Hash" above
> + // hasher := sha256.New() // select SHA-256 in concert with "Hash" above
>
> Checking only same-sized files is a huge speedup (82x fewer bytes checked):
>
> 2016/10/16 14:33:51 total:      566 files (100.00%), 1667774744 bytes (100.00%)
> 2016/10/16 14:33:51 examined:     9 files (  1.59%),   20271440 bytes (  1.22%) in 0.4209 seconds
> 2016/10/16 14:33:51 duplicates:   9 files (  1.59%),   20271440 bytes (  1.22%)
>
> Checking the first 4KB of files and only hashing them in full when the
> prefixes match is another cool optimization (on average 768x fewer bytes
> checked in my case). Really nice, Michael.
>
> With workers = 8:
>
> RAID10: *0.05s user 0.04s system 37% cpu 0.231 total* (couldn't check
> memory usage, but it's probably negligible)
> SSD: *0.05s user 0.04s system 59% cpu 0.137 total*
>
> Since SSDs and my filesystem are optimized for 4K random reads, it makes
> sense to use multiple threads/goroutines.
>
> Optimal number of workers = 9 on RAID10: *0.05s user 0.04s system 40% cpu 0.220 total*
> On SSD, workers = 8~9: *0.04s user 0.04s system 68% cpu 0.117 total*
>
> Not so much when you're doing a full sequential read. Because I use the
> md5 for other purposes, the entire file must be hashed, so sadly I can't
> use these optimizations.
>
> On Sunday, October 16, 2016 at 1:26:24 PM UTC-4, Michael Jones wrote:
>
> Sri G,
>
> How does this time compare to my “Dup” program? I can’t test for you…since
> it is your filesystem…but I thought I had it going about as fast as
> possible a few years ago when I wrote that one.
>
> https://github.com/MichaelTJones/dup
>
> Michael
>
> *From: *<golan...@googlegroups.com> on behalf of Sri G <sriakhil...@gmail.com>
> *Date: *Saturday, October 15, 2016 at 6:46 PM
> *To: *golang-nuts <golan...@googlegroups.com>
> *Subject: *[go-nuts] Re: Duplicate File Checker Performance
>
> Thanks. Made the Go code similar to the Python version using CopyBuffer
> with a block size of 65536:
>
>     buf := make([]byte, 65536)
>
>     if _, err := io.CopyBuffer(hash, file, buf); err != nil {
>         fmt.Println(err)
>     }
>
> Didn't make too much of a difference; it was slightly faster.
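For anyone skimming the thread: the two cheap checks praised above compose like this — group candidate files by size, then by a hash of the first 4 KiB, and only full-hash whatever still collides. A rough sketch of that ordering (not Michael's actual dup code; the 4096-byte prefix size and all names here are just illustrative):

package main

import (
	"crypto/md5"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// hashFile hashes at most limit bytes of the file (limit <= 0 means the whole file).
func hashFile(path string, limit int64) ([md5.Size]byte, error) {
	var sum [md5.Size]byte
	f, err := os.Open(path)
	if err != nil {
		return sum, err
	}
	defer f.Close()

	var r io.Reader = f
	if limit > 0 {
		r = io.LimitReader(f, limit)
	}
	h := md5.New()
	if _, err := io.Copy(h, r); err != nil {
		return sum, err
	}
	copy(sum[:], h.Sum(nil))
	return sum, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: dupsketch <dir>")
		os.Exit(1)
	}

	// Pass 1: group by size; a file with a unique size can never be a duplicate.
	bySize := make(map[int64][]string)
	filepath.Walk(os.Args[1], func(p string, info os.FileInfo, err error) error {
		if err == nil && info.Mode().IsRegular() {
			bySize[info.Size()] = append(bySize[info.Size()], p)
		}
		return nil
	})

	for size, paths := range bySize {
		if len(paths) < 2 {
			continue
		}
		// Pass 2: among same-sized files, group by a hash of the first 4 KiB.
		byPrefix := make(map[[md5.Size]byte][]string)
		for _, p := range paths {
			if sum, err := hashFile(p, 4096); err == nil {
				byPrefix[sum] = append(byPrefix[sum], p)
			}
		}
		// Pass 3: only files that still collide get a full-content hash.
		for _, group := range byPrefix {
			if len(group) < 2 {
				continue
			}
			byFull := make(map[[md5.Size]byte][]string)
			for _, p := range group {
				if sum, err := hashFile(p, 0); err == nil {
					byFull[sum] = append(byFull[sum], p)
				}
			}
			for _, dups := range byFull {
				if len(dups) > 1 {
					fmt.Println(size, dups)
				}
			}
		}
	}
}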
> What got it to the same place was running ComputeHash in the same goroutine
> as the Walk function vs. its own goroutine for each file:
>
> + ComputeHash(path, info, queue, wg)
> - go ComputeHash(path, info, queue, wg)
>
> *2.88s user 0.98s system 23% cpu 16.086 total, memory usage ~ 7MB*
>
> Here are the before and after pprof webs:
>
> BEFORE, with 'go ComputeHash(...)':
> <https://lh3.googleusercontent.com/-aRKwq1P9_ec/WALXpkl_yxI/AAAAAAAADFY/WXn0PDcOw_Mk909yNp9Hh1tWUl0PlSVJACLcB/s1600/prof.cpu-with-go_compute_hash.png>
>
> AFTER, with 'ComputeHash(...)':
> <https://lh3.googleusercontent.com/-8LnYMr_UhOg/WALXsbuxkjI/AAAAAAAADFc/EAk7vOvl2zMJZARfcz2JpgXmZc_3YfFKwCLcB/s1600/prof.cpu-no-go.png>
>
> Since disk reads are so much slower, computing the hash for each file in
> its own goroutine caused a huge slowdown.
>
> By the way, this is on a RAID10, with SSD:
>
> Old code, SSD: *3.31s user 17.87s system 244% cpu 8.667 total*
>
> New code, SSD: *2.88s user 0.84s system 69% cpu 5.369 total*
>
> Shows you can throw hardware at a problem, BUT the old code locks up my
> system momentarily.
>
> On Saturday, October 15, 2016 at 3:27:38 PM UTC-4, Kevin Malachowski wrote:
>
> Sorry, I meant that calling Write on the hash type might be slower if it's
> called more often.
>
> (I'm on mobile right now. When I get back to a keyboard I'll try to come
> up with an example.)
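To tie the worker-count numbers above back to code: those come from a small fixed pool of hashing goroutines fed from the directory walk, rather than one goroutine per file. Roughly this shape (a minimal sketch with my own naming, not the actual dup code; workers = 8 is just the value that happened to work well on my SSD):

package main

import (
	"crypto/md5"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sync"
)

// hashWorker drains the paths channel; a small fixed pool of these bounds how
// many files are being read at once, instead of one goroutine per file.
func hashWorker(paths <-chan string, wg *sync.WaitGroup) {
	defer wg.Done()
	buf := make([]byte, 65536) // per-worker buffer, reused across files
	for p := range paths {
		f, err := os.Open(p)
		if err != nil {
			fmt.Println(err)
			continue
		}
		h := md5.New()
		if _, err := io.CopyBuffer(h, f, buf); err != nil {
			fmt.Println(err)
		}
		f.Close()
		fmt.Printf("%x  %s\n", h.Sum(nil), p)
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: hashsketch <dir>")
		os.Exit(1)
	}

	const workers = 8 // tune for the device; 8–9 was the sweet spot reported above

	paths := make(chan string, workers)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go hashWorker(paths, &wg)
	}

	// The walk stays in one goroutine and only feeds paths to the pool.
	filepath.Walk(os.Args[1], func(p string, info os.FileInfo, err error) error {
		if err == nil && info.Mode().IsRegular() {
			paths <- p
		}
		return nil
	})
	close(paths)
	wg.Wait()
}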