Yea :/ Appreciate you sharing your project and your code! I learned a lot of useful Go patterns (like referencing a fixed-size byte array as a slice) and how to re-use byte buffers, as in the Python version, to keep memory usage down.
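For anyone finding this thread later, the two patterns I mean look roughly like this. This is a minimal sketch of my own, not Michael's actual code; hashFile is just an illustrative name:

package main

import (
	"crypto/md5"
	"fmt"
	"io"
	"os"
)

// Hash is a fixed-size byte array; hash[:] gives a []byte view of it
// whenever a slice is needed, without a separate allocation.
type Hash [md5.Size]byte

// hashFile reuses buf across calls instead of allocating a fresh buffer per file.
func hashFile(path string, buf []byte) (Hash, error) {
	var sum Hash
	f, err := os.Open(path)
	if err != nil {
		return sum, err
	}
	defer f.Close()

	h := md5.New()
	if _, err := io.CopyBuffer(h, f, buf); err != nil {
		return sum, err
	}
	copy(sum[:], h.Sum(nil)) // fixed-size array written through its slice view
	return sum, nil
}

func main() {
	buf := make([]byte, 65536) // one buffer, reused for every file
	for _, p := range os.Args[1:] {
		sum, err := hashFile(p, buf)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%x  %s\n", sum, p)
	}
}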
On Sunday, October 16, 2016 at 4:32:34 PM UTC-4, Michael Jones wrote:
>
> Oh, I see. Well, if you must read and hash every byte of every file then
> you really are mostly measuring device speed.
>
> *From: *<golan...@googlegroups.com> on behalf of Sri G <sriakhil...@gmail.com>
> *Date: *Sunday, October 16, 2016 at 12:17 PM
> *To: *golang-nuts <golan...@googlegroups.com>
> *Cc: *<sriakhil...@gmail.com>
> *Subject: *Re: [go-nuts] Re: Duplicate File Checker Performance
>
> This isn't exactly the same because I deleted some files, but it shouldn't
> really matter.
>
> Switched to md5:
>
> --- a/dup.go
> +++ b/dup.go
> @@ -3,7 +3,7 @@ package main
> - "crypto/sha256"
> + "crypto/md5"
>
> @@ -207,8 +207,8 @@ func main() {
> +type Hash [16]byte // appropriate for MD5
> +// type Hash [32]byte // appropriate for SHA-256
>
> func hashFile(p string, hash []byte, prefix int64) (count int64) {
>
> @@ -221,8 +221,8 @@ func hashFile(p string, hash []byte, prefix int64) (count int64) {
> + hasher := md5.New() // select MD5 in concert with "Hash" above
> + // hasher := sha256.New() // select SHA-256 in concert with "Hash" above
>
> Checking only same-sized files is a huge speedup (82x fewer bytes checked):
>
> 2016/10/16 14:33:51 total:      566 files (100.00%), 1667774744 bytes (100.00%)
> 2016/10/16 14:33:51 examined:     9 files (  1.59%),   20271440 bytes (  1.22%) in 0.4209 seconds
> 2016/10/16 14:33:51 duplicates:   9 files (  1.59%),   20271440 bytes (  1.22%)
>
> Checking the first 4KB of files and only hashing them in full when the
> prefixes match is another cool optimization (on average 768x fewer bytes
> checked in my case). Really nice, Michael.
>
> With workers = 8:
>
> RAID10: *0.05s user 0.04s system 37% cpu 0.231 total* (couldn't check
> memory usage, but it's probably negligible)
> SSD: *0.05s user 0.04s system 59% cpu 0.137 total*
>
> Since SSDs and my filesystem are optimized for 4K random reads, it makes
> sense to use multiple threads/goroutines.
>
> Optimal number of workers = 9 on RAID10: *0.05s user 0.04s system 40% cpu 0.220 total*
> On SSD, workers = 8~9: *0.04s user 0.04s system 68% cpu 0.117 total*
>
> Not so much when you're doing a full sequential read. Because I use the
> md5 for other purposes, the entire file must be hashed, so sadly I can't
> use these optimizations.
>
> On Sunday, October 16, 2016 at 1:26:24 PM UTC-4, Michael Jones wrote:
>
> Sri G,
>
> How does this time compare to my “Dup” program? I can’t test for you…since
> it is your filesystem…but I thought I had it going about as fast as
> possible a few years ago when I wrote that one.
>
> https://github.com/MichaelTJones/dup
>
> Michael
>
> *From: *<golan...@googlegroups.com> on behalf of Sri G <sriakhil...@gmail.com>
> *Date: *Saturday, October 15, 2016 at 6:46 PM
> *To: *golang-nuts <golan...@googlegroups.com>
> *Subject: *[go-nuts] Re: Duplicate File Checker Performance
>
> Thanks. Made the Go code similar to the Python version using CopyBuffer
> with a block size of 65536:
>
>     buf := make([]byte, 65536)
>
>     if _, err := io.CopyBuffer(hash, file, buf); err != nil {
>         fmt.Println(err)
>     }
>
> Didn't make too much of a difference; it was slightly faster.
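For anyone skimming the thread: the two cheap checks praised above compose like this — group candidate files by size, then by a hash of the first 4 KiB, and only full-hash whatever still collides. A rough sketch of that ordering (not Michael's actual dup code; the 4096-byte prefix size and all names here are just illustrative):

package main

import (
	"crypto/md5"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// hashFile hashes at most limit bytes of the file (limit <= 0 means the whole file).
func hashFile(path string, limit int64) ([md5.Size]byte, error) {
	var sum [md5.Size]byte
	f, err := os.Open(path)
	if err != nil {
		return sum, err
	}
	defer f.Close()

	var r io.Reader = f
	if limit > 0 {
		r = io.LimitReader(f, limit)
	}
	h := md5.New()
	if _, err := io.Copy(h, r); err != nil {
		return sum, err
	}
	copy(sum[:], h.Sum(nil))
	return sum, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: dupsketch <dir>")
		os.Exit(1)
	}

	// Pass 1: group by size; a file with a unique size can never be a duplicate.
	bySize := make(map[int64][]string)
	filepath.Walk(os.Args[1], func(p string, info os.FileInfo, err error) error {
		if err == nil && info.Mode().IsRegular() {
			bySize[info.Size()] = append(bySize[info.Size()], p)
		}
		return nil
	})

	for size, paths := range bySize {
		if len(paths) < 2 {
			continue
		}
		// Pass 2: among same-sized files, group by a hash of the first 4 KiB.
		byPrefix := make(map[[md5.Size]byte][]string)
		for _, p := range paths {
			if sum, err := hashFile(p, 4096); err == nil {
				byPrefix[sum] = append(byPrefix[sum], p)
			}
		}
		// Pass 3: only files that still collide get a full-content hash.
		for _, group := range byPrefix {
			if len(group) < 2 {
				continue
			}
			byFull := make(map[[md5.Size]byte][]string)
			for _, p := range group {
				if sum, err := hashFile(p, 0); err == nil {
					byFull[sum] = append(byFull[sum], p)
				}
			}
			for _, dups := range byFull {
				if len(dups) > 1 {
					fmt.Println(size, dups)
				}
			}
		}
	}
}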
> What got it to the same place was running ComputeHash in the same goroutine
> as the Walk function vs. its own goroutine for each file:
>
> + ComputeHash(path, info, queue, wg)
> - go ComputeHash(path, info, queue, wg)
>
> *2.88s user 0.98s system 23% cpu 16.086 total, memory usage ~ 7MB*
>
> Here are the before and after pprof webs:
>
> BEFORE, with 'go ComputeHash(...)':
> <https://lh3.googleusercontent.com/-aRKwq1P9_ec/WALXpkl_yxI/AAAAAAAADFY/WXn0PDcOw_Mk909yNp9Hh1tWUl0PlSVJACLcB/s1600/prof.cpu-with-go_compute_hash.png>
>
> AFTER, with 'ComputeHash(...)':
> <https://lh3.googleusercontent.com/-8LnYMr_UhOg/WALXsbuxkjI/AAAAAAAADFc/EAk7vOvl2zMJZARfcz2JpgXmZc_3YfFKwCLcB/s1600/prof.cpu-no-go.png>
>
> Since disk reads are so much slower, computing the hash for each file in
> its own goroutine caused a huge slowdown.
>
> By the way, this is on a RAID10, with SSD:
>
> Old code, SSD: *3.31s user 17.87s system 244% cpu 8.667 total*
>
> New code, SSD: *2.88s user 0.84s system 69% cpu 5.369 total*
>
> Shows you can throw hardware at a problem, BUT the old code locks up my
> system momentarily.
>
> On Saturday, October 15, 2016 at 3:27:38 PM UTC-4, Kevin Malachowski wrote:
>
> Sorry, I meant that calling Write on the hash type might be slower if it's
> called more often.
>
> (I'm on mobile right now. When I get back to a keyboard I'll try to come
> up with an example.)
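To tie the worker-count numbers above back to code: those come from a small fixed pool of hashing goroutines fed from the directory walk, rather than one goroutine per file. Roughly this shape (a minimal sketch with my own naming, not the actual dup code; workers = 8 is just the value that happened to work well on my SSD):

package main

import (
	"crypto/md5"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sync"
)

// hashWorker drains the paths channel; a small fixed pool of these bounds how
// many files are being read at once, instead of one goroutine per file.
func hashWorker(paths <-chan string, wg *sync.WaitGroup) {
	defer wg.Done()
	buf := make([]byte, 65536) // per-worker buffer, reused across files
	for p := range paths {
		f, err := os.Open(p)
		if err != nil {
			fmt.Println(err)
			continue
		}
		h := md5.New()
		if _, err := io.CopyBuffer(h, f, buf); err != nil {
			fmt.Println(err)
		}
		f.Close()
		fmt.Printf("%x  %s\n", h.Sum(nil), p)
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: hashsketch <dir>")
		os.Exit(1)
	}

	const workers = 8 // tune for the device; 8–9 was the sweet spot reported above

	paths := make(chan string, workers)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go hashWorker(paths, &wg)
	}

	// The walk stays in one goroutine and only feeds paths to the pool.
	filepath.Walk(os.Args[1], func(p string, info os.FileInfo, err error) error {
		if err == nil && info.Mode().IsRegular() {
			paths <- p
		}
		return nil
	})
	close(paths)
	wg.Wait()
}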