To diagnose this issue, I ran some benchmarks with time-tested tools:

On the same directory:

find DIR -type f -exec md5 {} \; 

*5.36s user 2.93s system 50% cpu 16.552 total*

Adding a hashmap on top of that wouldn't significantly increase the time.
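
To make concrete what that hashmap would do, here is a minimal sketch in Go 
with made-up (path, digest) pairs; in practice the digests would come from 
hashing each file as above:

    package main

    import "fmt"

    func main() {
      // Hypothetical (path, digest) pairs; in practice these come from hashing each file.
      sums := map[string]string{
        "a.jpg": "d41d8cd98f00b204e9800998ecf8427e",
        "b.jpg": "d41d8cd98f00b204e9800998ecf8427e",
        "c.jpg": "9e107d9d372bb6826bd81d3542a419d6",
      }

      // Group paths by digest; any digest with more than one path is a duplicate set.
      groups := make(map[string][]string)
      for path, sum := range sums {
        groups[sum] = append(groups[sum], path)
      }
      for sum, paths := range groups {
        if len(paths) > 1 {
          fmt.Println(sum, paths)
        }
      }
    }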

Making this multi-processed (32 processes): 

find DIR -type f -print0 | xargs -0 -n 1 -P 32 md5

*5.32s user 3.24s system 43% cpu 19.503 total*

With 64 processes, matching GOMAXPROCS=64 on this machine:

find DIR -type f -print0 | xargs -0 -n 1 -P 64 md5


*5.31s user 3.66s system 42% cpu 20.999 total*

So it seems disk access is the bottleneck, as it should be, and the biggest 
performance hit comes from synchronization.

I wrote a Python script to do the same; the code is here: 
https://github.com/hbfs/dupe_check/blob/master/dupe_check.py

*2.97s user 0.92s system 24% cpu 15.590 total, memory usage is ~ 8MB*

My next step is to try a single-threaded/single-goroutine version in Go to 
replicate this level of performance and get a deeper understanding of how Go 
is built and how to use it more effectively. Advice appreciated!
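
In case it helps frame the question, here is roughly the shape I have in mind 
for that single-goroutine version (just a sketch of the approach, not the code 
in the repo): walk the tree, stream each file through md5, and group paths by 
digest.

    package main

    import (
      "crypto/md5"
      "fmt"
      "io"
      "log"
      "os"
      "path/filepath"
    )

    func main() {
      root := os.Args[1]

      // Map digest -> all paths that hash to it.
      groups := make(map[[md5.Size]byte][]string)

      err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
        if err != nil {
          return err
        }
        if !info.Mode().IsRegular() {
          return nil
        }
        f, err := os.Open(path)
        if err != nil {
          log.Println(err)
          return nil
        }
        defer f.Close()

        // Stream the file through md5 so memory stays flat regardless of file size.
        h := md5.New()
        if _, err := io.Copy(h, f); err != nil {
          log.Println(err)
          return nil
        }
        var sum [md5.Size]byte
        copy(sum[:], h.Sum(nil))
        groups[sum] = append(groups[sum], path)
        return nil
      })
      if err != nil {
        log.Fatal(err)
      }

      // Any digest shared by more than one path is a duplicate set.
      for sum, paths := range groups {
        if len(paths) > 1 {
          fmt.Printf("%x %v\n", sum, paths)
        }
      }
    }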

On Saturday, October 15, 2016 at 5:15:29 AM UTC-4, Sri G wrote:
>
> I wrote a multi-threaded duplicate file checker using md5, here is the 
> complete source: 
> https://github.com/hbfs/dupe_check/blob/master/dupe_check.go
>
> Benched two variants on the same machine, on the same set of files (a ~1.7GB 
> folder with ~600 files, averaging ~3MB each), multiple times, purging the 
> disk cache between runs.
>
> With this code:
>
>     hash := md5.New()
>
>     if _, err := io.Copy(hash, file); err != nil {
>       fmt.Println(err)
>     }
>
>     var md5sum [md5.Size]byte
>     copy(md5sum[:], hash.Sum(nil)[:16])
>
> *// 3.35s user 105.20s system 213% cpu 50.848 total, memory usage is ~ 
> 30MB*
>
>
> With this code:
>
>   data, err := ioutil.ReadFile(path)
>   if err != nil {
>     fmt.Println(err)
>   }
>
>   md5sum := md5.Sum(data)
>
> * // 3.10s user 31.75s system 104% cpu 33.210 total, memory usage is ~ 
> 1.52GB*
>
> The memory usage makes sense, but why is the streaming version ~3x slower 
> than the version that reads the entire file into memory? This trade-off 
> doesn't make sense to me: in both cases the file is read from disk, which 
> should be the limiting factor, and then the md5sum is computed.
>
> In the streaming version, there is an extra copy from []byte to [16]byte 
> but that should be negligible.
>
> The only theory I can think of is context switching:
>
> streaming version:
> disk -> processor
> the processor is waiting on the disk read, so it switches to reading another 
> file, putting the thread to sleep.
>
> entire-file version:
> disk -> memory -> processor
> the file is already in memory, so there is not as much context switching.
>
> What do you think? Thanks!
>
