To diagnose this issue, I ran some benchmarks with time-tested tools, on the same directory:
find DIR -type f -exec md5 {} \;
*5.36s user 2.93s system 50% cpu 16.552 total*

Adding a hashmap on top of that wouldn't significantly increase the time.

Making this multi-process (32 processes):

find DIR -type f -print0 | xargs -0 -n 1 -P 32 md5
*5.32s user 3.24s system 43% cpu 19.503 total*

With 64 processes, matching GOMAXPROCS=64 on this machine:

find DIR -type f -print0 | xargs -0 -n 1 -P 64 md5
*5.31s user 3.66s system 42% cpu 20.999 total*

So it seems disk access is the bottleneck, as it should be, and the biggest
performance hit comes from the synchronization.

I wrote a Python script to do the same; the code is here:
https://github.com/hbfs/dupe_check/blob/master/dupe_check.py
*2.97s user 0.92s system 24% cpu 15.590 total, memory usage is ~8MB*

My next step is to try a single-threaded/single-goroutine version in Go to
replicate this level of performance and get a deeper understanding of how Go
is built and how to use it more effectively. Advice appreciated! (A sketch of
such a single-goroutine version is included after the quoted message below.)

On Saturday, October 15, 2016 at 5:15:29 AM UTC-4, Sri G wrote:
>
> I wrote a multi-threaded duplicate file checker using md5, here is the
> complete source:
> https://github.com/hbfs/dupe_check/blob/master/dupe_check.go
>
> Benched two variants on the same machine, on the same set of files (~1.7GB
> folder with ~600 files, each avg 3MB), multiple times, purging the disk
> cache between runs.
>
> With this code:
>
>     hash := md5.New()
>
>     if _, err := io.Copy(hash, file); err != nil {
>         fmt.Println(err)
>     }
>
>     var md5sum [md5.Size]byte
>     copy(md5sum[:], hash.Sum(nil)[:16])
>
> *// 3.35s user 105.20s system 213% cpu 50.848 total, memory usage is ~30MB*
>
> With this code:
>
>     data, err := ioutil.ReadFile(path)
>     if err != nil {
>         fmt.Println(err)
>     }
>
>     md5sum := md5.Sum(data)
>
> *// 3.10s user 31.75s system 104% cpu 33.210 total, memory usage is ~1.52GB*
>
> The memory usage makes sense, but why is the streaming version ~3x slower
> than the version that reads the entire file into memory? This trade-off
> doesn't make sense to me: in both cases the file is read from disk, which
> should be the limiting factor, and then the md5sum is computed.
>
> In the streaming version there is an extra copy from []byte to [16]byte,
> but that should be negligible.
>
> The only theory I can think of is context switching:
>
> streaming version:
> disk -> processor
> The processor is waiting on a disk read, so it switches to reading another
> file, putting the thread to sleep.
>
> entire file:
> disk -> memory -> processor
> The file is already in memory, so there is not as much context switching.
>
> What do you think? Thanks!
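For reference, this is the kind of single-goroutine version I mean. It is a
minimal, hypothetical sketch (not code from the repo above): one goroutine
walks the tree, streams each regular file through md5, and groups paths in a
map keyed by digest. The map layout, output format, and error handling are my
assumptions for illustration.

    // dupe_check DIR: single-goroutine duplicate finder (illustrative sketch).
    package main

    import (
        "crypto/md5"
        "fmt"
        "io"
        "os"
        "path/filepath"
    )

    func main() {
        if len(os.Args) != 2 {
            fmt.Fprintln(os.Stderr, "usage: dupe_check DIR")
            os.Exit(2)
        }

        // Map each md5 digest to the list of files that hash to it.
        byDigest := make(map[[md5.Size]byte][]string)

        err := filepath.Walk(os.Args[1], func(path string, info os.FileInfo, err error) error {
            if err != nil || !info.Mode().IsRegular() {
                return err // skip directories, symlinks, etc.
            }
            f, err := os.Open(path)
            if err != nil {
                return err
            }
            defer f.Close()

            // Stream the file through the hash with io.Copy's small internal
            // buffer, so memory stays flat regardless of file size.
            h := md5.New()
            if _, err := io.Copy(h, f); err != nil {
                return err
            }
            var sum [md5.Size]byte
            copy(sum[:], h.Sum(nil))
            byDigest[sum] = append(byDigest[sum], path)
            return nil
        })
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }

        // Report any digest seen more than once.
        for sum, paths := range byDigest {
            if len(paths) > 1 {
                fmt.Printf("%x:\n", sum)
                for _, p := range paths {
                    fmt.Printf("  %s\n", p)
                }
            }
        }
    }

If this lands near the Python script's ~15.6s total, that would support the
conclusion that the concurrent version's overhead is in synchronization
rather than in hashing or disk access.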