I wrote a multi-threaded duplicate file checker using MD5; the complete 
source is here: https://github.com/hbfs/dupe_check/blob/master/dupe_check.go

I benchmarked two variants on the same machine, on the same set of files 
(a ~1.7GB folder with ~600 files averaging ~3MB each), multiple times, 
purging the disk cache between runs.

With this code:

    hash := md5.New()

    if _, err := io.Copy(hash, file); err != nil {
      fmt.Println(err)
    }

    var md5sum [md5.Size]byte
    copy(md5sum[:], hash.Sum(nil)) // Sum(nil) returns exactly md5.Size (16) bytes

// 3.35s user 105.20s system 213% cpu 50.848 total, memory usage is ~30MB


With this code:

    data, err := ioutil.ReadFile(path)
    if err != nil {
      fmt.Println(err)
    }

    md5sum := md5.Sum(data)

// 3.10s user 31.75s system 104% cpu 33.210 total, memory usage is ~1.52GB

The memory usage makes sense, but why is the streaming version so much 
slower: ~3x the system time (105.20s vs 31.75s) and ~1.5x the wall clock 
(50.8s vs 33.2s) compared to the read-the-entire-file-into-memory version? 
This trade-off doesn't make sense to me, since the file is read from disk 
in both cases, which should be the limiting factor, and then the md5sum is 
computed either way.

In the streaming version there is an extra copy from []byte to [16]byte, 
but that should be negligible.
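
One thing that might matter: as far as I can tell, io.Copy falls back to an 
internal 32KB buffer here (neither md5's digest nor *os.File implements the 
interfaces that would bypass it), so each ~3MB file takes on the order of a 
hundred read calls instead of one big read. If that overhead is the culprit, 
a middle ground like this untested sketch (needs Go 1.5+ for io.CopyBuffer) 
should close most of the gap while keeping memory flat:

    // Untested sketch: stream the file through MD5 with a 1MB buffer
    // instead of io.Copy's default 32KB, so memory stays bounded per
    // file but there are far fewer read syscalls.
    buf := make([]byte, 1<<20)

    hash := md5.New()
    if _, err := io.CopyBuffer(hash, file, buf); err != nil {
      fmt.Println(err)
    }

    var md5sum [md5.Size]byte
    copy(md5sum[:], hash.Sum(nil))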

The only theory I can think of is context switching:

streaming version:
disk -> processor
The processor waits on each small disk read, so the runtime parks that 
goroutine and switches to one reading another file, putting this one to 
sleep.

entire-file version:
disk -> memory -> processor
Once the file is in memory, the hash can run without blocking, so there is 
not as much context switching.
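
If it's instead seek thrashing from too many goroutines hitting the disk at 
once, capping how many of them read concurrently should narrow the gap. A 
rough sketch of what I mean (paths, hashFile, and the limit of 4 are 
placeholders, not from my actual code):

    // Rough sketch: use a buffered channel as a semaphore so only a
    // few goroutines read from disk at the same time.
    sem := make(chan struct{}, 4) // cap picked arbitrarily for testing

    var wg sync.WaitGroup
    for _, path := range paths {
      wg.Add(1)
      go func(path string) {
        defer wg.Done()
        sem <- struct{}{}        // acquire a read slot
        defer func() { <-sem }() // release it when done
        hashFile(path)           // placeholder for either hashing variant above
      }(path)
    }
    wg.Wait()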

What do you think? Thanks!
