Oh, I see. Well, if you must read and hash every byte of every file, then you 
really are mostly measuring device speed.

 

From: <golang-nuts@googlegroups.com> on behalf of Sri G 
<sriakhil.gogin...@gmail.com>
Date: Sunday, October 16, 2016 at 12:17 PM
To: golang-nuts <golang-nuts@googlegroups.com>
Cc: <sriakhil.gogin...@gmail.com>
Subject: Re: [go-nuts] Re: Duplicate File Checker Performance

 

This isn't exactly the same because I deleted some files but it shouldn't 
really matter.  

 

Switched to MD5:

 

--- a/dup.go
+++ b/dup.go
@@ -3,7 +3,7 @@ package main
-       "crypto/sha256"
+       "crypto/md5"

@@ -207,8 +207,8 @@ func main() {
+type Hash [16]byte // appropriate for MD5
+// type Hash [32]byte // appropriate for SHA-256

 func hashFile(p string, hash []byte, prefix int64) (count int64) {
@@ -221,8 +221,8 @@ func hashFile(p string, hash []byte, prefix int64) (count int64) {
+       hasher := md5.New() // select MD5 in concert with "Hash" above
+       // hasher := sha256.New() // select SHA-256 in concert with "Hash" above

 

 

Checking only same-sized files is a huge speedup (82x fewer bytes checked):

 

2016/10/16 14:33:51      total:      566 files ( 100.00%),    1667774744 bytes ( 100.00%)
2016/10/16 14:33:51   examined:        9 files (   1.59%),      20271440 bytes (   1.22%) in 0.4209 seconds
2016/10/16 14:33:51 duplicates:        9 files (   1.59%),      20271440 bytes (   1.22%)
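
For readers following along, the same-size filter can be sketched roughly as below. This is a minimal illustration, not the code from Dup: the fileInfo type and the candidatesBySize function are hypothetical names, and the sizes are assumed to come from whatever directory walk the program already does.

```go
package main

import "fmt"

// fileInfo is a hypothetical record for one file found during the walk.
type fileInfo struct {
	Path string
	Size int64
}

// candidatesBySize groups files by size and keeps only the groups with at
// least two members. A file with a unique size cannot have a duplicate, so
// it never needs to be opened or hashed.
func candidatesBySize(files []fileInfo) map[int64][]string {
	bySize := make(map[int64][]string)
	for _, f := range files {
		bySize[f.Size] = append(bySize[f.Size], f.Path)
	}
	for size, paths := range bySize {
		if len(paths) < 2 {
			delete(bySize, size) // unique size: skip this file entirely
		}
	}
	return bySize
}

func main() {
	files := []fileInfo{
		{"a.bin", 4096}, {"b.bin", 4096}, {"c.txt", 120}, {"d.txt", 7},
	}
	// Only the two 4096-byte files survive the filter.
	for size, paths := range candidatesBySize(files) {
		fmt.Println(size, paths)
	}
}
```

Only the surviving groups need any disk reads at all, which is where the reduction in bytes examined comes from.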

 

Checking the first 4KB of each file and only hashing the whole file if those 
match is another cool optimization (on average 768x fewer bytes checked in my 
case). Really nice, Michael.
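
A rough sketch of the 4KB prefix check, assuming a direct byte comparison of the two prefixes (Dup itself may hash the prefix instead; this is not its code). The names readPrefix, samePrefix and writeTemp are made up for the example, and it needs Go 1.16 or later for io.ReadAll and os.CreateTemp.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

const prefixLen = 4096 // first 4KB, as described above

// readPrefix returns up to n leading bytes of the file at path.
func readPrefix(path string, n int64) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	return io.ReadAll(io.LimitReader(f, n))
}

// samePrefix reports whether two files agree on their first n bytes.
// Only when it returns true is a full-file hash worth computing.
func samePrefix(a, b string, n int64) (bool, error) {
	pa, err := readPrefix(a, n)
	if err != nil {
		return false, err
	}
	pb, err := readPrefix(b, n)
	if err != nil {
		return false, err
	}
	return bytes.Equal(pa, pb), nil
}

// writeTemp is a demo helper: it writes content to a new temporary file
// and returns the file's path.
func writeTemp(content string) string {
	f, err := os.CreateTemp("", "dupdemo-*")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if _, err := f.WriteString(content); err != nil {
		panic(err)
	}
	return f.Name()
}

func main() {
	a := writeTemp("header-A payload")
	b := writeTemp("header-B payload")
	defer os.Remove(a)
	defer os.Remove(b)

	same, err := samePrefix(a, b, prefixLen)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("prefixes match:", same)
}
```

Two files of the same size that differ anywhere in their first 4KB are rejected after a single small read each, so the expensive full read only happens for likely duplicates.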

 

With workers = 8:

 

RAID10: 0.05s user 0.04s system 37% cpu 0.231 total (couldn't check memory 
usage, but it's probably negligible)

SSD:    0.05s user 0.04s system 59% cpu 0.137 total

 

Since SSDs and my filesystem are optimized for 4K random reads, it makes sense 
to use multiple threads/goroutines.

 

Optimal # of workers=9 on RAID 10: 0.05s user 0.04s system 40% cpu 0.220 total

on SSD workers = 8~9:              0.04s user 0.04s system 68% cpu 0.117 total

 

Not so much when you're doing a full sequential read. Because I use the MD5 for 
other purposes, the entire file must be hashed, so sadly I can't use these 
optimizations.
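
As an illustration of the "workers = 8" setting discussed above, here is a minimal fixed-size worker pool. It is a sketch under assumptions, not the Dup implementation: md5File, hashAll and writeTemp are hypothetical names, and unreadable files are simply skipped.

```go
package main

import (
	"crypto/md5"
	"fmt"
	"io"
	"os"
	"sync"
)

// md5File returns the MD5 digest of the entire file at path.
func md5File(path string) ([md5.Size]byte, error) {
	var sum [md5.Size]byte
	f, err := os.Open(path)
	if err != nil {
		return sum, err
	}
	defer f.Close()
	h := md5.New()
	if _, err := io.Copy(h, f); err != nil {
		return sum, err
	}
	copy(sum[:], h.Sum(nil))
	return sum, nil
}

// hashAll hashes every path using a fixed number of worker goroutines.
// Files that cannot be read are left out of the result.
func hashAll(paths []string, workers int) map[string][md5.Size]byte {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	results := make(map[string][md5.Size]byte)
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				sum, err := md5File(p)
				if err != nil {
					continue
				}
				mu.Lock()
				results[p] = sum
				mu.Unlock()
			}
		}()
	}
	for _, p := range paths {
		jobs <- p
	}
	close(jobs)
	wg.Wait()
	return results
}

// writeTemp is a demo helper: it writes content to a new temporary file
// and returns the file's path.
func writeTemp(content string) string {
	f, err := os.CreateTemp("", "dupdemo-*")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if _, err := f.WriteString(content); err != nil {
		panic(err)
	}
	return f.Name()
}

func main() {
	paths := []string{writeTemp("alpha"), writeTemp("alpha"), writeTemp("beta")}
	for _, p := range paths {
		defer os.Remove(p)
	}
	for p, sum := range hashAll(paths, 8) {
		fmt.Printf("%x  %s\n", sum, p)
	}
}
```

The worker count caps how many files are read at once, which is the knob being tuned in the timings above; the best value depends on the device, so it is worth measuring rather than assuming.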


On Sunday, October 16, 2016 at 1:26:24 PM UTC-4, Michael Jones wrote:

Sri G,

 

How does this time compare to my “Dup” program? I can’t test for you…since it 
is your filesystem…but I thought I had it going about as fast as possible a few 
years ago when I wrote that one.

 

https://github.com/MichaelTJones/dup

 

Michael

 

From: <golan...@googlegroups.com> on behalf of Sri G <sriakhil...@gmail.com>
Date: Saturday, October 15, 2016 at 6:46 PM
To: golang-nuts <golan...@googlegroups.com>
Subject: [go-nuts] Re: Duplicate File Checker Performance

 

Thanks. Made the Go code similar to the Python by using CopyBuffer with a block 
size of 65536. 

 

    buf := make([]byte, 65536) // 64 KiB copy buffer, same block size as the Python version

    if _, err := io.CopyBuffer(hash, file, buf); err != nil {
        fmt.Println(err)
    }

 

It didn't make much of a difference; it was only slightly faster.
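
For anyone who wants to try the CopyBuffer approach in isolation, a self-contained version might look like the following. The names md5Hex and writeTemp are hypothetical and not from the original program. One caveat from the io.CopyBuffer documentation: the buffer is not used when the source implements WriterTo or the destination implements ReaderFrom, so whether the 64 KiB buffer actually takes effect can depend on the types involved.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// md5Hex hashes the whole file at path through a copy buffer of bufSize
// bytes and returns the digest as a hex string.
func md5Hex(path string, bufSize int) (string, error) {
	if bufSize <= 0 {
		bufSize = 65536 // io.CopyBuffer panics on a zero-length buffer
	}
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := md5.New()
	buf := make([]byte, bufSize)
	if _, err := io.CopyBuffer(h, f, buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// writeTemp is a demo helper: it writes content to a new temporary file
// and returns the file's path.
func writeTemp(content string) string {
	f, err := os.CreateTemp("", "dupdemo-*")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if _, err := f.WriteString(content); err != nil {
		panic(err)
	}
	return f.Name()
}

func main() {
	p := writeTemp("hello world")
	defer os.Remove(p)

	sum, err := md5Hex(p, 65536)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(sum) // 5eb63bbbe01eeed093cb22bb8f5acdc3
}
```

Returning the error instead of printing it lets the caller decide whether an unreadable file should abort the run or just be skipped.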

 

What got it to the same place was running ComputeHash in the same goroutine as 
the Walk function, rather than in its own goroutine for each file:

 

-    go ComputeHash(path, info, queue, wg)
+    ComputeHash(path, info, queue, wg)

 

 

2.88s user 0.98s system 23% cpu 16.086 total, memory usage ~ 7MB

Here's the before and after pprof webs:

 

BEFORE with 'go ComputeHash(...)':

 

 

 

AFTER with 'ComputeHash(...)':

 

 

 

Since disk reads are SO much slower, computing the hash for each file in its 
own goroutine caused a huge slowdown.

 

btw this is on a RAID10, with SSD: 

 

Old code SSD: 3.31s user 17.87s system 244% cpu 8.667 total

 

New code SSD: 2.88s user 0.84s system 69% cpu 5.369 total

 

Shows you can throw hardware at a problem BUT the old code locks up my system 
momentarily..

 


On Saturday, October 15, 2016 at 3:27:38 PM UTC-4, Kevin Malachowski wrote:

Sorry, I meant that calling Write on the hash type might be slower if it's 
called more often.

(I'm on mobile right now. When I get back to a keyboard I'll try to come up 
with an example)

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
