On Saturday, 6 May 2017 at 07:21:51 UTC, Johannes Pfau wrote:
Am Mon, 01 May 2017 21:01:43 +0000
schrieb notna <notna.remove.t...@ist-einmalig.de>:
Hi Dlander's.
Found some interesting reads ([1] [2] [3]) about the $SUBJECT
and wonder if there is anything available in the Dland?!
If yes, pls. share.
If not, how could it be done (D'ish)
[1] -
https://moinakg.wordpress.com/2013/06/22/high-performance-content-defined-chunking/
-
https://github.com/moinakg/pcompress/blob/master/rabin/rabin_dedup.c
[2] -
https://restic.github.io/blog/2015-09-12/restic-foundation1-cdc
[3] - http://www.infoarena.ro/blog/rolling-hash
Thanks & regards
Interesting concept. I'm not aware of any D implementation but
it shouldn't be difficult to implement this in D:
https://en.wikipedia.org/wiki/Rolling_hash#Cyclic_polynomial
There's a BSD licensed haskell implementation, so a BSD
licensed port
would be very easy to implement:
https://hackage.haskell.org/package/optimal-blocks-0.1.0
https://hackage.haskell.org/package/optimal-blocks-0.1.0/docs/src/Algorithm-OptimalBlocks-BuzzHash.html
To make an implementation D'ish it could integrate with either
std.digest or process input ranges. If you want to use it
exclusively for chunking your code can be more efficient
(process InputRange until a boundary condition is met). When
using input ranges, prefer some kind of buffered approach,
Range!ubyte[] instead of Range!ubyte for better performance.
If you really want the rolling hash value for each byte in a
sequence this will be less efficient as you'll have to enter
data byte-by-byte. In this case it's extremely important for
performance that your function can be inlined, so use templates:
ubyte[] data;
foreach(b; data)
{
// This needs to be inlined for performance reasons
rollinghash.put(b);
}
-- Johannes
Thanks for the feedback, Johannes.
I hoped there may already be something in Mir or Weka.io or
somewhere else... Will read the Golang, C and C++ source and see
if my Dlang is good enough for ranges and the like magic...