Hi Mike,

When replying to emails on this list, please do not put your reply
above the quoted text, and do not quote the entire message you are
replying to. Both practices make it unnecessarily hard for other
readers to follow the context of your email. Instead, please trim the
quoted parts that are not relevant to your reply, and insert your
responses directly after the points you are responding to (as I have
done below). Thanks!

Mike <[email protected]> writes:
>> > The next thing I'm wondering a lot about is the deduplication.  In my 
>> > test, I'm writing all zeroes.  I write a megabyte with dd as a single 
>> > 1 MB block, and then I write 1024 blocks of a kilobyte 
>> > each.  I then also write 2MB or 4MB at a time.  I'd expect that 
>> > deduplication would catch these very trivial cases and that I'd only 
>> > see one entry of at most 2^n bytes, where 2^n represents the 
>> > approximate block size of the deduplication. 
>>
>> Yes, this is what should happen. 
>>
>> > I'd also expect 2^n to be smaller than a megabyte (maybe like a single 
>> > 64k block). 
>>
>> That's probably not the case. S3QL de-duplicates on the level of storage 
>> objects. You specify the maximum storage object size at mkfs.s3ql time 
>> with the --blocksize option, and the default is 10 MB. 
>>
>> To see de-duplication in action, you either need to write more data, or 
>> you need to write smaller, but identical files: 
>>
>> $ echo hello, world > foo 
>> $ echo hello, world > bar 
>>
>> ...in this case, S3QL will store only one storage object (containing 
>> "hello, world") in the backend. 
>
> Thanks for the detailed reply.  I'm now wondering about the
> deduplication.  Do you have an impression of what it would take, from a
> code perspective, to split an incoming object into blocks so that there
> is block-level deduplication, or even variable-block deduplication?
> How hard would that be to implement?

Not sure what you mean. As I said above, S3QL already splits files into
blocks and does block-based deduplication. The blocks are just bigger
than you thought.
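
For illustration (this is not S3QL's actual code, just a minimal sketch
of the general technique), block-based deduplication amounts to keying
storage objects by a hash of their content:

    import hashlib

    BLOCKSIZE = 10 * 1024 * 1024  # the mkfs.s3ql --blocksize default
    store = {}                    # hash -> block; stand-in for the backend

    def write_file(data):
        """Split data into fixed-size blocks, storing each distinct
        block only once."""
        refs = []
        for off in range(0, len(data), BLOCKSIZE):
            block = data[off:off + BLOCKSIZE]
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)  # keep only if unseen
            refs.append(digest)
        return refs  # a file is just a list of block references

Your dd test behaves accordingly: a megabyte of zeroes, whether written
in one call or as 1024 separate writes, ends up as one identical block,
so only a single storage object is kept:

    zeros = b'\0' * (1024 * 1024)
    write_file(zeros)
    write_file(zeros)
    assert len(store) == 1  # one object despite two 1 MB files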


Having variable-sized blocks would be more difficult. I'm not sure
whether the gains would be worth the added code complexity, but that's
hard to tell without a concrete patch to look at.
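
To make the variable-block idea concrete, here is a rough sketch of
content-defined chunking (again, nothing from the S3QL codebase; the
rolling hash, window size, and cut mask are arbitrary placeholders):

    import collections

    B, WINDOW = 257, 48
    MASK32 = 0xFFFFFFFF
    POW = pow(B, WINDOW, 1 << 32)  # used to drop the oldest window byte

    def cdc_chunks(data, cut_mask=0x0FFF, min_size=2048, max_size=65536):
        """Cut data where a rolling hash over the last WINDOW bytes
        matches cut_mask, so boundaries depend only on local content
        and an insertion does not shift every later chunk."""
        chunks, start, h = [], 0, 0
        win = collections.deque()
        for i, byte in enumerate(data):
            win.append(byte)
            h = (h * B + byte) & MASK32
            if len(win) > WINDOW:
                h = (h - win.popleft() * POW) & MASK32
            size = i - start + 1
            if ((size >= min_size and (h & cut_mask) == cut_mask)
                    or size >= max_size):
                chunks.append(data[start:i + 1])
                start, h = i + 1, 0
                win.clear()
        if start < len(data):
            chunks.append(data[start:])
        return chunks

The resulting chunks would be hashed and stored exactly as before. Note
that on your all-zeroes test this degenerates to fixed cuts at max_size,
since the hash never varies, but the chunks are still identical and
de-duplicate anyway. The real cost is in the metadata: every chunk now
has a variable size and offset that the block index has to track.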

Best,
-Nikolaus

-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

             »Time flies like an arrow, fruit flies like a Banana.«
