Thanks for the detailed reply. I'm now wondering about the deduplication. Do you have a sense of what it would take, from a code perspective, to split an incoming object into blocks so that there is block-level deduplication, or even variable-block (content-defined) deduplication? How hard would this be to implement? Any feedback here would be much appreciated.
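
To make the variable-block idea concrete, here is the sort of thing I have in mind: a toy content-defined chunker plus a hash-keyed block store. This is purely my own sketch to illustrate the technique, nothing from the s3ql codebase, and the window size, mask, and chunk bounds are arbitrary values I picked for the example:

    import hashlib

    # All parameters below are arbitrary illustration values, not s3ql settings.
    WINDOW = 48            # bytes in the rolling-hash window
    MASK = (1 << 13) - 1   # cut when hash & MASK == MASK (~8 KiB average chunk)
    MIN_CHUNK = 2 * 1024
    MAX_CHUNK = 64 * 1024
    BASE = 257
    MOD = (1 << 61) - 1
    POW = pow(BASE, WINDOW - 1, MOD)  # factor of the byte sliding out of the window

    def chunks(data):
        """Yield content-defined chunks: cut where the rolling hash of the
        last WINDOW bytes matches MASK, within MIN_CHUNK/MAX_CHUNK bounds."""
        start, h, n = 0, 0, 0          # chunk start, rolling hash, bytes in window
        for i, byte in enumerate(data):
            if n == WINDOW:
                h = (h - data[i - WINDOW] * POW) % MOD  # drop the oldest byte
            else:
                n += 1
            h = (h * BASE + byte) % MOD                 # add the new byte
            size = i + 1 - start
            if (size >= MIN_CHUNK and (h & MASK) == MASK) or size >= MAX_CHUNK:
                yield data[start:i + 1]
                start, h, n = i + 1, 0, 0               # restart window after a cut
        if start < len(data):
            yield data[start:]

    def dedup_write(data, store):
        """Store each chunk once, keyed by its SHA-256; return the chunk refs."""
        refs = []
        for c in chunks(data):
            digest = hashlib.sha256(c).hexdigest()
            store.setdefault(digest, c)   # identical chunks are stored only once
            refs.append(digest)
        return refs

    store = {}
    refs = dedup_write(b"\x00" * (4 * 1024 * 1024), store)
    print(len(refs), "chunks,", len(store), "stored")   # 64 chunks, 1 stored

The appeal, as I understand it, is that cut points depend only on local content, so an insert in the middle of a file only reshuffles the chunks around the edit; with fixed-size blocks everything downstream of the edit shifts and stops deduplicating. I imagine the hard part in s3ql would be the metadata side (mapping inodes to chunk lists rather than fixed block indices) rather than the chunking itself, but you would know better.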
Mike

On Tuesday, September 30, 2014 7:05:44 PM UTC-7, Nikolaus Rath wrote:
>
> [email protected] writes:
> > Hi,
> >
> > I have recently tried out s3ql on Debian testing, and I have a few
> > questions.
> >
> > I'm using s3ql with local storage, without encryption or compression.
> > I set threads to 1 as a baseline
> [...]
> > I find that when I specify cachesize manually to be small or zero, my
> > write throughput goes down by several orders of magnitude. Is using
> > no cache unsupported?
>
> Yes, this is not supported. You are right that if the backend storage
> is a local disk, this could be made to work. However, S3QL was designed
> for network storage; the "local" storage backend was added for use with
> a network file system (like sshfs) and for testing, not as an efficient
> method to utilize your local disk.
>
> In theory, there are several optimizations one could implement with the
> local backend (not requiring a cache being one of them). However, I
> don't think this is worth it. Even with additional optimizations,
> there'd be little reason not to use e.g. dm-crypt with btrfs to get
> very similar features with orders of magnitude better performance.
>
> > I don't mind a small performance loss, but when I use a zero cache
> > size I get throughput of around 50 kilobytes per second, which
> > suggests that I'm running up against an unexpected code path. Read
> > performance is okay even in that case.
>
> I think with zero cache, S3QL probably downloads, updates, uploads and
> removes a cache entry for every single write() call.
>
> > The next thing I'm wondering a lot about is the deduplication. In my
> > test, I'm writing all zeroes. I write a megabyte using one block of a
> > 1MB block size using dd, and then I write 1024 blocks of a kilobyte
> > each. I then also write 2MB or 4MB at a time. I'd expect that
> > deduplication would catch these very trivial cases and that I'd only
> > see one entry of at most 2^n bytes, where 2^n represents the
> > approximate block size of the deduplication.
>
> Yes, this is what should happen.
>
> > I'd also expect 2^n to be smaller than a megabyte (maybe like a
> > single 64k block).
>
> That's probably not the case. S3QL de-duplicates on the level of
> storage objects. You specify the maximum storage object size at
> mkfs.s3ql time with the --blocksize option, and the default is 10 MB.
>
> To see de-duplication in action, you either need to write more data, or
> you need to write smaller, but identical files:
>
> $ echo hello, world > foo
> $ echo hello, world > bar
>
> ...in this case s3ql will store only one storage object (containing
> "hello, world") in the backend.
>
> Best,
> -Nikolaus
>
> --
> GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
> Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F
>
> »Time flies like an arrow, fruit flies like a Banana.«
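
P.S. To check that I'm reading the current behaviour right: the storage-object-level scheme described above is, in effect, the following (again just a toy model of mine, not s3ql's actual code), which would be why the two "hello, world" files end up as a single stored object:

    import hashlib

    def put_object(obj, store):
        """Whole-object de-duplication: one stored copy per distinct digest."""
        digest = hashlib.sha256(obj).hexdigest()
        store.setdefault(digest, obj)
        return digest

    store = {}
    ref_foo = put_object(b"hello, world\n", store)  # contents of foo
    ref_bar = put_object(b"hello, world\n", store)  # contents of bar
    assert ref_foo == ref_bar and len(store) == 1   # one object in the backend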
