Hey Phil, answers inline...

Thanks,
Dan

On Saturday, December 17, 2016, Phil Lello <[email protected]
<javascript:_e(%7B%7D,'cvml','[email protected]');>> wrote:

> Hi all,
>
> Hopefully I haven't missed a recent thread on this subject, and I've
> picked the right list.
>
> I've recently been using ZFS as a storage engine for docker, and am really
> impressed by the way I can instantly spin up a container with 50GB of
> database, thanks to the magic of ZFS clones. This makes my life
> significantly easier, as a database rollback no longer involves heavy
> copying, re-indexing dumps, and similar. So, kudos there.
>
> Whilst this works well for images that get built locally, sharing gets
> more problematic. There are a couple of related issues here, only some of
> which are related to ZFS, but all are listed here so that my chain of
> reasoning is clear.
>
> 1. Docker registry uses compressed tarballs containing changed files
> between layers. That's not too bad, as I could fairly easily hack it to
> extract the layers to regular directories, and generate the compressed
> tarballs on the fly.
>

I'm not sure I understand what you're saying... why can't you use the
tarballs normally? (Are you saying you'd make one filesystem per layer,
with the more-nested layers being clones of the less-nested ones? I think
that would make sense here, but I'm confused about what you're using the
directories for.)
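
For concreteness, here's roughly the layout I'm imagining -- a minimal
sketch, with made-up pool/dataset names and tarball paths:

    # One filesystem per layer; each more-nested layer is a clone of a
    # snapshot of its parent. All names below are hypothetical.
    zfs create -p tank/layers/base
    tar -C /tank/layers/base -xzf layer0.tar.gz
    zfs snapshot tank/layers/base@layer0

    # The next layer starts as a clone, so it shares every unchanged
    # block with its parent for free.
    zfs clone tank/layers/base@layer0 tank/layers/app
    tar -C /tank/layers/app -xzf layer1.tar.gz
    zfs snapshot tank/layers/app@layer1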

> 2. If one big file changes (such as one row in a database table), docker
> registry will store the whole file in the image (which my hacked registry
> would be extracting to a normal tree). Whilst I could use zfs dedup, my
> research indicates this is pretty memory intensive. I'd prefer a mechanism
> where I supply ZFS with the paths to two nearly identical files and it
> dedups the blocks. If this were doable while writing the derived file
> instead of after writing all the data to disk, that'd be even better.
>
>

The solution you're proposing is kind of similar to a feature in ZFS called
"nop write", where overwriting a block in a file with an exact duplicate of
the data that's already there results in no write being issued. It kicks in
automatically as long as the dataset uses a cryptographically strong
checksum (e.g. sha256 -- the default fletcher4 doesn't qualify) and has
compression enabled. This saves space on the receiving side, but you'll
still end up sending a huge amount of unneeded data (and processing it on
the receiving side to know that you can discard the unchanged blocks).
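
To set that up explicitly (the dataset name here is hypothetical):

    # nop write needs a strong checksum plus compression; fletcher4
    # (the default checksum) doesn't qualify.
    zfs set checksum=sha256 tank/registry
    zfs set compression=lz4 tank/registry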

> 3. It'd potentially be really useful to trigger ZFS snapshot/clone
> functionality from within a container. Whilst this is possibly more ZoL
> than OpenZFS, I'm assuming the main work would be a limited device driver
> to filter IOCTLs from within the container to the normal host driver, and
> this would hopefully be similar for BSD jails and Linux containers.
>
>

I'd recommend reaching out to ZoL specifically for this, but I agree with
the area of the code that you think the changes would be in.
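
One related thing worth knowing about: on platforms that already support
delegated administration (illumos and FreeBSD today), "zfs allow" lets an
unprivileged user run a subset of operations without any new driver work.
A sketch, with hypothetical user and dataset names:

    # Delegate snapshot/clone rights on one dataset to an unprivileged
    # user; given access to the dataset, that user can then snapshot and
    # clone it without root.
    zfs allow -u appuser snapshot,clone,mount,create tank/containers/db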

> An alternative to 1 & 2 would potentially be to teach docker and
> docker-registry to negotiate zfs send/recv, but that's a much steeper
> learning curve for me.
>
>

It sounds like you might not want to hear this :-), but using "zfs send" is
really the correct solution to this problem if you're using ZFS. It creates
a "send stream" file (analogous to the tarballs in your current design)
which contains just the changes to the filesystem since the last snapshot.
It's kind of like running a packet capture on "rsync", except that it's
built into the filesystem and doesn't require access to the receiving side
to know what has changed. On the receiving side, instead of un-tarring you
would simply run "zfs receive" on the stream file.
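
A minimal sketch of the incremental flow (snapshot, dataset, and file
names are made up):

    # Send only the blocks that changed between two snapshots.
    zfs snapshot tank/db@image-v1
    # ... run the container, change some rows ...
    zfs snapshot tank/db@image-v2
    zfs send -i @image-v1 tank/db@image-v2 > v1-to-v2.zstream

    # Receiving side (must already have @image-v1):
    zfs receive tank/db < v1-to-v2.zstream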

Unlike dedup, send/receive doesn't place any storage design requirements on
the filesystems you're sending from / receiving into. Unlike tar, it *only*
traverses new blocks, so it will be faster to traverse the filesystem
looking for diffs and will make the amount of data to transfer much smaller
for small changes to big files. It also has the added benefit of migrating
all file metadata along with the files (permissions, modify/create times,
ACLs, ...). It's designed for this exact use case (replicating data from
one system to another).
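
You can also ask send for a dry run to see how small the incremental
stream would be before transferring anything (same hypothetical names as
above):

    # -n: dry run, no stream generated; -v: print the estimated size
    zfs send -nv -i @image-v1 tank/db@image-v2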

It feels like the existing tarball system is trying to replicate what
send/receive does in an FS-independent way, and as a result there are also
a bunch of things it doesn't do quite as nicely. The positive is obviously
that you don't need ZFS to use it :). However, I believe there was a
hackathon project a few years back to support receiving send streams on
(non-ZFS) POSIX filesystems which could help with this.

> Has anyone given consideration to these issues/use case before, and either
> way, is anyone interested in collaboration?
>
> Best wishes,
>
> Phil Lello
>


