[cc Jann - you love this stuff]
> On Jul 10, 2018, at 3:44 PM, David Howells <[email protected]> wrote:
>
> Provide an fsopen() system call that starts the process of preparing to
> create a superblock that will then be mountable, using an fd as a context
> handle. fsopen() is given the name of the filesystem that will be used:
>
> int mfd = fsopen(const char *fsname, unsigned int flags);
This is great in principle, but I think you’re seriously playing with fire with
the API.
>
> where flags can be 0 or FSOPEN_CLOEXEC.
>
> For example:
>
> sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> write(sfd, "s /dev/sdb1"); // note I'm ignoring write's length arg
Imagine some malicious program passes sfd as stdout to a setuid program, and that
program gets persuaded to write “s /etc/shadow”. What happens? You’re okay only as
long as *every single fs* gets the permission checks right, but that’s asking a lot.
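To make the stdout scenario concrete, here is a minimal userspace sketch. It is an illustration only: a pipe stands in for the fsopen() fd so that it runs on any kernel, and victim_main() is a hypothetical stand-in for a setuid binary that a real attacker would execve(). All function names are invented for the example.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for a setuid program: it only ever writes to its own stdout. */
static void victim_main(void)
{
    const char *msg = "s /etc/shadow";
    ssize_t n = write(STDOUT_FILENO, msg, strlen(msg));

    _exit(n == (ssize_t)strlen(msg) ? 0 : 1);
}

/*
 * Attacker side: make ctx_fd the child's stdout, then run the victim.
 * Returns the victim's exit status, or -1 on error.
 */
static int run_victim_with_stdout(int ctx_fd)
{
    int status;
    pid_t pid;

    assert(ctx_fd >= 0);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (dup2(ctx_fd, STDOUT_FILENO) < 0)
            _exit(2);
        victim_main();  /* a real attack would execve() a setuid binary here */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* A pipe plays the role of the fs context fd; returns the bytes it received. */
ssize_t demo_stdout_handoff(char *buf, size_t len)
{
    int p[2];
    ssize_t n;
    int rc;

    if (pipe(p) < 0)
        return -1;
    rc = run_victim_with_stdout(p[1]);
    close(p[1]);
    n = (rc == 0) ? read(p[0], buf, len) : -1;
    close(p[0]);
    return n;
}
```

The victim believes it is printing to stdout, but its bytes land in whatever fd the attacker chose. If that fd's write() handler acts on the data with the writer's privileges, the attacker has borrowed those privileges.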
> write(sfd, "o noatime");
> write(sfd, "o acl");
> write(sfd, "o user_attr");
> write(sfd, "o iversion");
> write(sfd, "o ");
> write(sfd, "r /my/container"); // root inside the fs
> write(sfd, "x create"); // create the superblock
From a cursory inspection of a bunch of the code, I think the expectation is that
the actual device access happens in the “x” action. This is not okay. You can’t
do this kind of thing in a write() handler unless you somehow make every single
access use f_cred (the opener’s credentials, not the writer’s), which is a real
pain.
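A toy userspace model of the credential problem, not kernel code. The struct cred and struct file below are drastically simplified stand-ins for the kernel structures, and may_open_device() is a made-up policy (only uid 0 may touch the device), assumed purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy credentials: just a uid. Real kernel creds carry far more. */
struct cred { unsigned int uid; };

/* Toy open file: remembers who opened it, in the spirit of file->f_cred. */
struct file { struct cred f_cred; };

/* Made-up policy for the sketch: only uid 0 may open the block device. */
static bool may_open_device(const struct cred *c)
{
    return c->uid == 0;
}

/* Unsafe: the decision uses whoever happens to be calling write(). */
bool write_checks_writer(const struct file *f, const struct cred *writer)
{
    (void)f;
    return may_open_device(writer);
}

/* Safer: the decision uses whoever opened the fd. */
bool write_checks_opener(const struct file *f, const struct cred *writer)
{
    (void)writer;
    return may_open_device(&f->f_cred);
}

/* An unprivileged opener hands the fd to a privileged writer. */
void demo_confused_deputy(void)
{
    struct file ctx = { .f_cred = { .uid = 1000 } };  /* opened by a normal user */
    struct cred setuid_prog = { .uid = 0 };           /* written to by a root process */

    assert(write_checks_writer(&ctx, &setuid_prog));   /* device access granted */
    assert(!write_checks_opener(&ctx, &setuid_prog));  /* device access refused */
}
```

The second variant is only safe if every access reachable from the handler is routed through the opener's credentials, which is the "real pain" part.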
I think the right solution is one of:

(a) Pass a netlink-formatted blob to fsopen() and do the whole thing in one
syscall. I don’t mean using netlink sockets, just the nlattr format. Or you
could use a different format. The part that matters is using just one syscall
to do the whole thing.

(b) Keep the current structure but use a new syscall instead of write().

(c) Keep using write() but literally just buffer the data. Then have a new
syscall to commit it. In other words, replace “x” with a syscall and call all
the fs_context_operations helpers in that context instead of from write().
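For option (a), here is a rough userspace sketch of what building such a blob could look like. The header mirrors the layout of struct nlattr (a 16-bit length covering header plus payload, a 16-bit type, and 4-byte alignment), but the FSCFG_* attribute types and both function names are invented for this example and are not part of any real or proposed API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as struct nlattr: length of header + payload, then type. */
struct attr_hdr {
    uint16_t len;
    uint16_t type;
};

/* Attributes start on 4-byte boundaries. */
#define ATTR_ALIGN(n) (((n) + 3) & ~(size_t)3)

/* Hypothetical attribute types, invented for this sketch. */
enum { FSCFG_SOURCE = 1, FSCFG_OPTION = 2, FSCFG_ROOT = 3 };

/*
 * Append one string attribute (stored without its NUL) at *off.
 * Returns 0 on success and advances *off, or -1 if it does not fit.
 */
int put_attr(unsigned char *buf, size_t cap, size_t *off,
             uint16_t type, const char *val)
{
    size_t vlen, raw, padded;
    struct attr_hdr h;

    assert(buf && off && val);
    vlen = strlen(val);
    raw = sizeof(h) + vlen;
    padded = ATTR_ALIGN(raw);
    if (raw > UINT16_MAX || *off > cap || padded > cap - *off)
        return -1;

    h.len = (uint16_t)raw;
    h.type = type;
    memcpy(buf + *off, &h, sizeof(h));
    memcpy(buf + *off + sizeof(h), val, vlen);
    memset(buf + *off + raw, 0, padded - raw);  /* zero the padding */
    *off += padded;
    return 0;
}

/* Build the whole configuration up front; returns its size, or 0 on error. */
size_t build_example(unsigned char *buf, size_t cap)
{
    size_t off = 0;

    if (put_attr(buf, cap, &off, FSCFG_SOURCE, "/dev/sdb1") ||
        put_attr(buf, cap, &off, FSCFG_OPTION, "noatime") ||
        put_attr(buf, cap, &off, FSCFG_ROOT, "/my/container"))
        return 0;
    return off;
}
```

The point of the design is that the complete configuration reaches the kernel in a single syscall made by the process that wants the mount, so there is no window in which a different process can feed data to the context.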