On Fri, Oct 13, 2006 at 05:14:00PM -0700, Nishanth Aravamudan wrote:
> On 13.10.2006 [12:47:25 -0700], Nishanth Aravamudan wrote:
> > On 13.10.2006 [12:02:44 +1000], David Gibson wrote:
> > > Nish, or someone else who understands the sharing implementation:
> > > 
> > > Can you explain to me the need for the hugetlbd?  i.e. why is it
> > > necessary to have a daemon to manage the shared mappings, rather than
> > > just using a naming convention within the hugetlbfs to allow the
> > > libhugetlbfs instances to directly find the right shared file?
> > 
> > This is a very good question, and one I don't have the best answer to
> > right now. As I thought of various reasons that using a naming
> > convention *wouldn't* work, I immediately came up with ways to work
> > around issues.
> > 
> > Let me work up a patch that might get rid of *a lot* of code and see
> > if I can't test it a bit. Give me a few days, though.
> 
> Hrm, some Fridays are better than others, apparently.
> 
> Description: Remove all of the daemonized sharing code and replace it
> with a filename-based approach. The general flow of the new sharing code
> is now:

Yay!

> 1) First sharer (aka preparer) will open() the hugetlbfs file with
> O_EXCL|O_CREAT. All other sharers will fail this open and open the file
> O_RDONLY only.
> 2) The preparer will then LOCK_EX flock() the file, to cause the other
> sharers to block on their flock() calls, as they try to use LOCK_SH.

Ok, do we have a small race here if we get:
        CPU A                   CPU B
        open(O_EXCL)
                                open(O_RDONLY)
                                flock(LOCK_SH)
        flock(LOCK_EX)

We may need some atomic renames to work around this.

> 3) The preparer then prepares the file; that is, it copies the data as
> before. If that fails, the preparer unlinks the file.
> 4) Regardless of the success of the prepare, the preparer will LOCK_SH
> and LOCK_UN the file, to trigger all the other sharers to continue. If
> the prepare failed, all the sharers will try to access() the file and
> fail, and fall back to unlinked files. If the file does exist, though,
> it is assumed to be safe to use as a copy of the segment.
> 
> This seems noticeably faster than the hugetlbd method. The files in the
> hugetlbfs mount point are uniquely named as:
> 
> $(executable name)_$(gid of user)_$(word size)_$(program header)
> 
> Thus, sharing is only possible between members of the same group
> (security concern), and 32-bit executables won't share the same file as
> 64-bit executables. The program header is what differentiates between
> multiple segments of the same executable.
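For reference, my reading of that naming scheme as a sketch (the format string and helper are guesses at what the patch does, not its actual code):

```c
/* Sketch of the quoted naming convention:
 * $(executable name)_$(gid of user)_$(word size)_$(program header).
 * Names and the exact format are illustrative guesses. */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Build <mount>/<exe>_<gid>_<wordsize>_<phdr>; -1 on truncation. */
int share_path(char *buf, size_t len, const char *mount,
	       const char *exe, gid_t gid, int wordsize,
	       unsigned long phdr_num)
{
	int n = snprintf(buf, len, "%s/%s_%u_%d_%lu",
			 mount, exe, (unsigned)gid, wordsize, phdr_num);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```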

Ok, looking pretty good.  A few minor concerns for later revisions:
        - We may need to escape the '/' characters in the executable's
full path, or otherwise be careful that our identifier for the
executable really is unique enough.

        - There's still a security concern here - someone who's not in
the right group could drop in a file named with someone else's gid.
Bad.  I think a better approach would be to use per-group or per-user
subdirectories within the hugetlbfs, which can be made writable by
only the right people.  In any case, having an environment-configurable
place for the cached hugepage files, rather than dumping them all in
the top of the first hugetlbfs we find, isn't a bad idea.

        It's perhaps worth noting that the current daemon
implementation also has a gaping security hole, since there's a window
where the file is world-writable.  And even if that's fixed, the
daemon doesn't actually enforce that the first preparer puts the
correct data into the hugetlbfs file.

        - For ease of management, it may be good to pack all the
segments of an executable into the one hugetlbfs file at different
offsets.  (I've been meaning to get around to that for the unlinked
file case since way back.)
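The per-group subdirectory idea above could look something like this sketch (the layout and helper are hypothetical; in practice a privileged setup tool would do the chown step, since ordinary users can only chgrp to groups they belong to):

```c
/* Sketch: per-group cache subdirectories under the hugetlbfs mount,
 * writable only by members of that group.  Layout is illustrative. */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Build and create <mount>/<gid>/, mode 0770, owned by that group. */
int group_cache_dir(const char *mount, gid_t gid, char *buf, size_t len)
{
	if (snprintf(buf, len, "%s/%u", mount, (unsigned)gid) >= (int)len)
		return -1;
	if (mkdir(buf, 0770) < 0 && errno != EEXIST)
		return -1;
	/* Handing the directory to the group normally needs privilege;
	 * tolerate EPERM so unprivileged callers still get a directory. */
	if (chown(buf, (uid_t)-1, gid) < 0 && errno != EPERM)
		return -1;
	return 0;
}
```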

> Compile and run-tested via make func on ppc64 and x86_64.
> 
> This is still an RFC, though, as I'm not sure of a good way to remove
> the files from the hugetlbfs mount point, without having effectively a
> daemon very similar to hugetlbd, but whose only job is to track atimes
> of files in the hugetlbfs mount point and unlink "old" ones (as
> specified at daemon start time). If that seems like a reasonable
> approach, I will modify hugetlbd accordingly.

No need for a daemon; we can just run a cleanup script from cron.
Actually, I think it might be a good idea to have a "hugetlbtool" or
somesuch program which could do the cleanup, or could be used to
prepopulate the hugetlbfs cache with a specified executable.  The last
option could be particularly useful if we ever encounter a 32-bit
program big enough that we run out of address space in the phase where
both the original mappings and the hugepage SHARED mappings exist
simultaneously.
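The cron-driven cleanup could be as simple as this sketch (the atime threshold and directory layout are assumptions; "hugetlbtool" itself doesn't exist yet):

```c
/* Sketch of a cron-driven cache cleanup: unlink regular files in the
 * cache directory whose atime is older than max_age seconds. */
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Returns the number of files removed, or -1 on error. */
int clean_cache(const char *dir, time_t max_age)
{
	DIR *d = opendir(dir);
	if (!d)
		return -1;
	time_t now = time(NULL);
	int removed = 0;
	struct dirent *de;
	char path[4096];
	while ((de = readdir(d)) != NULL) {
		if (de->d_name[0] == '.')
			continue;	/* skip "." and ".." */
		snprintf(path, sizeof path, "%s/%s", dir, de->d_name);
		struct stat st;
		if (stat(path, &st) < 0 || !S_ISREG(st.st_mode))
			continue;
		if (now - st.st_atime > max_age && unlink(path) == 0)
			removed++;
	}
	closedir(d);
	return removed;
}
```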

Or there's the hacky way, which will actually get the right result in
the cases we're interested in: have the preparer unlink the file a few
minutes after it's created.  AFAIK our actual use cases are things
like Oracle, where all the sharing PIDs will be started at roughly the
same time.
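That hack could be as little as this sketch (the helper and grace period are illustrative); existing mappings survive the unlink, only latecomers miss the file:

```c
/* Sketch of the "hacky" delayed unlink: the preparer forks a helper
 * that removes the cache file after a grace period (illustrative). */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

void unlink_after(const char *path, unsigned grace_seconds)
{
	pid_t pid = fork();
	if (pid == 0) {
		/* Child: wait out the grace period, then remove the
		 * file.  Sharers that already mmap()ed it keep their
		 * mappings; the pages go away with the last unmap. */
		sleep(grace_seconds);
		unlink(path);
		_exit(0);
	}
	/* Parent: nothing to do.  If the fork failed, a cron cleanup
	 * (or an admin) can still remove the file later. */
}
```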

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson

_______________________________________________
Libhugetlbfs-devel mailing list
Libhugetlbfs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
