On 14.10.2006 [12:45:49 +1000], David Gibson wrote:
> On Fri, Oct 13, 2006 at 05:14:00PM -0700, Nishanth Aravamudan wrote:
> > On 13.10.2006 [12:47:25 -0700], Nishanth Aravamudan wrote:
> > > On 13.10.2006 [12:02:44 +1000], David Gibson wrote:
> > > > Nish, or someone else who understands the sharing implementation:
> > > >
> > > > Can you explain to me the need for the hugetlbd? i.e. why is it
> > > > necessary to have a daemon to manage the shared mappings, rather than
> > > > just using a naming convention within the hugetlbfs to allow the
> > > > libhugetlbfs instances to directly find the right shared file?
> > >
> > > This is a very good question, and one I don't have the best answer to
> > > right now. As I thought of various reasons that using a naming
> > > convention *wouldn't* work, I immediately came up with ways to work
> > > around issues.
> > >
> > > Let me work up a patch that might get rid of *a lot* of code and see
> > > if I can't test it a bit. Give me a few days, though.
> >
> > Hrm, some Fridays are better than others, apparently.
> >
> > Description: Remove all of the daemonized sharing code and replace it
> > with a filename-based approach. The general flow of the new sharing code
> > is now:
>
> Yay!
Thanks :)

> > 1) The first sharer (aka the preparer) will open() the hugetlbfs file
> > with O_EXCL|O_CREAT. All other sharers will fail this open and will
> > open the file O_RDONLY instead.
> > 2) The preparer will then flock() the file with LOCK_EX, to cause the
> > other sharers to block on their flock() calls, as they try to use
> > LOCK_SH.
>
> Ok, do we have a small race here if we get:
>
> 	CPU A			CPU B
> 	open(O_EXCL)
> 				open(O_RDONLY)
> 				flock(LOCK_SH)
> 	flock(LOCK_EX)
>
> We may need some atomic renames to work around this.

Yes, I agree. Would you suggest I create a temporary file (known only to
CPU A's process, as the preparer), then use rename() to change the file
name (after a successful prepare) to the appropriate one? Will the
flock(LOCK_EX) carry across the rename()? Also, is rename() sufficiently
atomic in this context?

Finally, would we then want CPU B (any sharer, really) to loop on the
open() until it succeeds, with an appropriate timeout? Or perhaps,
rather than go the heavy-handed open() route, just wait until
access(file, F_OK) returns success (still with a timeout, though)?

Obviously, this is a case where semaphores and such would be nice, but
since we don't have a main() context which is globally "first", I can't
think of a way to initialize things appropriately.

> > 3) The preparer then prepares the file, that is, it copies the data as
> > before. If that fails, the preparer unlinks the file.
> > 4) Regardless of the success of the prepare, the preparer will LOCK_SH
> > and then LOCK_UN the file, to trigger all the other sharers to
> > continue. If the prepare failed, all the sharers will try to access()
> > the file, fail, and fall back to unlinked files. If the file does
> > exist, though, it is assumed to be safe to use as a copy of the
> > segment.
> >
> > This seems noticeably faster than the hugetlbd method.
> > The files in the hugetlbfs mount point are uniquely named as:
> >
> > $(executable name)_$(gid of user)_$(word size)_$(program header)
> >
> > Thus, sharing is only possible between members of the same group
> > (security concern), and 32-bit executables won't share the same file
> > as 64-bit executables. The program header is what differentiates
> > between multiple segments of the same executable.
>
> Ok, looking pretty good. A few minor concerns for later revisions:
>
> 	- We may need to /-escape the executable's full path or
> otherwise be careful that our identifier for the executable really is
> unique enough.

Ah, true, I hadn't thought of that (I was just happy the code worked on
my first try :) I was trying to avoid dealing with the full path, since
if we append any further information (although your directory idea could
get us around appending the gid), we may get close to PATH_MAX,
potentially. Not sure if that's a real concern, though.

I guess the real issue is if some system has two different binaries,
both the same word-size and run by the same user, but in different
locations, we run into a collision. That seems like a bad system to me,
but I guess I do have some binaries in my ~/bin and some in
/usr/local/bin or /usr/bin, where my ~/bin version is customized to my
own usage of the program. Hrm, I'll think about making this more unique,
if I can. I'd rather avoid trying to create a unique hash of the binary
:) I'd guess any such hash would require another library, which is a
dependency we've tried to avoid for libhugetlbfs.

> 	- There's still a security concern here - someone who's not in
> the right group could drop in a file with the expected gid of someone
> else. Bad. I think a better approach would be to use per-group, or
> per-user, subdirectories within the hugetlbfs, which can be made
> writable by only the right people.
> I think having an environment-configurable place for the cached
> hugepage files, rather than dumping them all in the top of the first
> hugetlbfs we find, isn't a bad idea in any case.

Ah yes, configurability is a good thing. I was trying to go for the
simple approach, just to get this discussion going. TBH, Adam and I did
discuss the idea of having directories to ease organization. I like the
per-group subdirectory idea; I'll make some modifications to my local
tree and see how complicated it gets (I don't expect it to be too bad).

> It's perhaps worth noting that the current daemon implementation also
> has a gaping security hole, since there's a window where the file is
> world writable. And even if that's fixed, the daemon doesn't actually
> enforce that the first preparer puts the correct data into the
> hugetlbfs file.

Right. I'm not sure how the daemon (or even the new code) can "enforce"
that the correct data is in the file. We have to trust the library to do
the right thing. I suppose you might be referring to the case where a
rogue binary knows our socket transmissions (for the daemon case) and
fakes them. In that sense, this new code is more secure, as it's all
contained within our library. And the perms issue is due to the daemon
running as root. I could change the daemon to chown the file temporarily
to the right UID and then chmod it R/W for the owner only. But I think
that, with some work, this new code will be much better.

> 	- For ease of management, it may be good to pack all the
> segments of an executable into the one hugetlbfs file at different
> offsets. (I had been planning to eventually get around to that for the
> unlinked file case since way back)

Also true, and my very early implementations of the sharing were similar
to this, in the sense that we treated each hugetlbfs file almost like an
address space (Adam's analogy, iirc) for some executable. I dropped it
as it was sort of orthogonal to the current work.
The other problem with that is that the current per-segment files
eventually allow for some parallelization: if the first sharer attempts
to grab the file for the BSS/DATA, say, and fails, it can try to grab
the file for the TEXT, which probably should succeed, since the preparer
is working on the BSS/DATA, and then prepare that. That would reduce the
latency even further (although it is quite low already). Just a thought,
and the balance between parallelization and manageability may lean much
further in favor of your idea; I'm just not sure yet.

> > Compile and run-tested via make func on ppc64 and x86_64.
> >
> > This is still an RFC, though, as I'm not sure of a good way to remove
> > the files from the hugetlbfs mount point, without having effectively
> > a daemon very similar to hugetlbd, but whose only job is to track
> > atimes of files in the hugetlbfs mount point and unlink "old" ones
> > (as specified at daemon start time). If that seems like a reasonable
> > approach, I will modify hugetlbd accordingly.
>
> No need for a daemon, we can just run a cleanup script from cron.
> Actually, I think it might be a good idea to have a "hugetlbtool" or
> somesuch program which could do the cleanup, or could be used to
> prepopulate the hugetlbfs cache with a specified executable. The last
> option could be particularly useful if we ever encounter a big enough
> 32-bit program that we run out of address space in the part where both
> the original mappings and the hugepage SHARED mappings exist
> simultaneously.

Yeah, sorry, by "daemon" I meant some binary. It wouldn't have to be
daemonized like the one we have now, I suppose, and cron is a nice
solution (I guess that would mean it goes in one of the system-wide
crontabs?). The reason I was suggesting a daemon, though, is one of the
nicer features of hugetlbd now: the administrator can specify two
different timeouts for the daemon.
One, irrelevant to this discussion, deals with how long to listen on the
socket before we try to reap files from the hugetlbfs mount point. A
second timeout, which is rather more important to this discussion, deals
with how recently someone must have last tried to use the shared file
for us *not* to reap it. The idea is that if a file has not been shared
within 5 minutes (the default, iirc), then it probably will not be
shared again and can go away.

I suppose the cleanup script (or hugetlbtool or whatever), if in bash,
could do

	find $(hugetlbfs_mnt) -type f -amin +5 -exec unlink {} \;

or something. That of course relies on the atime being appropriately
updated, which I think it will be by the sharers' open()s. Regardless, I
agree it would be nice to have a simple tool to take care of things like
this (and presumably, as time goes on, there will be more such
considerations).

> Or there's the hacky way, that will actually get the right result in
> the cases we're interested in: have the preparer unlink the file a few
> minutes after it's created. AFAIK our actual use cases are things
> like Oracle where all the sharing PIDs will be started at roughly the
> same time.

Yup, exactly, and that's why the daemon behaves the way it does; the use
case was "a bunch of processes will share at the same time, then no one
will share again ever". I think my find command from above, run via cron
regularly, will strike a good balance between maximizing sharing and
minimizing the consumption of hugepages.

Thanks,
Nish

-- 
Nishanth Aravamudan <[EMAIL PROTECTED]>
IBM Linux Technology Center