On 14.10.2006 [12:45:49 +1000], David Gibson wrote:
> On Fri, Oct 13, 2006 at 05:14:00PM -0700, Nishanth Aravamudan wrote:
> > On 13.10.2006 [12:47:25 -0700], Nishanth Aravamudan wrote:
> > > On 13.10.2006 [12:02:44 +1000], David Gibson wrote:
> > > > Nish, or someone else who understands the sharing implementation:
> > > >
> > > > Can you explain to me the need for the hugetlbd? i.e. why is it
> > > > necessary to have a daemon to manage the shared mappings, rather than
> > > > just using a naming convention within the hugetlbfs to allow the
> > > > libhugetlbfs instances to directly find the right shared file?
> > >
> > > This is a very good question, and one I don't have the best answer to
> > > right now. As I thought of various reasons that using a naming
> > > convention *wouldn't* work, I immediately came up with ways to work
> > > around issues.
> > >
> > > Let me work up a patch that might get rid of *a lot* of code and see
> > > if I can't test it a bit. Give me a few days, though.
> >
> > Hrm, some Fridays are better than others, apparently.
> >
> > Description: Remove all of the daemonized sharing code and replace it
> > with a filename-based approach. The general flow of the new sharing code
> > is now:
>
> Yay!
Thanks :)

> > 1) The first sharer (aka the preparer) will open() the hugetlbfs file
> > with O_EXCL|O_CREAT. All other sharers will fail this open and will
> > open the file O_RDONLY instead.
> > 2) The preparer will then flock() the file with LOCK_EX, to cause the
> > other sharers to block on their flock() calls, as they try to use
> > LOCK_SH.
>
> Ok, do we have a small race here if we get:
>
> 	CPU A			CPU B
> 	open(O_EXCL)
> 				open(O_RDONLY)
> 				flock(LOCK_SH)
> 	flock(LOCK_EX)
>
> We may need some atomic renames to work around this.

Yes, I agree. Would you suggest I create a temporary file (known only to
CPU A's process, as the preparer), then use rename() to change the file
name (after a successful prepare) to the appropriate one? Will the
flock(LOCK_EX) carry across the rename()? Also, is rename() sufficiently
atomic in this context?

Finally, would we then want CPU B (any sharer, really) to loop on the
open() until it succeeds, with an appropriate timeout? Or perhaps,
rather than go the heavy-handed open() route, just wait until
access(file, F_OK) returns success (still with a timeout, though)?

Obviously, this is a case where semaphores and such would be nice, but
since we don't have a main() context which is globally "first", I can't
think of a way to initialize things appropriately.

> > 3) The preparer then prepares the file, that is, it copies the data as
> > before. If that fails, the preparer unlinks the file.
> > 4) Regardless of the success of the prepare, the preparer will LOCK_SH
> > and then LOCK_UN the file, to trigger all the other sharers to
> > continue. If the prepare failed, all the sharers will try to access()
> > the file, fail, and fall back to unlinked files. If the file does
> > exist, though, it is assumed to be safe to use as a copy of the
> > segment.
> >
> > This seems noticeably faster than the hugetlbd method.
> > The files in the hugetlbfs mount point are uniquely named as:
> >
> > $(executable name)_$(gid of user)_$(word size)_$(program header)
> >
> > Thus, sharing is only possible between members of the same group
> > (security concern), and 32-bit executables won't share the same file
> > as 64-bit executables. The program header is what differentiates
> > between multiple segments of the same executable.
>
> Ok, looking pretty good. A few minor concerns for later revisions:
>
> 	- We may need to /-escape the executable's full path or
> otherwise be careful that our identifier for the executable really is
> unique enough.

Ah, true, I hadn't thought of that (I was just happy the code worked on
my first try :) I was trying to avoid dealing with the full path, since
if we append any further information (although your directory idea could
get us around appending the gid), we may get close to PATH_MAX,
potentially. Not sure if that's a real concern, though.

I guess the real issue is if some system has two different binaries,
both the same word-size and run by the same user, but in different
locations, we run into a collision. That seems like a bad system to me,
but I guess I do have some binaries in my ~/bin and some in
/usr/local/bin or /usr/bin, where my ~/bin version is customized to my
own usage of the program. Hrm, I'll think about making this more unique,
if I can. I'd rather avoid trying to create a unique hash of the binary
:) I'd guess any such hash would require another library, which is a
dependency we've tried to avoid for libhugetlbfs.

> 	- There's still a security concern here - someone who's not in
> the right group could drop in a file with the expected gid of someone
> else. Bad. I think a better approach would be to use per-group, or
> per-user, subdirectories within the hugetlbfs, which can be made
> writable by only the right people.
> I think having an environment-configurable place for the cached
> hugepage files, rather than dumping them all in the top of the first
> hugetlbfs we find, isn't a bad idea in any case.

Ah yes, configurability is a good thing. I was trying to go for the
simple approach, just to get this discussion going. TBH, Adam and I did
discuss the idea of having directories to ease organization. I like the
per-group subdirectory idea; I'll make some modifications to my local
tree and see how complicated it gets (I don't expect it to be too bad).

> It's perhaps worth noting that the current daemon implementation also
> has a gaping security hole, since there's a window where the file is
> world writable. And even if that's fixed, the daemon doesn't actually
> enforce that the first preparer puts the correct data into the
> hugetlbfs file.

Right. I'm not sure how the daemon (or even the new code) can "enforce"
that the correct data is in the file. We have to trust the library to do
the right thing. I suppose you might be referring to the case where a
rogue binary knows our socket transmissions (for the daemon case) and
fakes them. In that sense, this new code is more secure, as it's all
contained within our library. And the perms issue is due to the daemon
running as root. I could change the daemon to chown the file temporarily
to the right UID and then chmod it R/W for the owner only. But I think
that, with some work, this new code will be much better.

> 	- For ease of management, it may be good to pack all the
> segments of an executable into the one hugetlbfs file at different
> offsets. (I had been planning to eventually get around to that for the
> unlinked file case since way back)

Also true, and my very early implementations of the sharing were similar
to this, in the sense that we treated each hugetlbfs file almost like an
address space (Adam's analogy, iirc) for some executable. I dropped it
as it was sort of orthogonal to the current work.
The other problem with that is that the current per-segment files
eventually allow for some parallelization: if the first sharer attempts
to grab the file for the BSS/DATA, say, and fails, it can try to grab
the file for the TEXT, which probably should succeed, since the preparer
is working on the BSS/DATA, and then prepare that. That would reduce the
latency even further (although it is quite low already). Just a thought,
and the balance between parallelization and manageability may lean much
further in favor of your idea; I'm just not sure yet.

> > Compile and run-tested via make func on ppc64 and x86_64.
> >
> > This is still an RFC, though, as I'm not sure of a good way to remove
> > the files from the hugetlbfs mount point, without having effectively
> > a daemon very similar to hugetlbd, but whose only job is to track
> > atimes of files in the hugetlbfs mount point and unlink "old" ones
> > (as specified at daemon start time). If that seems like a reasonable
> > approach, I will modify hugetlbd accordingly.
>
> No need for a daemon, we can just run a cleanup script from cron.
> Actually, I think it might be a good idea to have a "hugetlbtool" or
> somesuch program which could do the cleanup, or could be used to
> prepopulate the hugetlbfs cache with a specified executable. The last
> option could be particularly useful if we ever encounter a big enough
> 32-bit program that we run out of address space in the part where both
> the original mappings and the hugepage SHARED mappings exist
> simultaneously.

Yeah, sorry, by "daemon" I meant some binary. It wouldn't have to be
daemonized like the one we have now, I suppose, and cron is a nice
solution (I guess that would mean it goes in one of the system-wide
crontabs?). The reason I was suggesting a daemon, though, is one of the
nicer features of hugetlbd now: the administrator can specify two
different timeouts for the daemon.
One, irrelevant to this discussion, deals with how long to listen on the
socket before we try to reap files from the hugetlbfs mount point. A
second timeout, which is rather more important to this discussion, deals
with how recently someone must have last tried to use the shared file
for us *not* to reap it. The idea is that if a file has not been shared
within 5 minutes (the default, iirc), then it probably will not be
shared again and can go away.

I suppose the cleanup script (or hugetlbtool or whatever), if in bash,
could do

	find $(hugetlbfs_mnt) -type f -amin +5 -exec unlink {} \;

or something. That of course relies on the atime being appropriately
updated, which I think it will be by the sharers' open()s. Regardless, I
agree it would be nice to have a simple tool to take care of things like
this (and presumably, as time goes on, there will be more such
considerations).

> Or there's the hacky way, that will actually get the right result in
> the cases we're interested in: have the preparer unlink the file a few
> minutes after it's created. AFAIK our actual use cases are things
> like Oracle where all the sharing PIDs will be started at roughly the
> same time.

Yup, exactly, and that's why the daemon behaves the way it does; the use
case was "a bunch of processes will share at the same time, then no one
will share again ever". I think my find command from above, run via cron
regularly, will strike a good balance between maximizing sharing and
minimizing the consumption of hugepages.

Thanks,
Nish

-- 
Nishanth Aravamudan <[EMAIL PROTECTED]>
IBM Linux Technology Center