On Fri, Aug 30, 2013 at 9:15 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> Hi,
>
> On 2013-08-28 15:20:57 -0400, Robert Haas wrote:
>> > That way any corruption in that area will prevent restarts without
>> > reboot unless you use ipcrm, or such, right?
>>
>> The way I've designed it, no. If what we expect to be the control
>> segment doesn't exist or doesn't conform to our expectations, we just
>> assume that it's not really the control segment after all - e.g.
>> someone rebooted, clearing all the segments, and then an unrelated
>> process (malicious, perhaps, or just a completely different cluster)
>> reused the same name. This is similar to what we do for the main
>> shared memory segment.
>
> The case I am mostly wondering about is some process crashing and
> overwriting random memory. We need to be pretty sure that we'll never
> fail partially through cleaning up old segments because they are
> corrupted or because we died halfway through our last cleanup attempt.
>
>> > I think we want that during development, but I'd rather not go there
>> > when releasing. After all, we don't support a manual choice between
>> > anonymous mmap/sysv shmem either.
>
>> That's true, but that decision has not been uncontroversial - e.g. the
>> NetBSD guys don't like it, because they have a big performance
>> difference between those two types of memory. We have to balance the
>> possible harm of one more setting against the benefit of letting
>> people do what they want without needing to recompile or modify code.
>
> But then, it made them fix the issue afaik :P
>
>> >> In addition, I've included an implementation based on mmap of a plain
>> >> file. As compared with a true shared memory implementation, this
>> >> obviously has the disadvantage that the OS may be more likely to
>> >> decide to write back dirty pages to disk, which could hurt
>> >> performance. However, I believe it's worthy of inclusion all the
>> >> same, because there are a variety of situations in which it might be
>> >> more convenient than one of the other implementations. One is
>> >> debugging.
>> >
>> > Hm. Not sure what's the advantage over a corefile here.
>
>> You can look at it while the server's running.
>
> That's what debuggers are for.
>
>> >> On MacOS X, for example, there seems to be no way to list
>> >> POSIX shared memory segments, and no easy way to inspect the contents
>> >> of either POSIX or System V shared memory segments.
>
>> > Shouldn't we ourselves know which segments are around?
>
>> Sure, that's the point of the control segment. But listing a
>> directory is a lot easier than figuring out what the current control
>> segment contents are.
>
> But without a good amount of tooling - like in a debugger... - it's not
> very interesting to look at those files either way? The mere presence of
> a segment doesn't tell you much and the contents won't be easily
> readable.
>
>> >> Another use case is working around an administrator-imposed or
>> >> OS-imposed shared memory limit. If you're not allowed to allocate
>> >> shared memory, but you are allowed to create files, then this
>> >> implementation will let you use whatever facilities we build on top
>> >> of dynamic shared memory anyway.
>> >
>> > I don't think we should try to work around limits like that.
>
>> I do. There's probably someone, somewhere in the world who thinks
>> that operating system shared memory limits are a good idea, but I have
>> not met any such person.
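
For concreteness, the file-backed "mmap" implementation being debated above
amounts to mapping an ordinary file with MAP_SHARED, so that every process
that opens the same path sees the same bytes. A minimal sketch of that idea
(not the actual dsm_impl code; the function name is made up and error
handling is reduced to returning NULL):

/*
 * Sketch only: create/size an ordinary file and map it shared.
 */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

static void *
map_file_backed_segment(const char *path, size_t size)
{
    int     fd;
    void   *addr;

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;

    /*
     * Size the backing file; the kernel may later write dirty pages back
     * to it, which is exactly the performance concern raised above.
     */
    if (ftruncate(fd, (off_t) size) != 0)
    {
        close(fd);
        return NULL;
    }

    addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping survives the close() */
    return (addr == MAP_FAILED) ? NULL : addr;
}

A second backend gets at the same contents by calling the same function with
the same path; only the ability to create a file is needed, not any shared
memory allocation, and the segment is visible under a normal pathname while
the server is running.
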
> > "Let's drive users away from sysv shem" is the only one I heard so far ;) > >> I would never advocate deliberately trying to circumvent a >> carefully-considered OS-level policy decision about resource >> utilization, but I don't think that's the dynamic here. I think if we >> insist on predetermining the dynamic shared memory implementation >> based on the OS, we'll just be inconveniencing people needlessly, or >> flat-out making things not work. [...] > > But using file-backed memory will *suck* performancewise. Why should we > ever want to offer that to a user? That's what I was arguing about > primarily. > >> If we're SURE >> that a Linux user will prefer "posix" to "sysv" or "mmap" or "none" in >> 100% of cases, and that a NetBSD user will always prefer "sysv" over >> "mmap" or "none" in 100% of cases, then, OK, sure, let's bake it in. >> But I'm not that sure. > > I think posix shmem will be preferred to sysv shmem if present, in just > about any relevant case. I don't know of any system with lower limits on > posix shmem than on sysv. > >> I think this case is roughly similar >> to wal_sync_method: there really shouldn't be a performance or >> reliability difference between the ~6 ways of flushing a file to disk, >> but as it turns out, there is, so we have an option. > > Well, most of them actually give different guarantees, so it makes sense > to have differing performance... > >> > Why do we want to expose something unreliable as preferred_address to >> > the external interface? I haven't read the code yet, so I might be >> > missing something here. > >> I shared your opinion that preferred_address is never going to be >> reliable, although FWIW Noah thinks it can be made reliable with a >> large-enough hammer. > > I think we need to have the arguments for that on list then. Those are > pretty damn fundamental design decisions. > I for one cannot see how you even remotely could make that work a) on > windows (check the troubles we have to go through to get s_b > consistently placed, and that's directly after startup) b) 32bit systems.
For Windows, I believe we are already doing something similar (attaching
at a predefined address) for the main shared memory segment. It reserves
memory at a particular address using pgwin32_ReserveSharedMemoryRegion()
before actually starting a process (resuming a process created in
suspended mode), and then, after starting, the backend attaches at the
same address (PGSharedMemoryReAttach).

I think one question here is what the use of exposing preferred_address
is. I can think of only the following:
a. The base OS APIs provide such a provision, so why shouldn't we?
b. While browsing, I found a few examples on IBM's site that also show
   usage with a preferred address.
   http://publib.boulder.ibm.com/infocenter/comphelp/v7v91/index.jsp?topic=%2Fcom.ibm.vacpp7a.doc%2Fproguide%2Fref%2Fcreate_heap.htm
c. A user may wish to attach segments at the same base address so that
   pointers stored in the memory-mapped file can be accessed directly,
   which otherwise would not be possible.

>> But even if it isn't reliable, there doesn't seem to be all that much
>> value in forbidding access to that part of the OS-provided API. In
>> the world where it's not reliable, it may still be convenient to map
>> things at the same address when you can, so that pointers can be
>> used. Of course you'd have to have some fallback strategy for when
>> you don't get the same mapping, and maybe that's painful enough that
>> there's no point after all. Or maybe it's worth having one code path
>> for relativized pointers and another for non-relativized pointers.
>
> It seems likely to me that will end up with untested code in that
> case. Or even unsupported platforms.
>
>> To be honest, I'm not real sure. I think it's clear enough that this
>> will meet the minimal requirements for parallel query - ONE dynamic
>> shared memory segment that's not guaranteed to be at the same address
>> in every backend, and can't be resized after creation. And we could
>> pare the API down to only support that. But I'd rather get some
>> experience with this first before we start taking away options.
>> Otherwise, we may never really find out the limits of what is possible
>> in this area, and I think that would be a shame.
>
> On the other hand, adding capabilities annoys people far less than
> deciding that we can't support them in the end and taking them away.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers