[Devel] Re: [patch -rss] Make RSS accounting display more user friendly
On 6/22/07, Balbir Singh [EMAIL PROTECTED] wrote:
> The problem with input in bytes is that the user will have to ensure that the input is a multiple of page size, which implies that she would need to use the calculator every time.

Having input in bytes seems pretty natural to me. Why not just have the RSS controller round the input to the nearest page (or whatever granularity of memory the controller is able to limit at)?

Paul

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers
___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel

[Devel] Re: [patch -rss] Make RSS accounting display more user friendly
Paul Menage wrote:
> On 6/22/07, Balbir Singh [EMAIL PROTECTED] wrote:
>> The problem with input in bytes is that the user will have to ensure that the input is a multiple of page size, which implies that she would need to use the calculator every time.
>
> Having input in bytes seems pretty natural to me. Why not just have the RSS controller round the input to the nearest page (or whatever granularity of memory the controller is able to limit at)?
>
> Paul

Hi, Paul,

I am not a CLUI expert, but rounding off bytes is something that administrators will probably complain about. Since we manage the controller memory in pages, it might be the easiest unit to use. The output is a totally different matter.

Having said that, I am not opposed to your suggestion; I'll see if I can find good CLUI guidelines.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

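For readers following the bytes-versus-pages discussion above, here is a minimal sketch of what rounding a byte limit to page granularity could look like. It is not taken from the RSS controller patch: the 4096-byte page size and all function names are assumptions made for illustration, and overflow for very large inputs is ignored.

```c
#include <stdint.h>

/* Assumed page size for illustration only; kernel code would use PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096ULL

/* Round a byte limit to the nearest whole number of pages (ties round up). */
static uint64_t limit_bytes_to_pages_nearest(uint64_t bytes)
{
	return (bytes + SKETCH_PAGE_SIZE / 2) / SKETCH_PAGE_SIZE;
}

/* Round a byte limit up, so the applied limit is never below the request. */
static uint64_t limit_bytes_to_pages_up(uint64_t bytes)
{
	return (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* Convert back to bytes so the limit actually applied can be displayed. */
static uint64_t limit_pages_to_bytes(uint64_t pages)
{
	return pages * SKETCH_PAGE_SIZE;
}
```

Whether to round to nearest or round up is a policy choice; reporting the applied value back in bytes is one way to address the concern that administrators would be surprised by silent rounding.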
[Devel] Re: [PATCH -mm 2/2] x86_64: semi-rewrite of PTRACE_PEEKUSR, PTRACE_POKEUSR
On Wed, Jun 20, 2007 at 02:41:48PM -0700, Roland McGrath wrote:
> What's the purpose of the change?

Chopping small bits of utrace to mainline. The regset stuff looks reasonable and self-contained enough to start with.

However, the regset part in utrace contains quite a few unused things, so I'm leaving those alone. Their time will come (or won't).

This way we can merge non-racy stuff first and leave core utrace for later.

[Devel] Re: [NETFILTER] early_drop() improvement (v3)
Vasily Averin wrote:
> +static int early_drop(const struct nf_conntrack_tuple *orig)
> +{
> +	unsigned int i, hash, cnt;
> +	int ret = 0;
> +
> +	hash = hash_conntrack(orig);
> +	cnt = NF_CT_PER_BUCKET;
> +
> +	for (i = 0;
> +	     !ret && cnt && i < nf_conntrack_htable_size;
> +	     ++i, hash = ++hash % nf_conntrack_htable_size)
> +		ret = __early_drop(nf_conntrack_hash[hash], cnt);

Formatting is a bit ugly, looks much nicer as:

	for (i = 0; i < nf_conntrack_htable_size; i++) {
		ret = __early_drop(nf_conntrack_hash[hash], cnt);
		if (ret || !cnt)
			break;
		hash = ++hash % nf_conntrack_htable_size;
	}

> @@ -1226,7 +1243,7 @@ int __init nf_conntrack_init(void)
>  		if (nf_conntrack_htable_size < 16)
>  			nf_conntrack_htable_size = 16;
>  	}
> -	nf_conntrack_max = 8 * nf_conntrack_htable_size;
> +	nf_conntrack_max = NF_CT_PER_BUCKET * nf_conntrack_htable_size;

I don't like the NF_CT_PER_BUCKET constant. First of all, each conntrack is hashed twice, so it's really only 1/2 of the average conntracks per bucket. Secondly, it's only a default and many people use nf_conntrack_max = nf_conntrack_htable_size / 2, so using this constant for early_drop seems wrong. Perhaps make it 2 * nf_conntrack_max / nf_conntrack_htable_size or even add a nf_conntrack_eviction_range sysctl.

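The review above suggests bounding the scan by 2 * nf_conntrack_max / nf_conntrack_htable_size rather than by a fixed constant. The following is a user-space sketch of that idea, not the netfilter code: `struct sketch_bucket`, `try_drop_in_bucket()` and `early_drop_sketch()` are hypothetical stand-ins, and a bucket is modelled as just an entry count plus a flag. It also writes the wrap-around step as `(hash + 1) % size`, since `hash = ++hash % size` modifies `hash` twice in one expression.

```c
/* Hypothetical stand-in for a hash bucket: how many entries it holds and
 * whether one of them is droppable. */
struct sketch_bucket {
	unsigned int entries;
	int has_droppable;
};

/* Scan budget as suggested in the review: 2 * max / table size, minimum 1. */
static unsigned int eviction_range(unsigned int ct_max, unsigned int htable_size)
{
	unsigned int range = 2 * ct_max / htable_size;

	return range ? range : 1;
}

/* Inspect one bucket, charging each inspected entry against *cnt.
 * Simplification: the droppable entry is only found if the budget allows
 * the whole bucket to be inspected. Returns 1 if an entry was dropped. */
static int try_drop_in_bucket(struct sketch_bucket *b, unsigned int *cnt)
{
	unsigned int looked = b->entries < *cnt ? b->entries : *cnt;

	*cnt -= looked;
	if (b->has_droppable && b->entries > 0 && looked == b->entries) {
		b->entries--;
		b->has_droppable = 0;
		return 1;
	}
	return 0;
}

/* Walk buckets starting at 'hash' until something is dropped, the budget
 * is exhausted, or every bucket has been visited once. */
static int early_drop_sketch(struct sketch_bucket *table, unsigned int size,
			     unsigned int hash, unsigned int cnt)
{
	unsigned int i;
	int ret = 0;

	for (i = 0; i < size; i++) {
		ret = try_drop_in_bucket(&table[hash], &cnt);
		if (ret || !cnt)
			break;
		hash = (hash + 1) % size;
	}
	return ret;
}
```

With a table of 16 buckets and a maximum of 8 entries (the "max = size / 2" configuration mentioned above), `eviction_range(8, 16)` yields a budget of 1, which shows how the budget follows the configured ratio instead of a compile-time constant.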
[Devel] Re: [RFD] L2 Network namespace infrastructure
Quoting David Miller ([EMAIL PROTECTED]):
> From: [EMAIL PROTECTED] (Eric W. Biederman)
> Date: Sat, 23 Jun 2007 11:19:34 -0600
>
>> Further and fundamentally all a global achieves is removing the need for the noise patches where you pass the pointer into the various functions. For long term maintenance it doesn't help anything.
>
> I don't accept that we have to add another function argument to a bunch of core routines just to support this crap, especially since you give no way to turn it off and get that function argument slot back.
>
> To be honest I think this form of virtualization is a complete waste of time, even the openvz approach.
>
> We're protecting the kernel from itself, and that's an endless uphill battle that you will never win. Let's do this kind of

Hi David,

just to be clear this isn't so much about security. Security can be provided using selinux, just as with the userid namespace. But like with the userid namespace, this provides usability for the virtual servers, plus some support for restarting checkpointed applications.

That doesn't attempt to justify the extra argument - if you don't like it, you don't like it :) Just wanted to clarify.

thanks,
-serge

> stuff properly with a real minimal hypervisor, hopefully with appropriate hardware level support and good virtualized device interfaces, instead of this namespace stuff.
>
> At least the hypervisor approach you have some chance to fully harden in some verifiable and truly protected way, with namespaces it's just a pipe dream and everyone who works on these namespace approaches knows that very well.
>
> The only positive thing that came out of this work is the great auditing that the openvz folks have done and the bugs they have found, but it basically ends right there.

[Devel] Re: [RFD] L2 Network namespace infrastructure
Quoting Jeff Garzik ([EMAIL PROTECTED]):
> Eric W. Biederman wrote:
>> Jeff Garzik [EMAIL PROTECTED] writes:
>>> David Miller wrote:
>>>> I don't accept that we have to add another function argument to a bunch of core routines just to support this crap, especially since you give no way to turn it off and get that function argument slot back.
>>>>
>>>> To be honest I think this form of virtualization is a complete waste of time, even the openvz approach.
>>>>
>>>> We're protecting the kernel from itself, and that's an endless uphill battle that you will never win. Let's do this kind of stuff properly with a real minimal hypervisor, hopefully with appropriate hardware level support and good virtualized device interfaces, instead of this namespace stuff.
>>>
>>> Strongly seconded. This containerized virtualization approach just bloats up the kernel for something that is inherently fragile and IMO less secure -- protecting the kernel from itself.
>>>
>>> Plenty of other virt approaches don't stir the code like this, while simultaneously providing fewer, more-clean entry points for the virtualization to occur.
>>
>> Wrong. I really don't want to get into a my virtualization approach is better than yours. But this is flat out wrong.
>>
>> 99% of the changes I'm talking about introducing are just:
>> - variable
>> + ptr->variable
>>
>> There are more pieces mostly with when we initialize those variables but that is the essence of the change.
>
> You completely dodged the main objection. Which is OK if you are selling something to marketing departments, but not OK
>
> Containers introduce chroot-jail-like features that give one a false sense of security, while still requiring one to poke holes in the illusion to get hardware-specific tasks accomplished.
>
> The capable/not-capable model (i.e. superuser / normal user) is _still_ being secured locally, even after decades of work and whitepapers and audits.
>
> You are drinking Deep Kool-Aid if you think adding containers to the myriad kernel subsystems does anything besides increasing fragility, and decreasing security. You are securing in-kernel subsystems against other in-kernel subsystems.

No we're not. As the name 'network namespaces' implies, we are introducing namespaces for network-related variables. That's it.

We are not trying to protect in-kernel subsystems from each other. In fact we're not even trying to protect userspace processes from each other. Though that will in part come free when user processes can't access each other's data because they are in different namespaces. But using an LSM like selinux or a custom one to tag and enforce isolation would still be encouraged.

> superuser/user model made that difficult enough... now containers add exponential audit complexity to that. Who is to say that a local root does not also pierce the container model?

At the moment it does.

>> And as opposed to other virtualization approaches so far no one has been able to measure the overhead. I suspect there will be a few more cache line misses somewhere but they haven't shown up yet.
>>
>> If the only use was strong isolation which Dave complains about I would concur that the namespace approach is inappropriate. However there are a lot of other uses.
>
> Sure there are uses. There are uses to putting the X server into the kernel, too. At some point complexity and featuritis has to take a back seat to basic sanity.

Generally true, yes.

-serge

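The "- variable / + ptr->variable" change described in the thread above, and the extra function argument it implies, can be illustrated with a small sketch. This is not code from the network namespace patches; `struct sketch_net_ns` and both accessor functions are hypothetical names chosen for the example.

```c
/* Before: a single global, shared by every process on the system. */
static int sketch_sysctl_value = 64;

/* After: the same variable lives in a per-namespace structure. */
struct sketch_net_ns {
	int sysctl_value;
};

/* "- variable": the code reads the global directly. */
static int read_global(void)
{
	return sketch_sysctl_value;
}

/* "+ ptr->variable": the caller passes the namespace it is running in,
 * which is the extra function argument objected to in the thread. */
static int read_in_ns(const struct sketch_net_ns *ns)
{
	return ns->sysctl_value;
}
```

Two namespaces initialised with different values return different results from `read_in_ns()`, while `read_global()` can only ever describe one system-wide setting.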
[Devel] Re: [RFC] mm-controller
Peter Zijlstra wrote:
> On Fri, 2007-06-22 at 22:05 +0530, Vaidyanathan Srinivasan wrote:
>> Merging both limits will eliminate the issue, however we would need individual limits for pagecache and RSS for better control. There are use cases for pagecache_limit alone without RSS_limit like the case of database application using direct IO, backup applications and streaming applications that do not make good use of pagecache.
>
> I'm aware that some people want this. However we rejected adding a pagecache limit to the kernel proper on grounds that reclaim should do a better job.
>
> And now we're sneaking it in the backdoor.

Well, we are trying to provide a memory controller and page cache is a part of memory. The page reclaimer does treat page cache separately. Isn't this approach better than simply extending the vm_swappiness to per container?

> If we're going to do this, get it in the kernel proper first.

I'm open to this. There were several patches to do this. We can do this by splitting the LRU list to mapped and unmapped pages or by trying to balance the page cache by tracking its usage.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

[Devel] Re: [RFC] mm-controller
Paul Menage wrote:
> On 6/25/07, Paul Menage [EMAIL PROTECTED] wrote:
>> On 6/22/07, Vaidyanathan Srinivasan [EMAIL PROTECTED] wrote:
>>> Merging both limits will eliminate the issue, however we would need individual limits for pagecache and RSS for better control. There are use cases for pagecache_limit alone without RSS_limit like the case of database application using direct IO, backup applications and streaming applications that do not make good use of pagecache.
>>
>> If streaming applications would otherwise litter the pagecache with unwanted data, then limiting their total memory footprint (with a single limit) and forcing them to drop old data sooner sounds like a great idea.
>
> Actually, reading what you wrote more carefully, that's sort of what you were already saying. But it's not clear why you wouldn't also want to limit the anon pages for a job, if you're already concerned that it's not playing nicely with the rest of the system.

Hi Paul,

Limiting memory footprint (RSS and pagecache) for multimedia applications would work. However, generally streaming applications have a fairly constant RSS size (mapped pagecache pages + ANON) while the unmapped pagecache pages are what we want to control better.

If we have a combined limit for unmapped pagecache pages and RSS, then we will have to bring in vm_swappiness kind of knobs for each container to influence the per container reclaim process so as to not hurt the application performance badly.

RSS controller should be able to take care of the mapped memory footprint if needed. In case of a database server, moving out any of its RSS pages will hurt its performance, while we are free to shrink the unmapped pagecache pages to any smaller limit since the database is using direct IO and does not benefit from pagecache.

With pagecache controller, we are able to split an application's memory pages into mapped and unmapped pages. Ability to account and control unmapped pages in memory provides more possibilities for fine grain resource management.

--Vaidy

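The split between mapped pages (RSS) and unmapped pagecache pages argued for above can be pictured as two independent counters with two independent limits. The sketch below is a user-space illustration under that assumption only; `struct sketch_container`, `sketch_charge()` and the enum are hypothetical names and do not come from the pagecache controller patches.

```c
#include <stdbool.h>

enum sketch_page_kind { SKETCH_MAPPED, SKETCH_UNMAPPED_CACHE };

struct sketch_container {
	unsigned long rss_pages;        /* mapped pagecache + anon in use */
	unsigned long rss_limit;
	unsigned long pagecache_pages;  /* unmapped pagecache in use */
	unsigned long pagecache_limit;
};

/* Charge one page to the matching counter. Returns false when that
 * counter's limit is reached, which is the point where a controller
 * would have to reclaim pages of that kind only. */
static bool sketch_charge(struct sketch_container *c, enum sketch_page_kind kind)
{
	unsigned long *usage = kind == SKETCH_MAPPED ? &c->rss_pages
						     : &c->pagecache_pages;
	unsigned long limit = kind == SKETCH_MAPPED ? c->rss_limit
						    : c->pagecache_limit;

	if (*usage >= limit)
		return false;
	(*usage)++;
	return true;
}
```

For the direct-IO database case described above, a container could be given a generous `rss_limit` and a very small `pagecache_limit`, so that hitting the pagecache limit never puts pressure on the mapped pages.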
[Devel] Re: [PATCH 1/4] sysfs: Remove first pass at shadow directory support
On Fri, Jun 22, 2007 at 01:33:42AM -0600, Eric W. Biederman wrote:
> While shadow directories appear to be a good idea, the current scheme of controlling their creation and destruction outside of sysfs appears to be a locking and maintenance nightmare in the face of sysfs directories dynamically coming and going. Which can now occur for directories containing network devices when CONFIG_SYSFS_DEPRECATED is not set.
>
> This patch removes everything from the initial shadow directory support that allowed the shadow directory creation to be controlled at a higher level. So except for a few bits of sysfs_rename_dir everything from commit b592fcfe7f06c15ec11774b5be7ce0de3aa86e73 is now gone.
>
> Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
> ---
>
> These patches are against 2.6.22-rc4-mm2 Hopefully that is new enough to catch all of the in flight sysfs patches.

Ick, no, it isn't and doesn't apply at all :(

Can you try the next -mm release?

thanks,

greg k-h

[Devel] Re: [PATCH 13/28] [PREP 13/14] Miscellaneous preparations in pid namespaces
Pavel Emelianov [EMAIL PROTECTED] wrote:
| [EMAIL PROTECTED] wrote:
| > Pavel Emelianov [EMAIL PROTECTED] wrote:
| > | The most important one is moving exit_task_namespaces behind exit_notify
| > | in do_exit() to make it possible to see the task's pid namespace to
| > | properly notify the parent.
| >
| > Hmm. I think we tried this once a few months ago and got a crash in nfsd
| > See http://lkml.org/lkml/2007/1/17/75
| >
| > [c01f6115] lockd_down+0x125/0x190
| > [c01d26bd] nfs_free_server+0x6d/0xd0
| > [c01d8e9c] nfs_kill_super+0xc/0x20
| > [c0161c5d] deactivate_super+0x7d/0xa0
| > [c0175e0e] release_mounts+0x6e/0x80
| > [c0175e86] __put_mnt_ns+0x66/0x80
| > [c0132b3e] free_nsproxy+0x5e/0x60
| > // exit_task_namespaces() after returning from exit_notify()
| > [c011f021] do_exit+0x791/0x810
| > [c011f0c6] do_group_exit+0x26/0x70
| > [c0103142] sysenter_past_esp+0x5f/0x85
| >
| > exit_notify() sets current->sighand to NULL and I think lockd_down() called
| > from exit_task_namespaces/__put_mnt_ns() was accessing current->sighand.
|
| If sighand is set to NULL and then accessed then how is this related to pid namespace?

Switching the order of exit_notify() and exit_task_namespaces() is what caused the problem when we did it before. If you exit_task_namespaces() before exit_notify() as the mainline code does, you won't see this because nfsd would have freed its super by then.

| > Do your other patches in this set tweak something to prevent it ?
|
| I think no. I'll check it for my current patches.

Buried in that thread was a test case to repro the problem. Maybe that will help.

| >
| > That's one of the reasons we had to remove pid_ns from nsproxy and use
| > the pid_ns from pid->upid_list[i]->pid_ns.
| >
| > Suka
|

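The ordering problem discussed above can be reduced to a small model: one step clears a pointer, and another step, if run afterwards, needs that pointer. The sketch below is a user-space illustration only. The `sketch_` functions are hypothetical stand-ins for exit_notify() and exit_task_namespaces(), and the assumption that the namespace teardown path touches `sighand` follows the tentative analysis in the thread rather than a confirmed reading of the kernel source.

```c
#include <stddef.h>

struct sketch_sighand { int lock; };

struct sketch_task {
	struct sketch_sighand *sighand;
	int ns_released;
};

/* Stand-in for exit_notify(): afterwards the task has no sighand. */
static void sketch_exit_notify(struct sketch_task *t)
{
	t->sighand = NULL;
}

/* Stand-in for exit_task_namespaces() and the teardown it triggers, which
 * is assumed here to need t->sighand. Returns -1 at the point where the
 * kernel would dereference a NULL pointer. */
static int sketch_exit_task_namespaces(struct sketch_task *t)
{
	if (t->sighand == NULL)
		return -1;
	t->ns_released = 1;
	return 0;
}
```

Calling `sketch_exit_task_namespaces()` before `sketch_exit_notify()` succeeds, which corresponds to the mainline order; swapping the two calls makes the teardown step fail, which corresponds to the crash reported in the thread.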
[Devel] Re: [PATCH 1/4] sysfs: Remove first pass at shadow directory support
Greg KH [EMAIL PROTECTED] writes:
> Ick, no, it isn't and doesn't apply at all :(

Groan. I wonder what changed this time...

> Can you try the next -mm release?

Ok. As soon as I get back from OLS, I will rebase against whatever is current.

Eric