[Devel] Re: [patch -rss] Make RSS accounting display more user friendly

2007-06-25 Thread Paul Menage

On 6/22/07, Balbir Singh [EMAIL PROTECTED] wrote:


The problem with input in bytes is that the user will have to ensure
that the input is a multiple of page size, which implies that she would
need to use the calculator every time.



Having input in bytes seems pretty natural to me. Why not just have
the RSS controller round the input to the nearest page (or whatever
granularity of memory the controller is able to limit at)?
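
The rounding suggested here could look roughly like the sketch below. It is not code from the patch: the helper names are hypothetical, and the page size is passed in as a parameter (4096 is only an assumed example value).

```c
#include <assert.h>

/* Round a byte count to the nearest whole page (a tie rounds up).
 * Hypothetical helper; page_size is supplied by the caller. */
static unsigned long bytes_to_pages_nearest(unsigned long bytes,
                                            unsigned long page_size)
{
    assert(page_size != 0);
    return bytes / page_size +
           (bytes % page_size >= (page_size + 1) / 2 ? 1 : 0);
}

/* Round a byte count up, so the limit is never below what was asked for. */
static unsigned long bytes_to_pages_up(unsigned long bytes,
                                       unsigned long page_size)
{
    assert(page_size != 0);
    return bytes / page_size + (bytes % page_size != 0 ? 1 : 0);
}
```

Note that rounding to the nearest page can turn a small non-zero request into a limit of zero pages, which may be one reason to prefer rounding up.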

Paul
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [patch -rss] Make RSS accounting display more user friendly

2007-06-25 Thread Balbir Singh
Paul Menage wrote:
 On 6/22/07, Balbir Singh [EMAIL PROTECTED] wrote:

 The problem with input in bytes is that the user will have to ensure
 that the input is a multiple of page size, which implies that she would
 need to use the calculator every time.

 
 Having input in bytes seems pretty natural to me. Why not just have
 the RSS controller round the input to the nearest page (or whatever
 granularity of memory the controller is able to limit at)?
 
 Paul

Hi, Paul,

I am not a CLUI expert, but rounding off bytes will be something that
the administrators will probably complain about. Since we manage
the controller memory in pages, it might be the easiest unit to use.
The output is a totally different matter.

Having said that, I am not opposed to your suggestion, I'll see if
I can find good CLUI guidelines.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


[Devel] Re: [PATCH -mm 2/2] x86_64: semi-rewrite of PTRACE_PEEKUSR, PTRACE_POKEUSR

2007-06-25 Thread Alexey Dobriyan
On Wed, Jun 20, 2007 at 02:41:48PM -0700, Roland McGrath wrote:
 What's the purpose of the change?

Chopping small bits of utrace to mainline.

The regset stuff looks reasonable and self-contained enough to start with.
However, the regset part in utrace contains quite a few unused things, so
I'm leaving those alone. Their time will come (or won't).

This way we can merge non-racy stuff first and leave core utrace for
later.



[Devel] Re: [NETFILTER] early_drop() imrovement (v3)

2007-06-25 Thread Patrick McHardy
Vasily Averin wrote:
 +static int early_drop(const struct nf_conntrack_tuple *orig)
 +{
 + unsigned int i, hash, cnt;
 + int ret = 0;
 +
 + hash = hash_conntrack(orig);
 + cnt = NF_CT_PER_BUCKET;
 +
 + for (i = 0;
 + !ret && cnt && i < nf_conntrack_htable_size;
 + ++i, hash = ++hash % nf_conntrack_htable_size)
 + ret = __early_drop(&nf_conntrack_hash[hash], &cnt);

Formatting is a bit ugly, looks much nicer as:

	for (i = 0; i < nf_conntrack_htable_size; i++) {
		ret = __early_drop(&nf_conntrack_hash[hash], &cnt);
		if (ret || !cnt)
			break;
		hash = ++hash % nf_conntrack_htable_size;
	}
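
For readers who want to try the loop shape outside the kernel, here is a minimal self-contained sketch. Everything prefixed `example_`/`EXAMPLE_` is a hypothetical stand-in (the real `__early_drop()` walks a conntrack list), and the hash is advanced with `(hash + 1) % size` rather than `hash = ++hash % size`, which writes `hash` twice in a single expression.

```c
#include <assert.h>

#define EXAMPLE_TABLE_SIZE 8u  /* stand-in for nf_conntrack_htable_size */

/* Stand-in bucket: a number of entries and a flag saying whether one of
 * them is evictable. */
struct example_bucket {
    unsigned int entries;
    int has_victim;
};

/* Stand-in for __early_drop(): examines up to *cnt entries, decrements
 * the remaining budget, and returns 1 if an entry was dropped. */
static int example_try_drop(struct example_bucket *b, unsigned int *cnt)
{
    unsigned int looked = b->entries < *cnt ? b->entries : *cnt;

    *cnt -= looked;
    if (b->has_victim && looked > 0) {
        b->entries--;
        b->has_victim = 0;
        return 1;
    }
    return 0;
}

/* Scan buckets starting at 'start', wrapping around, until an entry is
 * dropped, the budget is used up, or every bucket has been visited. */
static int example_early_drop(struct example_bucket *table,
                              unsigned int start, unsigned int budget)
{
    unsigned int i, hash = start % EXAMPLE_TABLE_SIZE, cnt = budget;
    int ret = 0;

    assert(table);
    for (i = 0; i < EXAMPLE_TABLE_SIZE; i++) {
        ret = example_try_drop(&table[hash], &cnt);
        if (ret || !cnt)
            break;
        hash = (hash + 1) % EXAMPLE_TABLE_SIZE;
    }
    return ret;
}
```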

 @@ -1226,7 +1243,7 @@ int __init nf_conntrack_init(void)
   if (nf_conntrack_htable_size < 16)
   nf_conntrack_htable_size = 16;
   }
 - nf_conntrack_max = 8 * nf_conntrack_htable_size;
 + nf_conntrack_max = NF_CT_PER_BUCKET * nf_conntrack_htable_size;


I don't like the NF_CT_PER_BUCKET constant. First of all, each
conntrack is hashed twice, so it's really only 1/2 of the average
conntracks per bucket. Secondly, it's only a default and many
people use nf_conntrack_max = nf_conntrack_htable_size / 2, so
using this constant for early_drop seems wrong.

Perhaps make it 2 * nf_conntrack_max / nf_conntrack_htable_size
or even add a nf_conntrack_eviction_range sysctl.
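
A rough sketch of deriving the scan budget from the configured values instead of a fixed constant. The function name is hypothetical, and the floor of one entry is an added assumption so that a sparsely configured table still examines something.

```c
#include <assert.h>

/* Budget of entries to examine in early_drop, derived as
 * 2 * max / htable_size (each conntrack is hashed twice).
 * Hypothetical helper, not taken from the patch. */
static unsigned int example_eviction_range(unsigned int conntrack_max,
                                           unsigned int htable_size)
{
    unsigned int range;

    assert(htable_size != 0);
    /* Widen before multiplying so 2 * max cannot overflow. */
    range = (unsigned int)(2ULL * conntrack_max / htable_size);
    return range ? range : 1;  /* assumption: look at one entry at least */
}
```

With the default `nf_conntrack_max = 8 * nf_conntrack_htable_size` this gives 16; with the common `nf_conntrack_max = nf_conntrack_htable_size / 2` setting it gives 1.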



[Devel] Re: [RFD] L2 Network namespace infrastructure

2007-06-25 Thread Serge E. Hallyn
Quoting David Miller ([EMAIL PROTECTED]):
 From: [EMAIL PROTECTED] (Eric W. Biederman)
 Date: Sat, 23 Jun 2007 11:19:34 -0600
 
  Further and fundamentally all a global achieves is removing the need
  for the noise patches where you pass the pointer into the various
  functions.  For long term maintenance it doesn't help anything.
 
 I don't accept that we have to add another function argument
 to a bunch of core routines just to support this crap,
 especially since you give no way to turn it off and get
 that function argument slot back.
 
 To be honest I think this form of virtualization is a complete
 waste of time, even the openvz approach.
 
 We're protecting the kernel from itself, and that's an endless
 uphill battle that you will never win.  Let's do this kind of

Hi David,

just to be clear this isn't so much about security.  Security can be
provided using selinux, just as with the userid namespace.  But like
with the userid namespace, this provides usability for the virtual
servers, plus some support for restarting checkpointed applications.

That doesn't attempt to justify the extra argument - if you don't
like it, you don't like it  :)  Just wanted to clarify.

thanks,
-serge

 stuff properly with a real minimal hypervisor, hopefully with
 appropriate hardware level support and good virtualized device
 interfaces, instead of this namespace stuff.
 
 At least the hypervisor approach you have some chance to fully
 harden in some verifiable and truly protected way, with
 namespaces it's just a pipe dream and everyone who works on
 these namespace approaches knows that very well.
 
 The only positive thing that came out of this work is the
 great auditing that the openvz folks have done and the bugs
 they have found, but it basically ends right there.


[Devel] Re: [RFD] L2 Network namespace infrastructure

2007-06-25 Thread Serge E. Hallyn
Quoting Jeff Garzik ([EMAIL PROTECTED]):
 Eric W. Biederman wrote:
 Jeff Garzik [EMAIL PROTECTED] writes:
 
 David Miller wrote:
 I don't accept that we have to add another function argument
 to a bunch of core routines just to support this crap,
 especially since you give no way to turn it off and get
 that function argument slot back.
 
 To be honest I think this form of virtualization is a complete
 waste of time, even the openvz approach.
 
 We're protecting the kernel from itself, and that's an endless
 uphill battle that you will never win.  Let's do this kind of
 stuff properly with a real minimal hypervisor, hopefully with
 appropriate hardware level support and good virtualized device
 interfaces, instead of this namespace stuff.
 Strongly seconded.  This containerized virtualization approach just
 bloats up the kernel for something that is inherently fragile and IMO
 less secure -- protecting the kernel from itself.
 
 Plenty of other virt approaches don't stir the code like this, while
 simultaneously providing fewer, more-clean entry points for the
 virtualization to occur.
 
 Wrong.  I really don't want to get into a "my virtualization approach
 is better than yours".  But this is flat out wrong.
 
 99% of the changes I'm talking about introducing are just:
 - variable
 + ptr->variable
 
 There are more pieces mostly with when we initialize those variables but
 that is the essence of the change.
 
 You completely dodged the main objection.  Which is OK if you are 
 selling something to marketing departments, but not OK
 
 Containers introduce chroot-jail-like features that give one a false 
 sense of security, while still requiring one to poke holes in the 
 illusion to get hardware-specific tasks accomplished.
 
 The capable/not-capable model (i.e. superuser / normal user) is _still_ 
 being secured locally, even after decades of work and whitepapers and 
 audits.
 
 You are drinking Deep Kool-Aid if you think adding containers to the 
 myriad kernel subsystems does anything besides increasing fragility, and 
 decreasing security.  You are securing in-kernel subsystems against 
 other in-kernel subsystems.

No we're not.  As the name 'network namespaces' implies, we are
introducing namespaces for network-related variables.  That's it.

We are not trying to protect in-kernel subsystems from each other.  In
fact we're not even trying to protect userspace processes from each other.
Though that will in part come free when user processes can't access each
other's data because they are in different namespaces.  But using an
LSM like selinux or a custom one to tag and enforce isolation would
still be encouraged.
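
To make the "variable becomes ptr->variable" change concrete, here is a minimal sketch. The variable and struct names are hypothetical and are not taken from the actual patches; the point is only that a former global gets one copy per namespace and is reached through a pointer.

```c
#include <assert.h>

/* Before: a single global shared by every process (hypothetical name). */
static int example_default_ttl = 64;

static int example_get_ttl_global(void)
{
    return example_default_ttl;
}

/* After: the same variable lives in a per-namespace structure. */
struct example_net_ns {
    int default_ttl;
};

/* Callers now pass the namespace they operate in. */
static int example_get_ttl(const struct example_net_ns *ns)
{
    assert(ns);
    return ns->default_ttl;
}
```

Two namespaces can then hold different values for what used to be one global, which is also where the extra function argument objected to earlier in the thread comes from.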

 superuser/user model made that difficult 
 enough... now containers add exponential audit complexity to that.  Who 
 is to say that a local root does not also pierce the container model?

At the moment it does.

 And as opposed to other virtualization approaches so far no one has been
 able to measure the overhead.  I suspect there will be a few more cache
 line misses somewhere but they haven't shown up yet.
 
 If the only use was strong isolation which Dave complains about I would
 concur that the namespace approach is inappropriate.  However there are
 a lot other uses.
 
 Sure there are uses.  There are uses to putting the X server into the 
 kernel, too.  At some point complexity and featuritis has to take a back 
 seat to basic sanity.

Generally true, yes.

-serge


[Devel] Re: [RFC] mm-controller

2007-06-25 Thread Balbir Singh
Peter Zijlstra wrote:
 On Fri, 2007-06-22 at 22:05 +0530, Vaidyanathan Srinivasan wrote:
 
 Merging both limits will eliminate the issue, however we would need
 individual limits for pagecache and RSS for better control.  There are
 use cases for pagecache_limit alone without RSS_limit like the case of
 database application using direct IO, backup applications and
 streaming applications that do not make good use of pagecache.
 
 I'm aware that some people want this. However we rejected adding a
 pagecache limit to the kernel proper on grounds that reclaim should do a
 better job.
 
 And now we're sneaking it in the backdoor.
 


Well, we are trying to provide a memory controller and page cache is a
part of memory. The page reclaimer does treat page cache separately.
Isn't this approach better than simply extending the vm_swappiness
to per container?


 If we're going to do this, get it in the kernel proper first.
 

I'm open to this. There were several patches to do this. We can do
this by splitting the LRU list to mapped and unmapped pages or
by trying to balance the page cache by tracking its usage.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL



[Devel] Re: [RFC] mm-controller

2007-06-25 Thread Vaidyanathan Srinivasan


Paul Menage wrote:
 On 6/25/07, Paul Menage [EMAIL PROTECTED] wrote:
 On 6/22/07, Vaidyanathan Srinivasan [EMAIL PROTECTED] wrote:
 Merging both limits will eliminate the issue, however we would need
 individual limits for pagecache and RSS for better control.  There are
 use cases for pagecache_limit alone without RSS_limit like the case of
 database application using direct IO, backup applications and
 streaming applications that do not make good use of pagecache.

 If streaming applications would otherwise litter the pagecache with
 unwanted data, then limiting their total memory footprint (with a
 single limit) and forcing them to drop old data sooner sounds like a
 great idea.
 
 Actually, reading what you wrote more carefully, that's sort of what
 you were already saying. But it's not clear why you wouldn't also want
 to limit the anon pages for a job, if you're already concerned that
 it's not playing nicely with the rest of the system.

Hi Paul,

Limiting memory footprint (RSS and pagecache) for multimedia
applications would work.  However, generally streaming applications
have a fairly constant RSS size (mapped pagecache pages + ANON) while
the unmapped pagecache pages are what we want to control better.

If we have a combined limit for unmapped pagecache pages and RSS, then
we will have to bring in vm_swappiness kind of knobs for each
container to influence the per container reclaim process so as to not
hurt the application performance badly.

The RSS controller should be able to take care of the mapped memory
footprint if needed.  In the case of a database server, moving out any of
its RSS pages will hurt its performance, while we are free to shrink the
unmapped pagecache pages to any smaller limit since the database is
using direct IO and does not benefit from pagecache.

With a pagecache controller, we are able to split an application's memory
pages into mapped and unmapped pages. The ability to account and control
unmapped pages in memory provides more possibilities for fine-grained
resource management.
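
As a rough illustration of the split accounting described here, the sketch below keeps separate counters and limits for mapped (RSS) pages and unmapped pagecache pages. All names are hypothetical and the structure is far simpler than a real controller; it only shows that the two limits are enforced independently.

```c
#include <assert.h>

enum example_page_kind { EXAMPLE_MAPPED, EXAMPLE_UNMAPPED_CACHE };

/* Per-container usage and limits, counted in pages (hypothetical). */
struct example_container {
    unsigned long rss_pages;
    unsigned long rss_limit;
    unsigned long unmapped_pages;
    unsigned long unmapped_limit;
};

/* Charge one page of the given kind.  Returns 0 on success, or -1 when
 * that kind's limit is reached (a real controller would reclaim here). */
static int example_charge(struct example_container *c,
                          enum example_page_kind kind)
{
    unsigned long *usage, limit;

    assert(c);
    if (kind == EXAMPLE_MAPPED) {
        usage = &c->rss_pages;
        limit = c->rss_limit;
    } else {
        usage = &c->unmapped_pages;
        limit = c->unmapped_limit;
    }
    if (*usage >= limit)
        return -1;
    (*usage)++;
    return 0;
}
```

A database using direct IO could then be given a generous `rss_limit` and a very small `unmapped_limit`, without one limit eating into the other.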

--Vaidy



[Devel] Re: [PATCH 1/4] sysfs: Remove first pass at shadow directory support

2007-06-25 Thread Greg KH
On Fri, Jun 22, 2007 at 01:33:42AM -0600, Eric W. Biederman wrote:
 
 While shadow directories appear to be a good idea, the current scheme
 of controlling their creation and destruction outside of sysfs appears
 to be a locking and maintenance nightmare in the face of sysfs
 directories dynamically coming and going.  Which can now occur for
 directories containing network devices when CONFIG_SYSFS_DEPRECATED is
 not set.
 
 This patch removes everything from the initial shadow directory
 support that allowed the shadow directory creation to be controlled at
 a higher level.  So except for a few bits of sysfs_rename_dir
 everything from commit b592fcfe7f06c15ec11774b5be7ce0de3aa86e73 is now
 gone.
 
 Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
 ---
 These patches are against 2.6.22-rc4-mm2.  Hopefully that is new enough
 to catch all of the in flight sysfs patches.

Ick, no, it isn't and doesn't apply at all :(

Can you try the next -mm release?

thanks,

greg k-h


[Devel] Re: [PATCH 13/28] [PREP 13/14] Miscellaneous preparations in pid namespaces

2007-06-25 Thread sukadev
Pavel Emelianov [EMAIL PROTECTED] wrote:
| [EMAIL PROTECTED] wrote:
|  Pavel Emelianov [EMAIL PROTECTED] wrote:
|  | The most important one is moving exit_task_namespaces behind exit_notify
|  | in do_exit() to make it possible to see the task's pid namespace to
|  | properly notify the parent.
|  
|  Hmm. I think we tried this once a few months ago and got a crash in nfsd
|  See http://lkml.org/lkml/2007/1/17/75
|  
|  [c01f6115] lockd_down+0x125/0x190
|  [c01d26bd] nfs_free_server+0x6d/0xd0
|  [c01d8e9c] nfs_kill_super+0xc/0x20
|  [c0161c5d] deactivate_super+0x7d/0xa0
|  [c0175e0e] release_mounts+0x6e/0x80
|  [c0175e86] __put_mnt_ns+0x66/0x80
|  [c0132b3e] free_nsproxy+0x5e/0x60
|  // exit_task_namespaces() after returning from exit_notify()
|  [c011f021] do_exit+0x791/0x810
|  [c011f0c6] do_group_exit+0x26/0x70
|  [c0103142] sysenter_past_esp+0x5f/0x85
|  
|  exit_notify() sets current->sighand to NULL and I think lockd_down() called
|  from exit_task_namespaces/__put_mnt_ns() was accessing current->sighand.
| 
| If sighand is set to NULL and then accessed then how is this related to pid
| namespace?

Switching the order of exit_notify() and exit_task_namespaces() is what
caused the problem when we did it before.

If you exit_task_namespaces() before exit_notify() as the mainline code
does, you won't see this because nfsd would have freed its super by then.
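
The ordering dependency can be modelled with a toy sketch. The `example_` functions are hypothetical stand-ins and not the kernel routines: the notify step clears `sighand`, and the namespace teardown step needs it, so teardown only succeeds when it runs first.

```c
#include <assert.h>
#include <stddef.h>

/* Toy task: sighand stands in for current->sighand. */
struct example_task {
    void *sighand;
    int ns_released;
};

/* Models exit_notify() clearing the signal handler pointer. */
static void example_exit_notify(struct example_task *t)
{
    assert(t);
    t->sighand = NULL;
}

/* Models a teardown path that touches sighand (as lockd_down() appeared
 * to).  Returns -1 where the real code would have crashed. */
static int example_exit_task_namespaces(struct example_task *t)
{
    assert(t);
    if (t->sighand == NULL)
        return -1;
    t->ns_released = 1;
    return 0;
}
```

Calling the teardown stand-in before the notify stand-in returns 0; swapping the two calls returns -1, mirroring the crash in the trace above.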

| 
|  Do your other patches in this set tweak something to prevent it ?
| 
| I think no. I'll check it for my current patches.

Buried in that thread was a test case to repro the problem. Maybe that
will help.

| 
|  That's one of the reasons we had to remove pid_ns from nsproxy and use
|  the pid_ns from pid->upid_list[i]->pid_ns.
|  
|  Suka
|  


[Devel] Re: [PATCH 1/4] sysfs: Remove first pass at shadow directory support

2007-06-25 Thread Eric W. Biederman
Greg KH [EMAIL PROTECTED] writes:

 Ick, no, it isn't and doesn't apply at all :(

Groan.  I wonder what changed this time...

 Can you try the next -mm release?

Ok. As soon as I get back from OLS, I will rebase against
whatever is current.

Eric