Re: [Lxc-users] Kernel 2.6.33-rc6, 3 bugs container specific.

2010-02-04 Thread Daniel Lezcano
Serge E. Hallyn wrote:
 Quoting Daniel Lezcano (daniel.lezc...@free.fr):
   
 Serge E. Hallyn wrote:
 
 Quoting Jean-Marc Pigeon (j...@safe.ca):
   
 Hello,


 
 I was wondering out loud about the best design to solve his problem.

 If we try to redirect kernel-generated messages to containers, we have
 several problems, including whether we need to duplicate the messages
 to the host container.  So in one sense it seems more flexible to
   1. send everything to host syslog
   
No, if we do that, messages from all CONTs will land in
the same bucket and will be difficult to sort out.
The CONT sys_admin and the HOST sys_admin could be different
entities, so while you debug the CONT config, critically
needed information ends up on HOST (which you do not have
access to).
 
 Yes, so a privileged task on HOST must pass that information back to
 you on CONT.  That is not a valid complaint imo.  But how to sort the
 msgs out is a valid question.

 We need some sort of identifier, unique system-wide, attached to...
 something.
 Is ifindex unique system-wide right now?  Oh, IIRC it is, but we want it to
 be containerized, so that would be a bad choice :)

   
   2. clamp down on syslog use by processes not in the init_user_ns
   
Could you give me more detail?
 
 Simplest choices would be to just refuse sys_syslog() and open(/proc/kmsg)
 altogether from a container, or to only allow reading/writing messages
 to own syslog.  (I had hoped to find time to try out the second option but
 simply haven't had the time, and it doesn't look like I will very soon.
 So if anyone else wants to, pls jump at it...)

 Then /proc/kmsg can provide what I described above through a FUSE file,
 and if, as you mentioned, the container unmounts the FUSE fs and gets
 to real procfs, they just get nothing.

   
   3. let the userspace on the host copy messages into a socket or
  file so child container can pretend it has real syslog.
   
So you trap printk messages from CONT on the HOST and
redirect them to CONT, but over a standard syslog channel.
Seems OK to me, as long as /proc/kmsg does not exist
(or is /dev/null) in the CONT file tree.
 
 We have:
* Commands to sys_syslog:
*
*  0 -- Close the log.  Currently a NOP.
*  1 -- Open the log. Currently a NOP.
*  2 -- Read from the log.
*  3 -- Read all messages remaining in the ring buffer.
*  4 -- Read and clear all messages remaining in the ring buffer
*  5 -- Clear ring buffer.
*  6 -- Disable printk to console
*  7 -- Enable printk to console
*  8 -- Set level of messages printed to console
*  9 -- Return number of unread characters in the log buffer
* 10 -- Return size of the log buffer

 And add:
   * 11 -- create a new ring buffer for the current process
           and its children


 We have, let's say, a global ring buffer that is kept untouched and is
 used by syslog(2) and printk. When we create a new ring buffer, we
 allocate it and assign it to the nsproxy (the global ring buffer is the
 default in the nsproxy).

 printk keeps writing to the global ring buffer, and syslog(2)
 writes to the namespaced ring buffer.

 Does it make sense?
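
To make the proposal above concrete, here is a minimal user-space sketch in C. It is only a model, not kernel code: struct ring, struct nsproxy_model, global_ring, ring_write(), ring_read() and syslog_new_ring() are hypothetical names invented for illustration. It assumes printk always targets the global buffer, syslog(2) acts on whatever buffer the nsproxy points to, and the proposed command 11 swaps in a freshly allocated, empty buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define RING_LEN 64u  /* power of two, so indices can be masked */

struct ring {
    char buf[RING_LEN];
    unsigned int start;  /* index of the oldest unread character */
    unsigned int end;    /* index one past the newest character */
};

/* Hypothetical stand-in for the kernel's nsproxy. */
struct nsproxy_model {
    struct ring *syslog;  /* defaults to &global_ring */
};

struct ring global_ring;  /* in this model, printk always writes here */

/* Append a string, overwriting the oldest data once the ring is full. */
void ring_write(struct ring *r, const char *s)
{
    assert(r != NULL && s != NULL);
    for (; *s != '\0'; s++) {
        r->buf[r->end & (RING_LEN - 1)] = *s;
        r->end++;
        if (r->end - r->start > RING_LEN)
            r->start = r->end - RING_LEN;
    }
}

/* Consume unread characters into out (NUL-terminated); returns the count. */
size_t ring_read(struct ring *r, char *out, size_t len)
{
    size_t n = 0;

    if (len == 0)
        return 0;
    while (r->start != r->end && n + 1 < len) {
        out[n++] = r->buf[r->start & (RING_LEN - 1)];
        r->start++;
    }
    out[n] = '\0';
    return n;
}

/* Model of the proposed command 11: give this nsproxy a private ring. */
int syslog_new_ring(struct nsproxy_model *ns)
{
    struct ring *r = calloc(1, sizeof(*r));

    if (r == NULL)
        return -1;
    ns->syslog = r;
    return 0;
}
```

In this model, a container that has called syslog_new_ring() reads nothing from its own buffer, while the host still reads everything printk wrote to global_ring.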
 

 Yeah, it's a nice alternative.  Though (1) there is something to be said for
 forcing a new ring buffer upon clone(CLONE_NEWUSER), and (2) assuming the
 new ring buffer is pointed to from nsproxy, it might be frowned upon to do
 an unshare/clone action in yet another way.
   
Why do you want to tie clone(CLONE_NEWUSER) to a new ring buffer?
I mean, one may want to use CLONE_NEWUSER but keep the ring buffer, no?
 I still think our first concern should be safety, and that we should consider
 just adding 'struct syslog_struct' to nsproxy, and making that NULL on a
 clone(CLONE_NEWUSER).  any sys_syslog() or /proc/kmsg access returns -EINVAL
 after that.  Then we can discuss whether and how to target printks to
 namespaces, and whether duplicates should be sent to parent namespaces.
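
The safety-first step described above can be sketched as a small user-space model in C. Everything here is an illustrative assumption rather than the kernel's API: MODEL_CLONE_NEWUSER is a placeholder flag value, and struct syslog_model, struct nsproxy_model, clone_nsproxy() and do_syslog_model() are hypothetical names. Only commands 0, 1 and 9 are modelled.

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_CLONE_NEWUSER 0x1UL  /* placeholder, not the kernel's flag value */

struct syslog_model {
    unsigned int unread;  /* unread characters, as reported by command 9 */
};

struct nsproxy_model {
    struct syslog_model *syslog;  /* NULL means: no syslog access at all */
};

/* Copy the parent's nsproxy; a new user namespace loses its syslog pointer. */
struct nsproxy_model clone_nsproxy(const struct nsproxy_model *parent,
                                   unsigned long flags)
{
    struct nsproxy_model child = *parent;

    if (flags & MODEL_CLONE_NEWUSER)
        child.syslog = NULL;
    return child;
}

/* Model of sys_syslog(): refuse everything when there is no syslog. */
int do_syslog_model(const struct nsproxy_model *ns, int type)
{
    if (ns->syslog == NULL)
        return -EINVAL;

    switch (type) {
    case 0:  /* close: NOP */
    case 1:  /* open: NOP */
        return 0;
    case 9:  /* number of unread characters */
        return (int)ns->syslog->unread;
    default:
        return -ENOSYS;  /* not modelled in this sketch */
    }
}
```

A task cloned without the flag keeps seeing the parent's buffer; a task cloned with it gets -EINVAL for every command, which matches the "refuse first, get flexible later" ordering.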
   
It makes sense to do it step by step. Targeting the printk is the more
difficult part, no? I mean, you always need the destination namespace
available, which is not obvious when printk is called from
interrupt context.

 After we start getting flexible with syslog, the next request will be for
 audit flexibility.  I don't even know how our netlink support suffices for
 that right now.

 (So, this all does turn into a big deal...)
   
Mmh ... right.


Re: [Lxc-users] [PATCH RFC][gross hack] containerized syslog

2010-02-04 Thread Serge E. Hallyn
Another reason this is bogus, confirmed by Jean-Marc's testing:
printk is called too often without valid 'current', i.e. when
network packets arrive and are processed.  So we should send output
of sys_syslog() to current's ring buffer, all printks to the
init_syslog_ns, and then we can use ns_printk(syslog_ns, fmt, ...)
for targeted printks.
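
As a rough illustration of that split, here is a user-space sketch in C. It is not the patch's code: struct syslog_ns_model, init_syslog_ns_model, printk_model() and ns_printk_model() are hypothetical names, the flat append-only buffer stands in for a real ring buffer, and duplicating targeted messages into the initial namespace is an assumption borrowed from the current behaviour of the quoted patch.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define LOG_LEN 256

struct syslog_ns_model {
    char buf[LOG_LEN];  /* flat log; a real implementation would wrap */
    size_t len;
};

struct syslog_ns_model init_syslog_ns_model;  /* models init_syslog_ns */

/* Append msg, silently dropping whatever does not fit. */
static void log_append(struct syslog_ns_model *ns, const char *msg)
{
    size_t room = LOG_LEN - 1 - ns->len;
    size_t n = strlen(msg);

    if (n > room)
        n = room;
    memcpy(ns->buf + ns->len, msg, n);
    ns->len += n;
    ns->buf[ns->len] = '\0';
}

static int vns_printk_model(struct syslog_ns_model *ns, const char *fmt,
                            va_list ap)
{
    char line[128];
    int n = vsnprintf(line, sizeof(line), fmt, ap);

    if (n < 0)
        return n;
    log_append(ns, line);
    /* Assumed policy: targeted messages are duplicated to the init ns. */
    if (ns != &init_syslog_ns_model)
        log_append(&init_syslog_ns_model, line);
    return n;
}

/* Untargeted printk: needs no valid 'current', always goes to the init ns. */
int printk_model(const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vns_printk_model(&init_syslog_ns_model, fmt, ap);
    va_end(ap);
    return n;
}

/* Targeted printk: the caller names the destination namespace explicitly. */
int ns_printk_model(struct syslog_ns_model *ns, const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vns_printk_model(ns, fmt, ap);
    va_end(ap);
    return n;
}
```

The point of passing the namespace explicitly is that the caller never has to consult 'current', so the same call works from contexts such as packet processing.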

And, as discussed on irc, we'll make syslog_ns a full namespace
in its own right, use the last remaining clone flag (if there is
one) or build on top of eclone().  It'd be nicer to have a 'real'
clone flag so we can also unshare(CLONE_NEWLOG).

-serge

Quoting Serge E. Hallyn (se...@us.ibm.com):
 Provide each user namespace with its own syslog ringbuffer.
 
 So you can do
   ns_exec -cU /bin/bash
   dmesg
 and see nothing.  Root in a container (with private user namespace)
 cannot clear the host's ring buffer.
 
 Since containers do not have a notion of consoles at present,
 only the initial user namespace deals with console output or
 with the console-related syslog commands.
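
A minimal sketch of that rule in C, using the command numbers from the sys_syslog list quoted earlier in the thread (6, 7 and 8 are the console-related ones). The function names are hypothetical, and returning -EPERM to a non-initial namespace is an assumption made for illustration; the description above does not say what the patch returns.

```c
#include <errno.h>
#include <stdbool.h>

/* Commands 6, 7 and 8 disable, enable and tune printk-to-console. */
bool is_console_cmd(int cmd)
{
    return cmd >= 6 && cmd <= 8;
}

/*
 * Returns 0 if the command may proceed, or a negative errno.
 * Assumption: console commands from a non-initial namespace get -EPERM.
 */
int syslog_cmd_check(int cmd, bool in_init_ns)
{
    if (cmd < 0 || cmd > 10)
        return -EINVAL;  /* unknown command number */
    if (is_console_cmd(cmd) && !in_init_ns)
        return -EPERM;
    return 0;
}
```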
 
 This opens the door to targeting printk at certain syslog
 namespaces.  It's not safe to be applied - it's a quick-n-dirty
 hack and won't even compile for CONFIG_PRINTK=n.  Also I've not decided
 what to do about duplication of printks to init_user_ns so for
 now emit_one_char always duplicates to init_user_ns.  We probably
 want to be smarter about this and output a prefix indicating the
 target.
 
 But I figured discussions about the API would be more meaningful
 with a testable patch.
 
 ---
  fs/proc/kmsg.c                 |    5 +-
  include/linux/user_namespace.h |    2 +
  kernel/printk.c                |  225 ++--
  kernel/user.c                  |    4 +
  kernel/user_namespace.c        |   13 +++
  5 files changed, 168 insertions(+), 81 deletions(-)
 
 diff --git a/fs/proc/kmsg.c b/fs/proc/kmsg.c
 index 7ca7834..2746b70 100644
 --- a/fs/proc/kmsg.c
 +++ b/fs/proc/kmsg.c
 @@ -12,11 +12,12 @@
  #include <linux/poll.h>
  #include <linux/proc_fs.h>
  #include <linux/fs.h>
 +#include <linux/syslog.h>
 
  #include <asm/uaccess.h>
  #include <asm/io.h>
 
 -extern wait_queue_head_t log_wait;
 +extern struct syslog_ns init_syslog_ns;
 
  extern int do_syslog(int type, char __user *bug, int count);
 
 @@ -41,7 +42,7 @@ static ssize_t kmsg_read(struct file *file, char __user *buf,
 
  static unsigned int kmsg_poll(struct file *file, poll_table *wait)
  {
 - poll_wait(file, &log_wait, wait);
 + poll_wait(file, &init_syslog_ns.wait, wait);
   if (do_syslog(9, NULL, 0))
   return POLLIN | POLLRDNORM;
   return 0;
 diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
 index cc4f453..3926c89 100644
 --- a/include/linux/user_namespace.h
 +++ b/include/linux/user_namespace.h
 @@ -5,6 +5,7 @@
  #include <linux/nsproxy.h>
  #include <linux/sched.h>
  #include <linux/err.h>
 +#include <linux/syslog.h>
 
  #define UIDHASH_BITS (CONFIG_BASE_SMALL ? 3 : 8)
  #define UIDHASH_SZ   (1 << UIDHASH_BITS)
 @@ -14,6 +15,7 @@ struct user_namespace {
   struct hlist_head   uidhash_table[UIDHASH_SZ];
   struct user_struct  *creator;
   struct work_struct  destroyer;
 + struct syslog_ns *syslog;
  };
 
  extern struct user_namespace init_user_ns;
 diff --git a/kernel/printk.c b/kernel/printk.c
 index 1751c45..5b93447 100644
 --- a/kernel/printk.c
 +++ b/kernel/printk.c
 @@ -35,9 +35,18 @@
  #include <linux/kexec.h>
  #include <linux/ratelimit.h>
  #include <linux/kmsg_dump.h>
 +#include <linux/user_namespace.h>
 
  #include <asm/uaccess.h>
 
 +struct syslog_ns init_syslog_ns;
 +#define g_log_wait (init_syslog_ns.wait)
 +#define g_log_start (init_syslog_ns.start)
 +#define g_log_end (init_syslog_ns.end)
 +#define g_log_buf_len (init_syslog_ns.buf_len)
 +#define g_logged_chars (init_syslog_ns.logged_chars)
 +#define g_log_buf (init_syslog_ns.buf)
 +
  /*
   * for_each_console() allows you to iterate on each console
   */
 @@ -52,6 +61,7 @@ void asmlinkage __attribute__((weak)) early_printk(const char *fmt, ...)
  }
 
  #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
 +#define  CONTAINER_BUF_LEN 4096
 
  /* printk's without a loglevel use this.. */
  #define DEFAULT_MESSAGE_LOGLEVEL 4 /* KERN_WARNING */
 @@ -60,8 +70,6 @@ void asmlinkage __attribute__((weak)) early_printk(const char *fmt, ...)
  #define MINIMUM_CONSOLE_LOGLEVEL 1 /* Minimum loglevel we let people use */
  #define DEFAULT_CONSOLE_LOGLEVEL 7 /* anything MORE serious than KERN_DEBUG */
 
 -DECLARE_WAIT_QUEUE_HEAD(log_wait);
 -
  int console_printk[4] = {
   DEFAULT_CONSOLE_LOGLEVEL,   /* console_loglevel */
   DEFAULT_MESSAGE_LOGLEVEL,   /* default_message_loglevel */
 @@ -98,22 +106,20 @@ EXPORT_SYMBOL_GPL(console_drivers);
  static int console_locked, console_suspended;
 
  /*
 - * logbuf_lock protects log_buf, log_start, log_end, con_start and logged_chars
 + * logbuf_lock protects g_log_buf, g_log_start, g_log_end,