Re: MALLOC_STATS: dump internal state and leak info via utrace(2)

2023-04-11 Thread Otto Moerbeek
On Sun, Apr 09, 2023 at 12:17:35PM +0200, Otto Moerbeek wrote:

> On Sun, Apr 09, 2023 at 10:08:25AM +0200, Claudio Jeker wrote:
> 
> > On Sun, Apr 09, 2023 at 09:15:12AM +0200, Otto Moerbeek wrote:
> > > On Sun, Apr 09, 2023 at 08:20:43AM +0200, Otto Moerbeek wrote:
> > > 
> > > > On Sun, Apr 09, 2023 at 07:53:31AM +0200, Sebastien Marie wrote:
> > > > 
> > > > > On Fri, Apr 07, 2023 at 09:52:52AM +0200, Otto Moerbeek wrote:
> > > > > > > Hi,
> > > > > > > 
> > > > > > > This is work in progress. I have to think if the flags to kdump 
> > > > > > > I'm
> > > > > > > introducing should be two or a single one.
> > > > > > > 
> > > > > > > Currently, malloc.c can be compiled with MALLOC_STATS defined. If 
> > > > > > > run
> > > > > > > with option D it dumps its state to a malloc.out file at exit. 
> > > > > > > This
> > > > > > > state can be used to find leaks amongst other things.
> > > > > > > 
> > > > > > > This is not ideal for pledged processes, as they often have no 
> > > > > > > way to
> > > > > > > write files.
> > > > > > > 
> > > > > > > This changes malloc to use utrace(2) for that.
> > > > > > > 
> > > > > > > As kdump has no nice way to show those lines without all extras it
> > > > > > > normally shows, so add two options to it to just show the lines.
> > > > > > > 
> > > > > > > To use, compile and install libc with MALLOC_STATS defined.
> > > > > > > 
> > > > > > > Run :
> > > > > > > 
> > > > > > > $ MALLOC_OPTIONS=D ktrace -tu your_program
> > > > > > > ...
> > > > > > > $ kdump -hu
> > > > > > > 
> > > > > > > Feedback appreciated.
> > > > > 
> > > > > I can't really comment on malloc(3) stuff, but I agree that utrace(2) 
> > > > > is a good 
> > > > > way to get information outside a pledged process.
> > > > > 
> > > > > I tend to think it is safe to use it, as the pledged process need 
> > > > > cooperation 
> > > > > from outside to exfiltrate informations (a process with permission to 
> > > > > call 
> > > > > ktrace(2) on this pid).
> > > > > 
> > > > > Please note it is a somehow generic problem: at least profiled 
> > > > > processes would 
> > > > > also get advantage of using it.
> > > > > 
> > > > > 
> > > > > Regarding kdump options, I think that -u option should implies -h (no 
> > > > > header).
> > > > > 
> > > > > Does it would make sens to considere a process using utrace(2) with 
> > > > > several 
> > > > > interleaved records for different sources ? A process with 
> > > > > MALLOC_OPTIONS=D and 
> > > > > profiling enabled for example ? An (another) option on kdump to 
> > > > > filter on utrace 
> > > > > label would be useful in such case, or have -u mandate a label to 
> > > > > filter on.
> > > > > 
> > > > > $ MALLOC_OPTIONS=D ktrace -tu your_program
> > > > > $ kdump -u mallocdumpline
> > > > > 
> > > > > and for profiling:
> > > > > 
> > > > > $ kdump -u profil > gmon.out
> > > > > $ gprof your_program gmon.out
> > > > > 
> > > > > Thanks.
> > > > 
> > > > Thanks! Your suggestions make a lot of sense. I'll rework the kdump
> > > > part to make it more flexable for different purposes.
> > > 
> > > Anothew aspect of safety is the information availble in the heap
> > > itself. I'm pretty sure the addresses of call sites into malloc are
> > > interesting to attackers. To prevent a program having access to those
> > > (even if they are stored inside the malloc meta data that is not
> > > directly accesible to a program), I'm making sure the recording only
> > > takes place if malloc option D is used.
> > > 
> > > This is included in a diff with the kdump changes and a few other
> > > things below. I'm also compiling with MALLOC_STATS if !SMALL.
> > > 
> > > usage is now:
> > > 
> > > $ MALLOC_OPTIONS=D ktrace -tu a.out 
> > > $ kdump -u malloc
> > > 
> > 
> > I would prefer if every utrace() call is a full line (in other words ulog
> > should be line buffered). It makes the regular kdump output more useable.
> > Right now you depend on kdump -u to put the lines back together.
> 
> Right. Done line buffering in the new diff below.
> 
> > Whenever I used utrace() I normally passed binary objects to the call so I
> > could enrich the ktrace with userland trace info.  So if kdump -u is used
> > for more then just mallocstats it should have a true binary mode. For
> > example gmon.out is a binary format so the above example by semarie@ would
> > not work.  As usual this can be solved in tree once that problem is hit.
> 
> Right, we can do that when needed. 
> 
> This is approaching a state where I'm happy with it. So reviews, tests
> appreciated.
> 
> I would like to have the possibility to record a caller deeper in the
> stack (as many program use wrappers to call malloc), but
> __builtin_return_address only works with a constant argument.  So some
> magic needed. Will return to that later.

New version. By default now only the leak info accumulated over all
pools is dumped. To get the full dump, use MALLOC_OPTIONS=DD


-Otto

Index: lib/libc/stdlib/malloc.3

Re: cleanup vmm_start_vm, simplifying fd cleanup

2023-04-11 Thread Dave Voutila


Philip Guenther  writes:

> On Fri, Apr 7, 2023 at 9:44 AM Dave Voutila  wrote:
> ...
>
>  Touch longer, but won't generate ktrace noise by blind calls to close(2)
>  and also accounts for the other error conditions (EINTR, EIO).
>
>  For EIO, not sure yet how we want to handle it other than log it.
>
>  For EINTR, we want to account for that race and make sure we retry since
>  the vmm process is long-lived and could inadvertently keep things like
>  tty fds or disk image fds open after the guest vm terminates.
>
> So, this is an area where
>  * the current POSIX standard leaves the behavior unspecified
>  * everyone (except HP-UX) made close() *always* invalidate the fd, even if 
> it fails with EINTR
>  * the next version of POSIX requires close() to *not* invalidate the fd if 
> it fails with EINTR:
> * in those cases where something was interrupted but close() invalidates 
> the fd and previously return EINTR
>   they are supposed to now return EINPROGRESS
> * also, a new posix_close() interface and POSIX_CLOSE_RESTART #define, 
> yadda yadda,
>  * OpenBSD has not changed to that behavior or added those interfaces at this 
> time, AND
>  * I'm not sure vmd can even hit this issue due to what it'll be closing!
>
> Last things first: when does close() return EINTR?  My understanding is that 
> the only case that happens in OpenBSD is
> interrupt of a FINAL close (no other dup()s open, etc) of a file on NFS with 
> unflushed writes.  I think it could happen
> on other systems on interrupt of close of a rewinding-tape devices (e.g. 
> /dev/rst0) but I don't actually see how that
> would happen in OpenBSD.  Are any of the fds in this case subject to those 
> sorts of concerns?  If they're all either
> duplicates in a forked process, or file types that can't block in close like 
> that (say, sockets and /dev/vmd devices or
> such), then this is all ignorable.

Right, this *should not* be a final close, *but* there's a race here
because this close(2) happens in the parent post-fork(2). If the child
process dies and any fd cleanup occurs there, the close(2) in the parent
*could* be the final, right?

These are a mix of file types: tap(4) devices, a pty(4) device, and
regular files...which may or may not be on NFS.

So if I'm following correctly, there is the edge case where this could
be a final close of a regular file on NFS, *but* at this point in vmd(8)
there aren't any writes so I think this is ignorable after all.

I'm inclined then to remove the loop. While I have an ok from mlarkin@ I
haven't committed yet due to $dayjob. Loop or no loop? I'll defer to you
guenther@.

>
> Meanwhile, the loop you're writing is correct for a future that OpenBSD has 
> not embraced.  You could kind of future-proof
> it by making the loop conditional on POSIX_CLOSE_RESTART being defined: if 
> it's defined then presumably close() has the
> new POSIX behavior, if not (the current state!) then the loop is useless or 
> actively dangerous, depending on whether the
> process is multi-threaded.

In this case, the vmm process in vmd is single-threaded.

>
> I suspect at some point we'll:
>  * map EINTR to EINPROGRESS in sys_close()
>  * #define POSIX_CLOSE_RESTART 0 /* explicitly blessed by POSIX */
>  * add to libc  int posix_close(int fd, int flags) { int ret = close(fd); 
> return flags ? EINVAL : ret; }
>
> 'cause it'll mean less patching for ports, but someone will need to check for 
> anything in tree that tries to be clever
> about close() returning EINTR.
>
> Overall , a great use of everyone's time. /s
>
> Philip



Re: patch: profiling using utrace(2) (compatible with pledge and unveil)

2023-04-11 Thread Theo de Raadt
Sebastien Marie  wrote:

> For post-execution gmon.out extraction, an helper might be needed: ktrace.out 
> could contain multiple gmon.out files.

kdump can be taught to generate the right file.



patch: profiling using utrace(2) (compatible with pledge and unveil)

2023-04-11 Thread Sebastien Marie
Hi,

After otto@ work on mallocdump using utrace(2), I started to look again at 
profiling (see moncontrol(2)).

The current implementation tries to write a gmon.out file at program exit(3) 
time, which isn't compatible with pledge(2) or unveil(2).

This diff changes the way the runtime profiling information is extracted by 
using utrace(2) (which is permitted while pledged).

The information is collected using ktrace(2), and the gmon.out file could be 
recreated from ktrace.out file post-execution (so without unveil(2) 
restriction).


One place where I needed to cheat a bit is at moncontrol(0) call in _mcleanup() 
function.

moncontrol(0) will stop profiling (change the state to GMON_PROF_OFF), and it 
is 
calling profil(2) to disable program counter profiling. _mcleanup() is the 
atexit(3) handler.

Instead of changing pledge(2) permission for profil(2) (which could go deep in 
the kernel to change the clockrate with setstatclockrate()), I just assumed 
that 
it isn't necessary to disable it here: we are in atexit(3), so the process is 
about to call _exit(2), and the kernel will stop the profiling itself if still 
running.


For post-execution gmon.out extraction, an helper might be needed: ktrace.out 
could contain multiple gmon.out files.

## compile and collect profil information (-tu option on ktrace is optional)
$ cc -static -pg test.c
$ ktrace -di -tu ./a.out

## get gmon.out file
$ kdump -u gmon.out | unvis > gmon.out

## get gmon.out.$name.$pid for multiple processes
##  - first get pid process-name
##  - extract each gmon.out for each pid and store in "gmon.out.$name.$pid" file
$ kdump -tu | sed -ne 's/^ \([0-9][0-9]*\) \([^ ]*\) .*/\1 \2/p' | sort -u \
| while read pid name; do kdump -u gmon.out -p $pid | unvis > 
gmon.out.$name.$pid ; done


kdump diff from otto@ mallocdump is need for 'kdump -u label'.


Feedback would be appreciated.
-- 
Sebastien Marie


diff /home/semarie/repos/openbsd/src
commit - 7639070d00278d5f3d76cbc265d756c39619d7a8
path + /home/semarie/repos/openbsd/src
blob - f09ffd91837040d44fb17a9e8c0197c72ba9459a
file + lib/libc/gmon/gmon.c
--- lib/libc/gmon/gmon.c
+++ lib/libc/gmon/gmon.c
@@ -29,6 +29,7 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -51,6 +52,32 @@ void
 PROTO_NORMAL(moncontrol);
 PROTO_DEPRECATED(monstartup);
 
+static void
+montrace(void *addr, size_t len)
+{
+   for (;;) {
+   if (len < KTR_USER_MAXLEN) {
+   if (utrace("gmon.out", addr, len) == -1)
+   ERR("error on utrace(), truncated gmon.out");
+   return;
+   }
+   if (utrace("gmon.out", addr, KTR_USER_MAXLEN) == -1)
+   ERR("error on utrace(), truncated gmon.out");
+   
+   len -= KTR_USER_MAXLEN;
+   addr += KTR_USER_MAXLEN;
+   }
+}
+
+#ifdef DEBUG
+static int
+monlog(char *msg)
+{
+   size_t len = strlen(msg);
+   return utrace("gmon.log", msg, len);
+}
+#endif
+
 void
 monstartup(u_long lowpc, u_long highpc)
 {
@@ -136,7 +163,6 @@ _mcleanup(void)
 void
 _mcleanup(void)
 {
-   int fd;
int fromindex;
int endfrom;
u_long frompc;
@@ -147,11 +173,8 @@ _mcleanup(void)
struct clockinfo clockinfo;
const int mib[2] = { CTL_KERN, KERN_CLOCKRATE };
size_t size;
-   char *profdir;
-   char *proffile;
-   char  buf[PATH_MAX];
 #ifdef DEBUG
-   int log, len;
+   int len;
char dbuf[200];
 #endif
 
@@ -169,67 +192,16 @@ _mcleanup(void)
clockinfo.profhz = clockinfo.hz;/* best guess */
}
 
-   moncontrol(0);
+   /*
+* Do not call moncontrol(0) (neither profil(2)) as we might be pledged.
+* We are in _mcleanup(), so the process is inside exit(3).
+*/
+   p->state = GMON_PROF_OFF;
 
-   if (issetugid() == 0 && (profdir = getenv("PROFDIR")) != NULL) {
-   char *s, *t, *limit;
-   pid_t pid;
-   long divisor;
-
-   /* If PROFDIR contains a null value, no profiling
-  output is produced */
-   if (*profdir == '\0') {
-   return;
-   }
-
-   limit = buf + sizeof buf - 1 - 10 - 1 -
-   strlen(__progname) - 1;
-   t = buf;
-   s = profdir;
-   while((*t = *s) != '\0' && t < limit) {
-   t++;
-   s++;
-   }
-   *t++ = '/';
-
-   /*
-* Copy and convert pid from a pid_t to a string.  For
-* best performance, divisor should be initialized to
-* the largest power of 10 less than PID_MAX.
-*/
-   pid = getpid();
-   divisor=1;
-   while (divisor > pid) divisor /= 10;/* skip leading zeros */
-   do {
-