2015-04-17 17:45 GMT-03:00 Paulo César Pereira de Andrade
<paulo.cesar.pereira.de.andr...@gmail.com>:
> I have a user with a ksh crashing problem, and that has
> some "Write error: No space left on device" messages
> in /var/log/messages.
>
> After some debugging, and creating a chroot on a file
> disk image, and a test user, and slowly filling the
> "on file" filesystem, e.g.
>
> dd if=/dev/zero of=/mnt/tmp/zerosN bs=1M count=1024
> dd if=/dev/zero of=/mnt/tmp/zerosN bs=1K count=2
>
> until leaving just around 12K, I managed to reproduce the
> problem, and be able to debug it with valgrind and vgdb;
> debugging on these conditions is tricky, as cannot tell
> valgrind to spawn gdb, because then gdb itself would fail
> to start.

  Attached is the best test case I could come to so far,
other than creating a chroot on some kind of (loop)
filesystem and leaving only a few kb free.
Maybe it is possible to somehow get one without needing
LD_PRELOAD, and using /dev/full somehow.

  To test:

$ gcc -fPIC  -shared write.c -o write.so -ldl -D_GNU_SOURCE=1
$ LD_PRELOAD=$PWD/write.so ksh -l
/etc/profile: line 45: write to 3 failed [No space left on device]
Memory fault(coredump)

If using the proposed patch, it will hang in a pause call:

$ LD_PRELOAD=$PWD/write.so
/home/pcpa/rhel/ksh/ksh-20120801/arch/linux.i386-64/bin/ksh -l
/etc/profile: line 45: write to 3 failed [No space left on device]
^C$

and after ^C it returns to the main shell.

> So, after following the code enough, I learned that at places
> it handles SH_JMPEXIT, there was almost non existing
> handling of SH_JMPERREXIT.
>
> ksh would evently cause a crash due to the struct
> subshell allocated on stack, in sh/subshell.c:sh_subshell
> kept set to the global subshell_data, after it siglongjmp
> back the stack due to, not fully handling the out of disk
> space errors. It would print a few messages, everytime
> a pipe was created, e.g.:
>
> /etc/profile: line 28: write to 3 failed [No space left on device]
>
> until eventually crashing due to corrupted memory; e.g. the
> references to stack data from sh_subsell in the global
> subshell_data. One strange thing to me in coredump analysis
> was that subshell_data prev field was pointing to itself when
> it eventually crashed, what later was understood and expected...
>
> The attached patch handles SH_JMPERREXIT in the code
> paths SH_JMPEXIT is handled, and the failed login, on
> full disk, ends in a pause() call:
>
> ---terminal 1---
> $ valgrind -q --leak-check=full --free-fill=0x5a --vgdb=full
> --vgdb-error=0 /bin/ksh -l
> ==17730== (action at startup) vgdb me ...
> ==17730==
> ==17730== TO DEBUG THIS PROCESS USING GDB: start GDB like this
> ==17730==   /path/to/gdb /bin/ksh
> ==17730== and then give GDB the following command
> ==17730==   target remote | /usr/lib64/valgrind/../../bin/vgdb --pid=17730
> ==17730== --pid is optional if only one valgrind process is running
> ==17730==
> ==17730== Syscall param mount(type) points to unaddressable byte(s)
> ==17730==    at 0x563377A: mount (in /usr/lib64/libc-2.17.so)
> ==17730==    by 0x493E58: fs3d_mount (fs3d.c:115)
> ==17730==    by 0x493C8B: fs3d (fs3d.c:57)
> ==17730==    by 0x423E41: sh_init (init.c:1302)
> ==17730==    by 0x405CD3: sh_main (main.c:141)
> ==17730==    by 0x405B84: main (pmain.c:45)
> ==17730==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
> ==17730==
> ==17730== (action on error) vgdb me ...
> ==17730== Continuing ...
> /etc/profile: line 28: write to 3 failed [No space left on device]
> ---8<---
>
> ---terminal 2---
> (gdb) c
> Continuing.
> ^C
> Program received signal SIGTRAP, Trace/breakpoint trap.
> 0x00000000055fa470 in __pause_nocancel () from /lib64/libc.so.6
> (gdb) bt
> #0  0x00000000055fa470 in __pause_nocancel () from /lib64/libc.so.6
> #1  0x000000000041e73d in sh_done (ptr=0x793360 <sh>, sig=255) at
> /home/pcpa/rhel/ksh/ksh-20120801/src/cmd/ksh93/sh/fault.c:665
> #2  0x0000000000407407 in exfile (shp=0x4542, iop=0xff, fno=0) at
> /home/pcpa/rhel/ksh/ksh-20120801/src/cmd/ksh93/sh/main.c:604
> #3  0x0000000000405c43 in sh_source (shp=0x793360 <sh>, iop=0x0,
> file=0x524804 <e_sysprofile> "/etc/profile")
>     at /home/pcpa/rhel/ksh/ksh-20120801/src/cmd/ksh93/sh/main.c:109
> #4  0x00000000004060e4 in sh_main (ac=2, av=0xfff000498, userinit=0x0)
> at /home/pcpa/rhel/ksh/ksh-20120801/src/cmd/ksh93/sh/main.c:202
> #5  0x0000000000405b85 in main (argc=2, argv=0xfff000498) at
> /home/pcpa/rhel/ksh/ksh-20120801/src/cmd/ksh93/sh/pmain.c:45
> (gdb)
> ---8<---
>
> Thanks,
> Paulo
// gcc -fPIC  -shared write.c -o write.so -ldl -D_GNU_SOURCE=1

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <dlfcn.h>
#include <unistd.h>
#include <errno.h>

ssize_t write(int fd, const void *buf, size_t count)
{
    static ssize_t (*real_write)(int, const void*, size_t) = (void*)0;
    if (!real_write)
	real_write = dlsym(RTLD_NEXT, "write");
    if (fd == 3) {
	errno = ENOSPC;
	return -1;
    }
    return real_write(fd, buf, count);
}
_______________________________________________
ast-users mailing list
ast-users@lists.research.att.com
http://lists.research.att.com/mailman/listinfo/ast-users

Reply via email to