Re: [nvi-iconv]Call for test

2011-08-17 Thread Zhihao Yuan
I totally hate gmail's reply -- I enabled Reply to all by default
but I still got things wrong.

Anyway, in short, the problem is caused by the lack of a widechar-enabled
regex. I ported the one used by nvi-devel-1.8x.

A new patch has been uploaded at
https://github.com/downloads/lichray/nvi2/nvi2-freebsd-2011-08-17.diff.gz
and I have tested it with make buildworld.

Note that this version sets WARNS=1 in vi's Makefile, since it is
warning-free with clang and gcc.

There is also a change to `rescue`'s compilation: it now links to
libcursesw if WITH_ICONV is on.

On Tue, Aug 16, 2011 at 5:56 PM, Test Rat ttse...@gmail.com wrote:
 Zhihao Yuan lich...@gmail.com writes:

 On Sun, Aug 14, 2011 at 10:39 AM, Zhihao Yuan lich...@gmail.com wrote:
 Hi, hackers:

 My GSoC2011 project, Multibyte Encoding Support in Nvi is ready for
 testing. The proposal of the project is here:
 http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/zy/1
 [...]
 Let me try to ``quickly'' explain how to get involved in the testing.

 First, download the patch from
 https://github.com/downloads/lichray/nvi2/nvi2-freebsd-2011-08-14.diff.gz

 It breaks buildworld for me, e.g.

  $ make all -C share/termcap
  gzip -cn /usr/src/share/termcap/termcap.5 > termcap.5.gz
  TERM=dumb TERMCAP=dumb: ex - /usr/src/share/termcap/termcap.src < /usr/src/share/termcap/reorder
  Error: stderr: Inappropriate ioctl for device
  script, 3: Destination line is inside move range
  *** Error code 1

 and crashes when no WITH_ICONV is defined. Can you confirm?

  Starting program: /usr/bin/ex - /usr/src/share/termcap/termcap.src < /usr/src/share/termcap/reorder

  Program received signal SIGSEGV, Segmentation fault.
  0x000800be7760 in ?? ()
  (gdb) bt
  #0  0x000800be7760 in ?? ()
  #1  0x0044092b in ex_writefp (sp=0x801106800, name=0x801103148 "termcap", fp=0x800e51d90, fm=0x801007ca8, tm=0x801007cb8,
      nlno=0x7fffc808, nch=0x7fffc800, silent=0) at /usr/src/usr.bin/vi/../../contrib/nvi2/ex/ex_write.c:321
  #2  0x0040bfb2 in file_write (sp=0x801106800, fm=0x801007ca8, tm=0x801007cb8, name=0x801103148 "termcap", flags=21)
      at /usr/src/usr.bin/vi/../../contrib/nvi2/common/exf.c:924
  #3  0x00440739 in exwr (sp=0x801106800, cmdp=0x801007be8, cmd=WRITE)
      at /usr/src/usr.bin/vi/../../contrib/nvi2/ex/ex_write.c:264
  #4  0x004400d2 in ex_write (sp=0x801106800, cmdp=0x801007be8) at /usr/src/usr.bin/vi/../../contrib/nvi2/ex/ex_write.c:91
  #5  0x00422b78 in ex_cmd (sp=0x801106800) at /usr/src/usr.bin/vi/../../contrib/nvi2/ex/ex.c:1375
  #6  0x0041f788 in ex (spp=0x7fffd040) at /usr/src/usr.bin/vi/../../contrib/nvi2/ex/ex.c:133
  #7  0x00412377 in editor (gp=0x801007b00, argc=1, argv=0x7fffd268)
      at /usr/src/usr.bin/vi/../../contrib/nvi2/common/main.c:424
  #8  0x0040513f in main (argc=3, argv=0x7fffd258) at /usr/src/usr.bin/vi/../../contrib/nvi2/cl/cl_main.c:123
  (gdb) bt f
  #0  0x000800be7760 in ?? ()
  No symbol table info available.
  #1  0x0044092b in ex_writefp (sp=0x801106800, name=0x801103148 "termcap", fp=0x800e51d90, fm=0x801007ca8, tm=0x801007cb8,
      nlno=0x7fffc808, nch=0x7fffc800, silent=0) at /usr/src/usr.bin/vi/../../contrib/nvi2/ex/ex_write.c:321
          sb = {
    st_dev = 4294955600,
    st_ino = 32767,
    st_mode = 0,
    st_nlink = 0,
    st_uid = 0,
    st_gid = 1944,
    st_rdev = 0,
    st_atim = {
      tv_sec = 3,
      tv_nsec = 34377892735
    },
    st_mtim = {
      tv_sec = 6798080,
      tv_nsec = 4096
    },
    st_ctim = {
      tv_sec = 0,
      tv_nsec = 34366769152
    },
    st_size = 4,
    st_blocks = 140737488343672,
    st_blksize = 3,
    st_flags = 0,
    st_gen = 4294954160,
    st_lspare = 32767,
    st_birthtim = {
      tv_sec = 34366543277,
      tv_nsec = 582
    }
  }
          gp = (GS *) 0x801007b00
          ccnt = 0
          fline = 1
          tline = 4666
          lcnt = 0
          len = 140737488340496
          rval = -11656
          msg = 0x46f540 "253|Writing..."
          p = (CHAR_T *) 0xb <Error reading address 0xb: Bad address>
          f = 0x800c0981f "H\211A^]\017\037\204"
          flen = 140737488340168
          isutf16 = 0
  #2  0x0040bfb2 in file_write (sp=0x801106800, fm=0x801007ca8, tm=0x801007cb8, name=0x801103148 "termcap", flags=21)
      at /usr/src/usr.bin/vi/../../contrib/nvi2/common/exf.c:924
          mtype = OLDFILE
          sb = {
    st_dev = 745804815,
    st_ino = 70073,
    st_mode = 33188,
    st_nlink = 1,
    st_uid = 1001,
    st_gid = 1001,
    st_rdev = 4294967295,
    st_atim = {
      tv_sec = 1313534959,
      tv_nsec = 905174484
    },
    st_mtim = {
      tv_sec = 1313535150,
      tv_nsec = 420174354
    },
    st_ctim = {
      tv_sec = 1313535150,
      tv_nsec = 420174354
    },
    st_size = 0,
    st_blocks = 1,
    st_blksize = 131072,
    st_flags = 0,
    st_gen = 0,
    st_lspare = 0,
    st_birthtim = {
    

Re: building my own release images from -CURRENT

2011-08-17 Thread Sean Bruno
On Tue, 2011-08-16 at 15:15 -0700, Test Rat wrote:
 Sean Bruno sean...@yahoo-inc.com writes:
 
  Just trying some hackery with building my own release images from
  -CURRENT today.
 
  I've built world and my kernel, when I enter release and make memstick
  I get the following:
 
  *** Creating the temporary root environment in /var/tmp/temproot
   *** /var/tmp/temproot ready for use
   *** Creating and populating directory structure in /var/tmp/temproot
 
 
   *** FATAL ERROR: Cannot 'cd' to /usr/src and install files to
       the temproot environment
 
  *** Error code 1
 
  Stop in /dumpster/scratch/sbruno-scratch/test/release.
 
 Where does /usr/src point? Have you tried the patch in misc/159666?

Well, wow.  Ok, misc/159666 does indeed fix this problem.  I'm very
confused by the fact that it has not come up before.

Let me cc r...@freebsd.org and see if we can get a commit for this.

Sean


___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: debugging frequent kernel panics on 8.2-RELEASE

2011-08-17 Thread Andriy Gapon

Thanks to the debug that Steven provided and to the help that I received from
Kostik, I think that now I understand the basic mechanics of this panic, but,
unfortunately, not the details of its root cause.

It seems like everything starts with some kind of a race between terminating
processes in a jail and termination of the jail itself.  This is where the
details are very thin so far.  What we see is that a process (http) is in the
exit(2) syscall, in the exit1() function actually, past the place where the
P_WEXIT flag is set and even past the place where p_limit is freed and reset to
NULL.  At that point the thread calls prison_proc_free(), which calls
prison_deref().  Then, in prison_deref(), the thread gets a page fault because
of what seems like a NULL pointer dereference.  That's just the start of the
problem and its root cause.

Then, trap_pfault() gets invoked and, because addresses close to NULL look like
userspace addresses, vm_fault/vm_fault_hold gets called, which in turn calls
vm_map_growstack.  The first thing that vm_map_growstack does is call
lim_cur(), but because p_limit is already NULL, that call results in a NULL
pointer dereference and a page fault.  Goto the beginning of this paragraph.

So we get this recursion of sorts, which only ends when the stack is exhausted
and the CPU generates a double fault.

So, of course, Steven is interested in finding and fixing the root cause.  I
hope we will get to that with some help from the prison guards :-)

But I also would like to use this opportunity to discuss how we can make it
easier to debug issues such as this one.  I think that this problem demonstrates
that when we treat certain junk kernel address values as userland addresses, we
throw additional heaps of irrelevant stuff on top of the actual problem.  One
solution could be to use a special flag that would mark all actual attempts to
access userland addresses (e.g. setting the flag on entrance to copyin and
clearing it upon return), so that in the page fault handler we could distinguish
actual faults on userland addresses from faults on garbage kernel addresses.  I
am sure that there could be other clever techniques to catch such garbage
addresses early.

-- 
Andriy Gapon


Re: debugging frequent kernel panics on 8.2-RELEASE

2011-08-17 Thread Kostik Belousov
On Wed, Aug 17, 2011 at 11:21:42PM +0300, Andriy Gapon wrote:
[skip]

 But I also would like to use this opportunity to discuss how we can
 make it easier to debug issues such as this one. I think that this
 problem demonstrates that when we treat certain junk kernel address
 values as userland addresses, we throw additional heaps of irrelevant
 stuff on top of the actual problem. One solution could be to use a
 special flag that would mark all actual attempts to access userland
 addresses (e.g. setting the flag on entrance to copyin and clearing it
 upon return), so that in the page fault handler we could distinguish
 actual faults on userland addresses from faults on garbage kernel
 addresses. I am sure that there could be other clever techniques to
 catch such garbage addresses early.

We already have such a mechanism: kernel code that is aware it accesses
usermode pages sets pcb_onfault. See the end of the trap_pfault() handler.
In fact, we can catch the problem earlier, before even calling vm_fault().

BTW, I think this is especially useful in combination with support
for SMEP in recent Intel CPUs.

commit 2e1b36fa93f9499e37acf04a66ff0646d4f13536
Author: Konstantin Belousov kos...@pooma.home
Date:   Thu Aug 18 00:08:50 2011 +0300

    Assert that the exiting process does not return to usermode.
    On x86, do not call vm_fault() when the kernel is not prepared
    to handle unsuccessful page fault.

diff --git a/sys/amd64/amd64/trap.c b/sys/amd64/amd64/trap.c
index 4e5f8b8..55e1e5a 100644
--- a/sys/amd64/amd64/trap.c
+++ b/sys/amd64/amd64/trap.c
@@ -674,6 +674,19 @@ trap_pfault(frame, usermode)
 			goto nogo;
 
 		map = &vm->vm_map;
+
+		/*
+		 * When accessing a usermode address, kernel must be
+		 * ready to accept the page fault, and provide a
+		 * handling routine.  Since accessing the address
+		 * without the handler is a bug, do not try to handle
+		 * it normally, and panic immediately.
+		 */
+		if (!usermode && (td->td_intr_nesting_level != 0 ||
+		    PCPU_GET(curpcb)->pcb_onfault == NULL)) {
+			trap_fatal(frame, eva);
+			return (-1);
+		}
 	}
 
 	/*
diff --git a/sys/i386/i386/trap.c b/sys/i386/i386/trap.c
index 5a8016c..e6d2b5a 100644
--- a/sys/i386/i386/trap.c
+++ b/sys/i386/i386/trap.c
@@ -831,6 +831,11 @@ trap_pfault(frame, usermode, eva)
 			goto nogo;
 
 		map = &vm->vm_map;
+		if (!usermode && (td->td_intr_nesting_level != 0 ||
+		    PCPU_GET(curpcb)->pcb_onfault == NULL)) {
+			trap_fatal(frame, eva);
+			return (-1);
+		}
 	}
 
 	/*
diff --git a/sys/kern/subr_trap.c b/sys/kern/subr_trap.c
index 3527ed1..a69b7b8 100644
--- a/sys/kern/subr_trap.c
+++ b/sys/kern/subr_trap.c
@@ -99,6 +99,8 @@ userret(struct thread *td, struct trapframe *frame)
 
 	CTR3(KTR_SYSC, "userret: thread %p (pid %d, %s)", td, p->p_pid,
             td->td_name);
+	KASSERT((p->p_flag & P_WEXIT) == 0,
+	    ("Exiting process returns to usermode"));
 #if 0
 #ifdef DIAGNOSTIC
 	/* Check that we called signotify() enough. */




Re: debugging frequent kernel panics on 8.2-RELEASE

2011-08-17 Thread Steven Hartland
- Original Message - 
From: Andriy Gapon a...@freebsd.org



 Thanks to the debug that Steven provided and to the help that I received from
 Kostik, I think that now I understand the basic mechanics of this panic, but,
 unfortunately, not the details of its root cause.

 It seems like everything starts with some kind of a race between terminating
 processes in a jail and termination of the jail itself.  This is where the
 details are very thin so far.  What we see is that a process (http) is in the
 exit(2) syscall, in the exit1() function actually, past the place where the
 P_WEXIT flag is set and even past the place where p_limit is freed and reset to
 NULL.  At that point the thread calls prison_proc_free(), which calls
 prison_deref().  Then, in prison_deref(), the thread gets a page fault because
 of what seems like a NULL pointer dereference.  That's just the start of the
 problem and its root cause.


That's interesting. Are you using http as an example, or is that something that's
been gleaned from debugging our output? I ask because there's only one process
running in each of our jails, and that's a single java process.

Given your description, there may be something I can add that could help
clarify what the cause is.

In a nutshell, the jail manager we're using will attempt to resurrect the jail
from the dying state in a few specific scenarios.

Here's an example:
1. A jail restart is requested.
2. The jail is stopped, so the java process is killed off, but active TCP sessions
may prevent the timely full shutdown of the jail.
3. If an existing jail is detected, i.e. a dying jail from #2, instead of
starting a new jail we attach to the old one and exec the new java process.
4. If an existing jail isn't detected, i.e. where there were no hanging TCP
sessions and #2 cleanly shut down the jail, a new jail is created and attached
to, and the java process is exec'ed.

The system uses static jail IDs, so it's possible to determine whether a jail
for this service already exists. This prevents duplicate services and also
makes services easy to identify by their jail ID.

So could what we're seeing be a race between the jail shutdown and the attach
of the new process?

man 2 jail seems to indicate that this is a valid use case for jail_set, as it
documents JAIL_DYING as a valid option for flags, but I suspect it's quite out
of the ordinary to actually do, which may be why this panic hasn't been seen
before now.

As some background, the reason we use static jail IDs is to ensure that only
one instance of the jailed service is running, and the reason we re-attach to
the dying jail is so that jails can be restarted in a timely manner. Without the
re-attach we would need to wait for all aborted TCP sessions to time out.


 So, of course, Steven is interested in finding and fixing the root cause.  I
 hope we will get to that with some help from the prison guards :-)


Does the above potentially explain how we're getting to the situation
which generates the panic?

If so, we can certainly look at alternatives to the current design to work
around this issue. Flagging the jail as permanent and using manual process
management plus additional external locking to prevent duplicates is what
instantly springs to mind.

   Regards
   Steve


This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. 


In the event of misdirection, illegible or incomplete transmission please 
telephone +44 845 868 1337
or return the E.mail to postmas...@multiplay.co.uk.
