Re: [OpenAFS-devel] Solaris fixes for 1.4.x / AFS_SUN510_ENV

Jeffrey Hutzelman Mon, 11 Feb 2008 11:30:10 -0800

--On Monday, February 11, 2008 07:50:10 PM +0100 "Frank Batschulat (Home)" <[EMAIL PROTECTED]> wrote:

On Mon, 11 Feb 2008 19:28:51 +0100, Derrick Brashear <[EMAIL PROTECTED]>
wrote:

>> >> 1. SSYS process exiting considered harmful
>> >>
>> >>   The first problem is that setting process flag SSYS on a
>> >>   process that exits, as the afs_osi_Invisible routine on Solaris
>> >>   10 does, causes the system not to clean up the contract state
>> >>   of the process.  This leaves a dangling kernel-memory pointer
>> >>   in the contract table which used to point to the process struct.
>> >>
>> >>   Any user can corrupt kernel memory and cause a panic with the
>> >>   'ctstat' command and the system cannot shut down without either
>> >>   panicking or going into an infinite loop as svc.startd
>> >>   repeatedly tries to kill the non-existent process.
>> >>
>> >> I really don't know why the code would set SSYS on a userland
>> >> process that's about to exit in the first place.  Can anyone shed
>> >> any light?
>> >
>> > Threads that call afs_osi_Invisible are not about to exit; they're
>> > about to become long-lived AFS kernel threads.  Setting SSYS is
>> > correct; we just
>>
>> Actually it is not appropriate for an arbitrary thread/proc to set
>> SSYS.
>>
>> Only system processes [they exist only in kernel, i.e. p_as is set to
>> kas] created with newproc() are eligible for SSYS, and that happens
>> automatically in newproc().
>
> This is a system process, just not one created by newproc().

actually there are only a few 'system processes' and these are sched,
init, pageout, fsflush, zsched and the cluster_wrapper. there are no
other 'system processes' in that sense.

refer to main() in
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/main.c

regular kernel threads are parented to sched (p0) while zone specific
kernel threads created by zthread_create() are parented to zsched.

> Presumably we need to do something analogous to the linux
> kernel_thread code, calling newproc.

nope, we've been there before:

http://www.openafs.org/pipermail/openafs-devel/2002-April/007896.html

I wonder what you are trying to accomplish by setting SSYS? And I'm
still unclear whether you are doing this to a kernel thread or a user land
process.


afsd->afs syscall() and then SSYS is set. Before the syscall returns,
SSYS is cleared. I don't have notes handy but I assume this was "we
really aren't interested in being signalled while we're in the
kernel". I guess then (if that's really it) lwp_sigmask, or switch to
real (not newproc) kernel threads.


ah, so the AFS daemon user land process issuing the AFS syscall is doing
this, thanks.

if that's the intent, i.e. to block all signals for the duration of the
AFS syscall's kernel execution, afsd could possibly use sigfillset(3C) &
thr_sigsetmask(3C), e.g.

sigset_t sgset;

/* Block all signals */

(void) sigfillset(&sgset);
(void) thr_sigsetmask(SIG_BLOCK, &sgset, NULL);

/* execute AFS syscall here */

/* open for signals again */

(void) thr_sigsetmask(SIG_UNBLOCK, &sgset, NULL);

I can't comment on the real kernel threads though because I'm not familiar
enough with how the syscall is currently implemented.

No. This is not a process calling a syscall that wants to block threads
for a while and then return. This is a process which is donating its
context to a kernel thread by calling a syscall that will never return
(well, actually, it will eventually return, when AFS is shut down, but not
without returning the process state to something reasonably normal,
including clearing SSYS).


Yes, we've been over this before.  You said:

You can not call exit() or something similar in a thread
created by newproc(). That would leave it still lying around
as it would not be cleaned up, a process with freed threads.

Great. We have no intent to call exit on a process with SSYS set. We
don't expect anything to wait on it, either. And we certainly don't want
some bozo to be able to SIGKILL it and have it stop being scheduled. It's
a long-running kernel thread which performs tasks critical to the correct
operation of the AFS cache manager, and killing such a thread at the wrong
moment would likely result in deadlocks or in other threads (including
random user-land processes) blocking forever waiting for something that's
never going to happen.


-- Jeffrey T. Hutzelman (N3NHS) <[EMAIL PROTECTED]>
  Carnegie Mellon University - Pittsburgh, PA

