This Sunday we suffered two simultaneous fileserver crashes. Both
servers crashed while talking to the same client, and both crashed
at the same line of code.

This was using openafs-1.2.6 on Tru64 5.0a on alpha.

The crash occurred in GetClient(), at line 1485 of viced/host.c:

(ladebug) where
#0  0x3ff800d08f8 in _sigprocmask(0x2, 0x20, 0x0, 0x0, 0x0, 0x2000324cea5) in 
/usr/shlib/libc.so
#1  0x3ff800d2b6c in __sigprocmask(0x2, 0x20, 0x0, 0x0, 0x0, 0x2000324cea5) in 
/usr/shlib/libc.so
#2  0x3ff801894f0 in abort(0x2, 0x20, 0x0, 0x0, 0x0, 0x2000324cea5) in 
/usr/shlib/libc.so
#3  0x120042334 in AssertionFailed(file=0x140004cf0="../viced/host.c", line=32) 
"../util/assert.c":44
>4  0x12003344c in GetClient(tcon=0x1419c0900, cp=0x2000324d858) "../viced/host.c":1485
#5  0x120029f54 in GetVolumePackage(tcon=0x1419c0900, Fid=0x2000324d988, 
volptr=0x2000324d860, targetptr=0x2000324d870, chkforDir=<no value>, 
parent=0x2000324d868, client=0x2000324d858, locktype=1, rights=0x2000324d850, 
anyrights=0x2000324d848) "../viced/afsfileprocs.c":4922
#6  0x12001dce4 in SAFSS_FetchStatus(tcall=0x1418d1c00, Fid=0x2000324d988, 
OutStatus=0x2000324d9a8, CallBack=0x2000324d978, Sync=0x2000324d960) 
"../viced/afsfileprocs.c":817
#7  0x12001ee84 in SRXAFS_FetchStatus(tcon=0x1419c0900, Fid=0x2000324d988, 
OutStatus=0x2000324d9a8, CallBack=0x2000324d978, Sync=0x2000324d960) 
"../viced/afsfileprocs.c":1173
#8  0x12005df18 in _RXAFS_FetchStatus(z_call=0x1418d1c00, z_xdrs=0x2000324da20) 
"../fsint/afsint.ss.c":174
#9  0x120063c8c in RXAFS_ExecuteRequest(z_call=0x1418d1c00) "../fsint/afsint.ss.c":1892
#10 0x12007ccb8 in rxi_ServerProc(threadID=<no value>, newcall=0x0, 
socketp=0x2000324dac0) "../rx/rx.c":1326
#11 0x120092c9c in rx_ServerProc() "../rx/rx_pthread.c":288
#12 0x12009246c in server_entry(argp=0x3ff805b44a0) "../rx/rx_pthread.c":94
#13 0x3ff805b5f3c in __thdBase(0x2, 0x20, 0x0, 0x0, 0x0, 0x2000324cea5) in 
/usr/shlib/libpthread.so
(ladebug) p client
0x0
(ladebug) p tcon->nSpecific
2
(ladebug) p rxcon_client_key
1
(ladebug) p tcon->specific[1]
0x0

The way I read the code, tcon->specific[rxcon_client_key] can only
be NULL while tcon->nSpecific is 2 if rx_SetSpecific was called to
set it to NULL. This is done in h_TossStuff_r().
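
To make that reasoning concrete, here is a minimal sketch of how I understand the per-connection slots to behave. It is a simplified model, not the rx source: the names model_conn, model_set_specific and model_get_specific are hypothetical, and it assumes that the slot array grows on the first set for a key and never shrinks.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the connection's specific-data fields. */
struct model_conn {
    void **specific;   /* per-key slots, indexed by key */
    int nSpecific;     /* number of slots allocated so far */
};

/* Store ptr under key, growing the slot array if needed.
 * Returns 0 on success, -1 on a bad key or allocation failure. */
static int model_set_specific(struct model_conn *c, int key, void *ptr)
{
    if (key < 0)
        return -1;
    if (key >= c->nSpecific) {
        void **grown = realloc(c->specific,
                               (size_t)(key + 1) * sizeof(void *));
        if (grown == NULL)
            return -1;
        /* Newly added slots start out empty. */
        for (int i = c->nSpecific; i <= key; i++)
            grown[i] = NULL;
        c->specific = grown;
        c->nSpecific = key + 1;
    }
    c->specific[key] = ptr;
    return 0;
}

/* Return the pointer stored under key, or NULL if none was ever set. */
static void *model_get_specific(const struct model_conn *c, int key)
{
    if (key < 0 || key >= c->nSpecific)
        return NULL;
    return c->specific[key];
}
```

In this model, setting key 1 to a client pointer brings nSpecific to 2, and a later set of key 1 to NULL leaves nSpecific at 2 with an empty slot, which is the state seen in the debugger output above.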

The log shows:

Sun Sep 15 17:40:43 2002 CB: new identity for host 130.237.49.75:26386, deleting
Sun Sep 15 17:40:59 2002 CB: new identity for host 130.237.49.75:26386, deleting
Sun Sep 15 17:40:59 2002 CB: new identity for host 130.237.49.75:26386, deleting
Sun Sep 15 17:40:59 2002 CB: new identity for host 130.237.49.75:26386, deleting
Sun Sep 15 17:40:59 2002 CB: new identity for host 130.237.49.75:26386, deleting
Sun Sep 15 17:42:20 2002 CB: new identity for host 130.237.49.75:26386, deleting

26386 is 4711 with its two bytes swapped, so there seems to be a
byte-order error somewhere, but I guess that is unrelated.
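
The arithmetic behind the port numbers can be checked with a few lines of C. This is only an illustration of the swap (4711 is 0x1267 and 26386 is 0x6712); swap16 is a hypothetical helper and not something taken from the fileserver code.

```c
#include <stdint.h>

/* Swap the two bytes of a 16-bit value. This is the effect of treating
 * a port number in network byte order as if it were in host order on a
 * little-endian machine, or the other way around. */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}
```

Calling swap16(4711) gives 26386, and swap16(26386) gives 4711 back.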

: datan mattiasa $ ; rxdebug 130.237.49.75 4711 -version
Trying 130.237.49.75 (port 4711):
AFS version: arla-0.35.8pre1

So, I assume that h_GetHost_r (called from the preamble) manages to
h_Release_r(host) and toss it.

What I don't understand is how it can later become reused.

/mattiasa