Thanks for the information, Hartmut.

I tried setting the core file ulimit to 1000000 blocks and rerunning the
salvage.  I still got no core file (the salvager "seemed" to complete):

[atums2:~]# ulimit -a
core file size          (blocks, -c) 1000000
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 49152
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 49152
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
[atums2:~]# bos salvage atums2 /vicepb chdata.sn
Starting salvage.
bos: salvage completed

The SalvageLog file shows the same thing as before.
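
One thing I'm now wondering is whether my interactive ulimit even reaches the
salvager, since 'bos salvage' has bosserver fork it and the child inherits
bosserver's limits rather than my shell's.  If that's it, restarting the
server processes from a shell with the limit already raised should do it; a
rough sketch (the init script name here is from the stock OpenAFS RPMs and
may differ for the CERN builds):

[atums2:~]# ulimit -c unlimited
[atums2:~]# /etc/init.d/openafs-server restart
[atums2:~]# ls -l /usr/afs/logs/core*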

Then I tried running 'gdb' and got:

[atums2:~]# gdb /usr/afs/bin/salvager
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-32.el5)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/afs/bin/salvager...(no debugging symbols found)...done.
(gdb) run /vicepb 536871656 -debug
Starting program: /usr/afs/bin/salvager /vicepb 536871656 -debug
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x2aaaaaaab000
Mon Jan 24 15:16:47 2011 Assertion failed! file vol-salvage.c, line 2859.

Program received signal SIGABRT, Aborted.
0x0000003408c30265 in raise () from /lib64/libc.so.6
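
Next time it aborts I plan to grab a stack trace from the same gdb session;
these are standard gdb commands, though with no debugging symbols the frames
will mostly be bare addresses (a matching openafs debuginfo package, if one
exists for the CERN build, should turn them into file:line references):

(gdb) bt
(gdb) bt full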

The log file then showed:

[atums2:~]# tail /usr/afs/logs/SalvageLog
@(#) OpenAFS 1.4.12 built  2010-12-13 1928681 19919656
01/24/2011 15:16:47 STARTING AFS SALVAGER 2.4 (/usr/afs/bin/salvager /vicepb 536871656 -debug)
01/24/2011 15:16:47 2 nVolumesInInodeFile 64
01/24/2011 15:16:47 CHECKING CLONED VOLUME 536871657.
01/24/2011 15:16:47 chdata.sn.readonly (536871657) updated 04/04/2007 15:29
01/24/2011 15:16:47 Partially allocated vnode 2 deleted.

So I assume I need to dig into vol-salvage.c around line 2859 to figure out
why the assertion failed?
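
For reference, here is roughly how I plan to look at the failing check,
assuming the stock 1.4.12 source matches what the CERN RPMs were built from:

[atums2:~]# wget http://www.openafs.org/dl/openafs/1.4.12/openafs-1.4.12-src.tar.bz2
[atums2:~]# tar xjf openafs-1.4.12-src.tar.bz2
[atums2:~]# sed -n '2850,2870p' openafs-1.4.12/src/vol/vol-salvage.c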

I should also note that the rest of the AFS cell is running the "SL" version
of OpenAFS, rather than "SLC" like this node.  One possibility is that I
could switch to those RPMs, since I have hit issues in the past with the
CERN customizations of OpenAFS.

Thanks,

Shawn

-----Original Message-----
From: Hartmut Reuter [mailto:[email protected]] 
Sent: Monday, January 24, 2011 9:04 AM
To: McKee, Shawn
Cc: [email protected]
Subject: Re: [OpenAFS] Problem with Off-line volumes...unable to bring On-line

Looks like a crash of the salvager. The SalvageLog should end differently,
with the summary line for the RW volume. Are there any core files in
/usr/afs/logs? If not, make sure ulimit for core file size isn't set to 0
and retry.
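
E.g., in the shell you start it from:

ulimit -c unlimited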

You could also run the salvager by hand under gdb to see why it crashes. You
then need to add the -debug flag to prevent it from forking. E.g.

gdb /usr/afs/bin/salvager
...
(gdb) run /vicepb 536871656 -debug


Good luck,
Hartmut

McKee, Shawn wrote:
> Hi Everyone,
>
> I am having a problem with one of my OpenAFS file servers. About ½ of
> the volumes are “Off-line” and I am unable to bring them online. First
> some system info and then I will list problem details and what I have tried.
>
> The system is running Scientific Linux 5.5/x86_64 (basically CentOS 5.5
> 64-bit). The openafs rpms are:
>
> [atums2:~]# rpm -qa | grep openafs
>
> openafs-kpasswd-1.4.12-6.cern
>
> openafs-client-1.4.12-6.cern
>
> kernel-module-openafs-2.6.18-194.3.1.el5-1.4.12-5.cern
>
> openafs-1.4.12-6.cern
>
> kernel-module-openafs-2.6.18-194.8.1.el5-1.4.12-5.cern
>
> openafs-krb5-1.4.12-6.cern
>
> kernel-module-openafs-2.6.18-238.1.1.el5-1.4.12-6.cern
>
> openafs-server-1.4.12-6.cern
>
> The version of ‘e2fsprogs’ is 1.39
>
> The system has an ext3 1TB partition for AFS:
>
> [atums2:~]# df /vicepb
>
> Filesystem 1K-blocks Used Available Use% Mounted on
>
> /dev/sda1 1007931664 635382472 321349196 67% /vicepb
>
> The system has 931 volumes and only 470 are On-line while 461 are Off-line:
>
> [atums2:~]# vos listvol atums2
>
> Total number of volumes on server atums2 partition /vicepb: 931
>
> chamber.OLD_eml4a07 536872814 RW 8634169 K Off-line
>
> chamber.OLD_eml4a07.readonly 536872815 RO 8634169 K On-line
>
> chamber.OLD_eml4a09 536872817 RW 702642 K Off-line
>
> chamber.OLD_eml4a09.readonly 536872818 RO 702642 K On-line
>
> …
>
> Total volumes onLine 470 ; Total volumes offLine 461 ; Total busy 0
>
> I have run ‘bos salvage’ on the partition multiple times. I have
> restarted the system. I have run a force fsck.ext3 check on the
> underlying partition (no problems found). Only RW volumes are Off-line.
> All RO volumes are On-line. There are a few RW volumes On-line (8 out of
> 469) but the rest won’t come On-line.
>
> Here is a particular volume which is Off-line:
>
> [atums2:~]# vos examine chdata.sn
>
> chdata.sn 536871656 RW 598 K Off-line
>
> atums2.cern.ch /vicepb
>
> RWrite 536871656 ROnly 0 Backup 0
>
> MaxQuota 10000000 K
>
> Creation Fri May 26 04:02:49 2006
>
> Copy Wed Oct 11 12:35:42 2006
>
> Backup Sun Jun 11 00:30:10 2006
>
> Last Access Fri Jan 7 16:38:32 2011
>
> Last Update Wed Apr 4 15:29:42 2007
>
> 0 accesses in the past day (i.e., vnode references)
>
> RWrite: 536871656 ROnly: 536871657 RClone: 536871657
>
> number of sites -> 3
>
> server atums1.cern.ch partition /vicepi RO Site -- Old release
>
> server atums2.cern.ch partition /vicepb RW Site -- New release
>
> server atums2.cern.ch partition /vicepb RO Site -- New release
>
> Try to bring online:
>
> [atums2:~]# vos online -server atums2 -partition /vicepb -id chdata.sn
>
> The FileLog shows:
>
> Sun Jan 23 22:57:03 2011 GetBitmap: addled vnode index in volume
> chdata.sn; volume needs salvage
>
> Sun Jan 23 22:57:03 2011 VAttachVolume: error getting bitmap for volume
> (/vicepb//V0536871656.vol)
>
> Try to Salvage:
>
> [atums2:~]# bos salvage atums2 /vicepb chdata.sn
>
> Starting salvage.
>
> bos: salvage completed
>
> The SalvageLog shows:
>
> [atums2:~]# tail /usr/afs/logs/SalvageLog
>
> @(#) OpenAFS 1.4.12 built 2010-12-13 1928681 19919656
>
> 01/23/2011 22:58:19 STARTING AFS SALVAGER 2.4 (/usr/afs/bin/salvager
> /vicepb 536871656)
>
> 01/23/2011 22:58:19 2 nVolumesInInodeFile 64
>
> 01/23/2011 22:58:19 CHECKING CLONED VOLUME 536871657.
>
> 01/23/2011 22:58:19 chdata.sn.readonly (536871657) updated 04/04/2007 15:29
>
> 01/23/2011 22:58:19 Partially allocated vnode 2 deleted.
>
> Try again:
>
> [atums2:~]# vos online -server atums2 -partition /vicepb -id chdata.sn
>
>
> FileLog has the same message:
>
> Sun Jan 23 22:59:05 2011 GetBitmap: addled vnode index in volume
> chdata.sn; volume needs salvage
>
> Sun Jan 23 22:59:05 2011 VAttachVolume: error getting bitmap for volume
> (/vicepb//V0536871656.vol)
>
> Salvage attempt again:
>
> [atums2:~]# bos salvage atums2 /vicepb chdata.sn
>
> Starting salvage.
>
> bos: salvage completed
>
> [atums2:~]# tail /usr/afs/logs/SalvageLog
>
> @(#) OpenAFS 1.4.12 built 2010-12-13 1928681 19919656
>
> 01/23/2011 23:00:07 STARTING AFS SALVAGER 2.4 (/usr/afs/bin/salvager
> /vicepb 536871656)
>
> 01/23/2011 23:00:07 2 nVolumesInInodeFile 64
>
> 01/23/2011 23:00:07 CHECKING CLONED VOLUME 536871657.
>
> 01/23/2011 23:00:07 chdata.sn.readonly (536871657) updated 04/04/2007 15:29
>
> 01/23/2011 23:00:07 Partially allocated vnode 2 deleted.
>
> Same result as if the prior salvage didn’t do anything. This is exactly
> what happens on other volumes I have tried to bring online.
>
> So how would I fix this? Any suggestions for how to get the rest of
> these volumes On-line?
>
> Let me know if you need further details. Thanks,
>
> Shawn
>


-- 
-----------------------------------------------------------------
Hartmut Reuter                  e-mail          [email protected]
                                phone            +49-89-3299-1328
                                fax              +49-89-3299-1301
RZG (Rechenzentrum Garching)    web    http://www.rzg.mpg.de/~hwr
Computing Center of the Max-Planck-Gesellschaft (MPG) and the
Institut fuer Plasmaphysik (IPP)
-----------------------------------------------------------------
