Re: [Bacula-users] Need help debugging SD crash

2010-04-08 Thread Matija Nalis
On Tue, Apr 06, 2010 at 08:40:20AM -0600, Robert LeBlanc wrote:
 I've tried in the past to do exactly this. Bacula will usually spit out an
 error that the tape could not be moved or in rarer situations say the drive
 is not there. I then shut down bacula-sd and try to run the mt eject command
 I I usually get back about ten lines that describe the error, but it does
 really make sense. Sometimes the drive doesn't appear as a device on the
 system any more. As far as the tape library, the Overland Neo 8000 most of

[...]

 the time says soft removal error on the screen and will keep saying that if
 I try to have the library remove it. There is no easy way to get to the
 hardware eject button as the library is fully enclosed.

It looks like the drive gets confused if it gets commands too fast
(and/or while it is still processing previous commands)... Anyway, it
looks like problem outside bacula (probably either the kernel, or a
drive firmware, or both are at error).

 drives and our LTO-4 drive. The only thing that I can think of is that
 bacula is trying to take some shortcuts (issuing a command to move the tape
 and expecting the tape library to correctly rewind the tape, eject and then
 move it and maybe bacula is not quiet letting go of the drive fast enough
 and there gets a deadlock between the drive controlled by Bacula and the
 library trying to control it), or there is a kernel/driver problem.

Only thing bacula does is execute mtx-changer script; it is the
scripts responsibility to does everything needed for your drive /
changer combination. The default script is usually good, but you may
need to tailor it for your needs (if it needs manual rewind before
offline, or things like that).

 I've set the offline=1 in mtx-changer.conf and that seems to help a little,
 I've still encountered some drive unmouting issues, but nothing that bacula
 hasn't been able to recover from on it's own or with very little manual
 intervention.

I run mine (IBM3584) with:
offline=1
offline_sleep=2
load_sleep=20

I do recall having sporadic issues with load_sleep of just a 2-3
seconds, so I've put it to 20 to allow the drive to settle fully
before issuing a bunch of mt status to it in wait_for_drive().

 I was pretty sure the messages were informational, I'm glad that someone can
 confirm that. I'll keep working on the problem to see what I can come up
 with. If there is a better way to tell Bacula to be stupid slow with unmount
 and mount requests, that may help me find where in the process things are
 getting hung up.

Well, you can put (in 5.0.1 at least) offline_sleep and load_sleep to
30 seconds or more for example, that might help if drive is getting
confused while receiving commands too fast.

On older versions (3.0.x or 2.4 ?) you can edit the mtx-changer shell
script itself, IIRC it had commented out sleep statements at right
places already...

--
Download Intel#174; Parallel Studio Eval
Try the new software tools for yourself. Speed compiling, find bugs
proactively, and fine-tune applications for parallel performance.
See why Intel Parallel Studio got high marks during beta.
http://p.sf.net/sfu/intel-sw-dev
___
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users


Re: [Bacula-users] Need help debugging SD crash

2010-04-08 Thread Robert LeBlanc
On Thu, Apr 8, 2010 at 9:02 AM, Matija Nalis
mnalis+bac...@carnet.hrmnalis%2bbac...@carnet.hr
 wrote:

 On Tue, Apr 06, 2010 at 08:40:20AM -0600, Robert LeBlanc wrote:
  I've tried in the past to do exactly this. Bacula will usually spit out
 an
  error that the tape could not be moved or in rarer situations say the
 drive
  is not there. I then shut down bacula-sd and try to run the mt eject
 command
  I I usually get back about ten lines that describe the error, but it does
  really make sense. Sometimes the drive doesn't appear as a device on the
  system any more. As far as the tape library, the Overland Neo 8000 most
 of

 [...]

  the time says soft removal error on the screen and will keep saying that
 if
  I try to have the library remove it. There is no easy way to get to the
  hardware eject button as the library is fully enclosed.

 It looks like the drive gets confused if it gets commands too fast
 (and/or while it is still processing previous commands)... Anyway, it
 looks like problem outside bacula (probably either the kernel, or a
 drive firmware, or both are at error).

  drives and our LTO-4 drive. The only thing that I can think of is that
  bacula is trying to take some shortcuts (issuing a command to move the
 tape
  and expecting the tape library to correctly rewind the tape, eject and
 then
  move it and maybe bacula is not quiet letting go of the drive fast enough
  and there gets a deadlock between the drive controlled by Bacula and the
  library trying to control it), or there is a kernel/driver problem.

 Only thing bacula does is execute mtx-changer script; it is the
 scripts responsibility to does everything needed for your drive /
 changer combination. The default script is usually good, but you may
 need to tailor it for your needs (if it needs manual rewind before
 offline, or things like that).

  I've set the offline=1 in mtx-changer.conf and that seems to help a
 little,
  I've still encountered some drive unmouting issues, but nothing that
 bacula
  hasn't been able to recover from on it's own or with very little manual
  intervention.

 I run mine (IBM3584) with:
 offline=1
 offline_sleep=2
 load_sleep=20

 I do recall having sporadic issues with load_sleep of just a 2-3
 seconds, so I've put it to 20 to allow the drive to settle fully
 before issuing a bunch of mt status to it in wait_for_drive().

  I was pretty sure the messages were informational, I'm glad that someone
 can
  confirm that. I'll keep working on the problem to see what I can come up
  with. If there is a better way to tell Bacula to be stupid slow with
 unmount
  and mount requests, that may help me find where in the process things are
  getting hung up.

 Well, you can put (in 5.0.1 at least) offline_sleep and load_sleep to
 30 seconds or more for example, that might help if drive is getting
 confused while receiving commands too fast.

 On older versions (3.0.x or 2.4 ?) you can edit the mtx-changer shell
 script itself, IIRC it had commented out sleep statements at right
 places already...


Thanks, this is helpful, I'll give these a try.

Robert LeBlanc
Life Sciences  Undergraduate Education Computer Support
Brigham Young University
--
Download Intel#174; Parallel Studio Eval
Try the new software tools for yourself. Speed compiling, find bugs
proactively, and fine-tune applications for parallel performance.
See why Intel Parallel Studio got high marks during beta.
http://p.sf.net/sfu/intel-sw-dev___
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users


Re: [Bacula-users] Need help debugging SD crash

2010-04-06 Thread Matija Nalis
On Sun, Apr 04, 2010 at 01:20:49PM -0600, Robert LeBlanc wrote:
 I'm having problems with our SD and tapes being locked in the
 drive occasionally.

How does it manifest exactly ? bconsole umount command returns error,
or remains in some state (check with status storage) ? Which state
and/or error ?

Have you tried shutting down bacula-sd and ejecting tape with mt
eject and/or mt offline ? Do they succeed (and the drive ejects)
or do they return error (and which one) ? Double check that bacula-sd
is down before you try those (they won't work if bacula-sd is still
having the drive open).

And if mt(1) also fails, can you eject tape manually by using tape
library eject function and/or pressing hardware eject button on the
drive itself (depending on the library type...) ?

If mt works but bacula-sd doesn't, than you can rule out hardware and
kernel -- it is bacula problem (and usually status storage will
show it -- it can happen sometimes if you have more than one drive
that it deadlocks by waiting for a tape that is in the other drive).

 At first I thought this might be a problem with our tape
 library.

That still looks like the most probable cause to me - like a drive in
the library is having problems. We've had a similar issue with one of
several LTO2 drives in our library; it would (sometimes) take the
tape and refuse to give it back (on mt eject and even physical
button touch). Needed power cycling and long (half a minute?) button
press to make it give the tape back.

After it happened third time (always the same drive) we kicked it out
of the library. Other drives worked OK all the time.

If the hardware button always works but software commands don't, it
could be fiber cables and/or GBIC/SPF (which we refused to believe at
one time because drives were always detected OK and worked, albeit
sometimes much slower than normal, without any errors in kernel logs,
and would also lock up). You can try cleaning tape also.

 Then I saw these errors in the syslog. I switched out the Qlogic FC
 adapter thinking that maybe it was just losing all the paths to the drive.

AFAIR you would get different errors if it loses path completely (but
it is possible for drive to behave erratically even if it doesn't
lose path)

 I'm still getting the errors, so I'm not sure where the hangup is. I can't
 tell if it's a bug in the kernel module, mt or bacula. Can someone give me
 some pointers to narrowing this down? This has been happening for over a
 year and through several kernel and bacula versions.
 
 This is Debian Squeeze
 
 Linux lsddomainsd 2.6.32-trunk-686 #1 SMP Sun Jan 10 06:32:16 UTC 2010 i686
 GNU/Linux

The INFO: messages themselves are just normal feature of newer
2.6.x kernels, they are informational message only (See INFO:) that
tells you some system call (like open(2) or write(2) or read(2)) is
taking longer than 120 seconds to complete. They didn't exist in
older kernels.

It is there to catch problems with I/O schedulers and problematic
hardware issues -- but sometime it needs to be increased for tape
drives (it is quite possible for open(2) or lseek(2) on tape to have
to rewind it, and that sometimes can take more than two minutes).

you can raise the current kernel limit with:
echo 300  /proc/sys/kernel/hung_task_timeout_secs

or (to survive reboot) by putting:
kernel.hung_task_timeout_secs=300
in /etc/sysctl.conf (or a file in /etc/sysctl.d directory)

But as I say, those will not help your lockup problems, just make the
spurious messages go away when they are to be expected.


Try the other things in the mail to narrow the problem down to
bacula, kernel or hardware.

--
Download Intel#174; Parallel Studio Eval
Try the new software tools for yourself. Speed compiling, find bugs
proactively, and fine-tune applications for parallel performance.
See why Intel Parallel Studio got high marks during beta.
http://p.sf.net/sfu/intel-sw-dev
___
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users


Re: [Bacula-users] Need help debugging SD crash

2010-04-06 Thread Robert LeBlanc
On Tue, Apr 6, 2010 at 3:23 AM, Matija Nalis
mnalis+bac...@carnet.hrmnalis%2bbac...@carnet.hr
 wrote:

 On Sun, Apr 04, 2010 at 01:20:49PM -0600, Robert LeBlanc wrote:
  I'm having problems with our SD and tapes being locked in the
  drive occasionally.

 How does it manifest exactly ? bconsole umount command returns error,
 or remains in some state (check with status storage) ? Which state
 and/or error ?

 Have you tried shutting down bacula-sd and ejecting tape with mt
 eject and/or mt offline ? Do they succeed (and the drive ejects)
 or do they return error (and which one) ? Double check that bacula-sd
 is down before you try those (they won't work if bacula-sd is still
 having the drive open).


And if mt(1) also fails, can you eject tape manually by using tape

 library eject function and/or pressing hardware eject button on the
 drive itself (depending on the library type...) ?


I've tried in the past to do exactly this. Bacula will usually spit out an
error that the tape could not be moved or in rarer situations say the drive
is not there. I then shut down bacula-sd and try to run the mt eject command
I I usually get back about ten lines that describe the error, but it does
really make sense. Sometimes the drive doesn't appear as a device on the
system any more. As far as the tape library, the Overland Neo 8000 most of
the time says soft removal error on the screen and will keep saying that if
I try to have the library remove it. There is no easy way to get to the
hardware eject button as the library is fully enclosed.

We have had the LTO-3 drives replaced on multiple occasions for this very
reason, that is why I don't think it is the drive. Each drive has a fibre
connection to the Fibre switch and it is has happened on both our LTO-3
drives and our LTO-4 drive. The only thing that I can think of is that
bacula is trying to take some shortcuts (issuing a command to move the tape
and expecting the tape library to correctly rewind the tape, eject and then
move it and maybe bacula is not quiet letting go of the drive fast enough
and there gets a deadlock between the drive controlled by Bacula and the
library trying to control it), or there is a kernel/driver problem.

I've set the offline=1 in mtx-changer.conf and that seems to help a little,
I've still encountered some drive unmouting issues, but nothing that bacula
hasn't been able to recover from on it's own or with very little manual
intervention.


 If mt works but bacula-sd doesn't, than you can rule out hardware and
 kernel -- it is bacula problem (and usually status storage will
 show it -- it can happen sometimes if you have more than one drive
 that it deadlocks by waiting for a tape that is in the other drive).

  At first I thought this might be a problem with our tape
  library.

 That still looks like the most probable cause to me - like a drive in
 the library is having problems. We've had a similar issue with one of
 several LTO2 drives in our library; it would (sometimes) take the
 tape and refuse to give it back (on mt eject and even physical
 button touch). Needed power cycling and long (half a minute?) button
 press to make it give the tape back.

 After it happened third time (always the same drive) we kicked it out
 of the library. Other drives worked OK all the time.

 If the hardware button always works but software commands don't, it
 could be fiber cables and/or GBIC/SPF (which we refused to believe at
 one time because drives were always detected OK and worked, albeit
 sometimes much slower than normal, without any errors in kernel logs,
 and would also lock up). You can try cleaning tape also.

  Then I saw these errors in the syslog. I switched out the Qlogic FC
  adapter thinking that maybe it was just losing all the paths to the
 drive.

 AFAIR you would get different errors if it loses path completely (but
 it is possible for drive to behave erratically even if it doesn't
 lose path)


I have seen times where there are path errors and that is when the drive
seems to disappear from the system completely, but this is not the usual
case and the one that causes the most problems.


  I'm still getting the errors, so I'm not sure where the hangup is. I
 can't
  tell if it's a bug in the kernel module, mt or bacula. Can someone give
 me
  some pointers to narrowing this down? This has been happening for over a
  year and through several kernel and bacula versions.
 
  This is Debian Squeeze
 
  Linux lsddomainsd 2.6.32-trunk-686 #1 SMP Sun Jan 10 06:32:16 UTC 2010
 i686
  GNU/Linux

 The INFO: messages themselves are just normal feature of newer
 2.6.x kernels, they are informational message only (See INFO:) that
 tells you some system call (like open(2) or write(2) or read(2)) is
 taking longer than 120 seconds to complete. They didn't exist in
 older kernels.

 It is there to catch problems with I/O schedulers and problematic
 hardware issues -- but sometime it needs to be increased for tape
 drives (it is quite possible 

[Bacula-users] Need help debugging SD crash

2010-04-04 Thread Robert LeBlanc
I'm having problems with our SD and tapes being locked in the
drive occasionally. At first I thought this might be a problem with our tape
library. Then I saw these errors in the syslog. I switched out the Qlogic FC
adapter thinking that maybe it was just losing all the paths to the drive.
I'm still getting the errors, so I'm not sure where the hangup is. I can't
tell if it's a bug in the kernel module, mt or bacula. Can someone give me
some pointers to narrowing this down? This has been happening for over a
year and through several kernel and bacula versions.

This is Debian Squeeze

Linux lsddomainsd 2.6.32-trunk-686 #1 SMP Sun Jan 10 06:32:16 UTC 2010 i686
GNU/Linux

bacula-sd: invalid option -- 'V'
Copyright (C) 2000-2010 Free Software Foundation Europe e.V.

Version: 5.0.1 (24 February 2010)

mt-st v. 1.1


Apr  4 07:08:23 lsddomainsd kernel: [137640.964059] INFO: task
bacula-sd:12439 blocked for more than 120 seconds.
Apr  4 07:08:23 lsddomainsd kernel: [137640.980153] echo 0 
/proc/sys/kernel/hung_task_timeout_secs disables this message.
Apr  4 07:08:23 lsddomainsd kernel: [137640.996879] bacula-sd D f7e47b88
0 12439  1 0x
Apr  4 07:08:23 lsddomainsd kernel: [137640.996889]  f551aec0 00200086
0020 f7e47b88 00200246 c13f4000 c13f4000 c13ef604
Apr  4 07:08:23 lsddomainsd kernel: [137640.996901]  f551b07c c3808000
 f7fc856f f570c060 f8024000 f6d72820 
Apr  4 07:08:23 lsddomainsd kernel: [137640.996911]  c3803604 f551b07c
020b2e87 0001 f6d72800   
Apr  4 07:08:23 lsddomainsd kernel: [137640.996919] Call Trace:
Apr  4 07:08:23 lsddomainsd kernel: [137640.996957]  [f7fc856f] ?
qla2x00_start_scsi+0x29b/0x2cc [qla2xxx]
Apr  4 07:08:23 lsddomainsd kernel: [137640.996969]  [c1259f49] ?
schedule_timeout+0x20/0xb0
Apr  4 07:08:23 lsddomainsd kernel: [137640.996976]  [c1122934] ?
blk_peek_request+0x135/0x143
Apr  4 07:08:23 lsddomainsd kernel: [137640.996988]  [f7e32987] ?
scsi_dispatch_cmd+0x185/0x1e5 [scsi_mod]
Apr  4 07:08:23 lsddomainsd kernel: [137640.997000]  [f7e37382] ?
scsi_request_fn+0x3c1/0x47a [scsi_mod]
Apr  4 07:08:23 lsddomainsd kernel: [137640.997006]  [c1259e52] ?
wait_for_common+0xa4/0x100
Apr  4 07:08:23 lsddomainsd kernel: [137640.997014]  [c102da50] ?
default_wake_function+0x0/0x8
Apr  4 07:08:23 lsddomainsd kernel: [137640.997019]  [f92fe756] ?
st_scsi_execute_end+0x0/0x45 [st]
Apr  4 07:08:23 lsddomainsd kernel: [137640.997024]  [f92fee28] ?
st_do_scsi+0x28d/0x2b5 [st]
Apr  4 07:08:23 lsddomainsd kernel: [137640.997028]  [f92ffb81] ?
st_int_ioctl+0x624/0xa68 [st]
Apr  4 07:08:23 lsddomainsd kernel: [137640.997034]  [c11be12a] ?
release_sock+0xf/0x7f
Apr  4 07:08:23 lsddomainsd kernel: [137640.997040]  [c11f0c2a] ?
tcp_sendmsg+0x69d/0x77a
Apr  4 07:08:23 lsddomainsd kernel: [137640.997044]  [f92ff92e] ?
st_int_ioctl+0x3d1/0xa68 [st]
Apr  4 07:08:23 lsddomainsd kernel: [137640.997050]  [c11bbb11] ?
__sock_sendmsg+0x43/0x4a
Apr  4 07:08:23 lsddomainsd kernel: [137640.997055]  [f930198a] ?
st_ioctl+0xb1b/0xe62 [st]
Apr  4 07:08:23 lsddomainsd kernel: [137640.997059]  [c1259e5d] ?
wait_for_common+0xaf/0x100
Apr  4 07:08:23 lsddomainsd kernel: [137640.997065]  [c10b1aa2] ?
do_sync_write+0xc0/0x107
Apr  4 07:08:23 lsddomainsd kernel: [137640.997070]  [f9300e6f] ?
st_ioctl+0x0/0xe62 [st]
Apr  4 07:08:23 lsddomainsd kernel: [137640.997075]  [c10bc220] ?
vfs_ioctl+0x1c/0x5f
Apr  4 07:08:23 lsddomainsd kernel: [137640.997079]  [c10bc7b4] ?
do_vfs_ioctl+0x4aa/0x4e5
Apr  4 07:08:23 lsddomainsd kernel: [137640.997083]  [c10b17ee] ?
fsnotify_modify+0x5a/0x61
Apr  4 07:08:23 lsddomainsd kernel: [137640.997087]  [c10b23ee] ?
vfs_write+0x9e/0xd6
Apr  4 07:08:23 lsddomainsd kernel: [137640.997091]  [c10bc830] ?
sys_ioctl+0x41/0x58
Apr  4 07:08:23 lsddomainsd kernel: [137640.997097]  [c10030fb] ?
sysenter_do_call+0x12/0x28
Apr  4 07:10:23 lsddomainsd kernel: [137760.996059] INFO: task
bacula-sd:12439 blocked for more than 120 seconds.
Apr  4 07:10:23 lsddomainsd kernel: [137761.012949] echo 0 
/proc/sys/kernel/hung_task_timeout_secs disables this message.
Apr  4 07:10:23 lsddomainsd kernel: [137761.030339] bacula-sd D f7e47b88
0 12439  1 0x
Apr  4 07:10:23 lsddomainsd kernel: [137761.030346]  f551aec0 00200086
0020 f7e47b88 00200246 c13f4000 c13f4000 c13ef604
Apr  4 07:10:23 lsddomainsd kernel: [137761.030355]  f551b07c c3808000
 f7fc856f f570c060 f8024000 f6d72820 
Apr  4 07:10:23 lsddomainsd kernel: [137761.030363]  c3803604 f551b07c
020b2e87 0001 f6d72800   
Apr  4 07:10:23 lsddomainsd kernel: [137761.030371] Call Trace:
Apr  4 07:10:23 lsddomainsd kernel: [137761.030409]  [f7fc856f] ?
qla2x00_start_scsi+0x29b/0x2cc [qla2xxx]
Apr  4 07:10:23 lsddomainsd kernel: [137761.030421]  [c1259f49] ?
schedule_timeout+0x20/0xb0
Apr  4 07:10:23 lsddomainsd kernel: [137761.030428]  [c1122934] ?
blk_peek_request+0x135/0x143
Apr  4 07:10:23 lsddomainsd kernel: [137761.030439]