Hello,

On 02.11.2005 19:26, Vadim A. Umanski wrote:

How do you do, bacula-users.

I've inherited a backup system powered by Bacula (Bacula 1.36.1). It
runs on Solaris 10 for x86 and stores data on a disk array. The previous
sysadmin installed it, but he is no longer reachable. The server
running Bacula is also used for some other important things, so I have
to treat it and reconfigure it with caution.

That would be better...

It worked OK for a while. I'm new to Bacula, so I didn't touch
working software. It starts doing backups early in the night and finishes
in the morning: full backups every Sunday and incrementals daily. But
after some time one of the Bacula processes started to crash every morning,
and one or more (or all) jobs were left undone. This has been going on
for some weeks, and it has become clear to me that I need help.

Ok, let's see what we can do.

Here I'll try to describe what's happening.

--------------------------------------------------------------

Normally on the server it looked like this
# ps -ef | grep bacula | grep -v grep
  bacula  1362     1   0 10:23:03 ?           0:00 /usr/local/bacula/sbin/bacula-dir -u bacula -g bacula -v -c /usr/local/bacula/e
    root  1350     1   0 10:22:40 ?           0:00 /usr/local/bacula/sbin/bacula-fd -u root -g root -v -c /usr/local/bacula/etc/ba
  bacula  1348     1   0 10:22:40 ?           0:00 /usr/local/bacula/sbin/bacula-sd -u bacula -g bacula -v -c /usr/local/bacula/et

That's normal.

and every morning the Director's process, bacula-dir, is missing.

That's bad.
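
To pin down when the Director disappears, a small check run from cron could record which daemons are missing from the process list. A minimal sketch, assuming the three daemon names and the full-path invocation shown in the listing above; the function name missing_daemons is made up for illustration:

```shell
#!/bin/sh
# Print the name of every Bacula daemon that does NOT appear in a process
# listing read from standard input (for example the output of "ps -ef").
# missing_daemons is a hypothetical helper, not part of Bacula.
missing_daemons() {
    listing=$(cat)
    for daemon in bacula-dir bacula-sd bacula-fd; do
        case $listing in
            # Assumes the daemon was started with a full path and arguments,
            # so its name is preceded by "/" and followed by a space.
            *"/$daemon "*) ;;
            *) echo "$daemon" ;;
        esac
    done
}
```

Run every few minutes as "ps -ef | missing_daemons", with the output appended to a file together with the output of date, this would show roughly when bacula-dir goes away and so which job was running at that time.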

The end of the most recent morning's log looks like this
# less /var/db/bacula/log
... skip some log output...
02-Nov 03:15 nfs4p-dir: Start Backup JobId 3960, Job=sinux-oracle.2005-11-02_03.15.00
02-Nov 03:15 sinux-fd: ClientRunBeforeJob: -su: line 8: ulimit: max user processes: cannot modify limit: Operation not permitted

That one indicates a problem, I guess. There seems to be a limit on the number of processes a user may have running, and some script or program tries to raise that limit. You should investigate the script that is called as the Client Run Before Job script for the job sinux-oracle. Just to point out the obvious: that script is not on the director machine (probably nfs4p) but on sinux.

That situation *could* indicate a serious security problem, even a compromised database server. Good luck.
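
One way to find the offending line 8 is to search the candidate scripts for ulimit calls together with their line numbers. A minimal sketch: find_ulimit_calls is a made-up helper name, and which files to search is an assumption on my part. The "-su:" prefix in the log hints that the command may run in a login shell started through "su -", so that user's profile files could be worth checking as well as the run-before script itself.

```shell
#!/bin/sh
# Print "file:line:text" for every line mentioning ulimit in the given files.
# find_ulimit_calls is a hypothetical helper name; unreadable files are skipped.
find_ulimit_calls() {
    for file in "$@"; do
        if [ -r "$file" ]; then
            # The extra /dev/null argument makes grep print the file name
            # even when only one real file is searched.
            grep -n 'ulimit' "$file" /dev/null || true
        fi
    done
}
```

On sinux this could be run as, say, find_ulimit_calls /etc/profile "$HOME/.profile" "$HOME/.bash_profile" as the user the script switches to, plus the path of the run-before script; a hit on line 8 would be the likely source of the message.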

02-Nov 03:19 sinux-fd: ClientRunBeforeJob:
02-Nov 03:19 sinux-fd: ClientRunBeforeJob: SQL*Plus: Release 10.1.0.3.0 - Production on Wed Nov 2 03:19:27 2005
02-Nov 03:19 sinux-fd: ClientRunBeforeJob:
02-Nov 03:19 sinux-fd: ClientRunBeforeJob: Copyright (c) 1982, 2004, Oracle.  All rights reserved.
02-Nov 03:19 sinux-fd: ClientRunBeforeJob:
02-Nov 03:19 sinux-fd: ClientRunBeforeJob:
02-Nov 03:19 sinux-fd: ClientRunBeforeJob: Connected to:
02-Nov 03:19 sinux-fd: ClientRunBeforeJob: Oracle Database 10g Enterprise Edition Release 10.1.0.3.0 - Production
02-Nov 03:19 sinux-fd: ClientRunBeforeJob: With the Partitioning, OLAP and Data Mining options
02-Nov 03:19 sinux-fd: ClientRunBeforeJob:
02-Nov 03:19 sinux-fd: ClientRunBeforeJob:
02-Nov 03:19 sinux-fd: ClientRunBeforeJob: TO_CHAR(SYSDATE,'YY
02-Nov 03:19 sinux-fd: ClientRunBeforeJob: -------------------
02-Nov 03:19 sinux-fd: ClientRunBeforeJob: 2005-11-02 03:19:28
02-Nov 03:19 sinux-fd: ClientRunBeforeJob:
02-Nov 03:19 sinux-fd: ClientRunBeforeJob: Disconnected from Oracle Database 10g Enterprise Edition Release 10.1.0.3.0 - Production
02-Nov 03:19 sinux-fd: ClientRunBeforeJob: With the Partitioning, OLAP and Data Mining options
02-Nov 03:19 s10-sd: Volume "Vol0086" previously written, moving to end of data.
02-Nov 03:21 sinux-fd: ClientRunAfterJob: -su: line 8: ulimit: max user processes: cannot modify limit: Operation not permitted

See above.

...
more output

"That's all, folks!" (c) :-(

I run

# /etc/bacula/bconsole

and see

Connecting to Director 127.0.0.1:9101
1000 OK: nfs4p-dir Version: 1.36.1 (26 November 2004)
Enter a period to cancel a command.
*status 1
Using default Catalog name=MyCatalog DB=bacula
Automatically selected Storage: File
Connecting to Storage daemon File at 10.253.4.15:9103

s10-sd Version: 1.36.1 (26 November 2004) i386-pc-solaris2.10 solaris 5.10
Daemon started 02-Nov-05 20:10, 0 Jobs run since started.

Running Jobs:
No Jobs running.
====

Terminated Jobs:
 JobId  Level   Files          Bytes Status   Finished        Name
======================================================================
  3952  Incr      2,462      1,889,217 OK       02-Nov-05 01:21 ns02
  3953  Incr          1     33,512,324 OK       02-Nov-05 01:22 sinux
  3954  Incr         83     28,008,073 OK       02-Nov-05 01:23 dbh1-matroska
  3955  Incr          0              0 OK       02-Nov-05 01:23 dbh1-configs
  3956  Incr          0              0 OK       02-Nov-05 01:23 dbh1-home
  3957  Incr      1,418     84,707,857 OK       02-Nov-05 01:28 hpov-full
  3958  Incr         67    615,990,101 OK       02-Nov-05 01:37 dbh2-full
  3959  Full          1    186,384,929 OK       02-Nov-05 01:51 BackupCatalog
  3960  Full          5    181,932,043 OK       02-Nov-05 03:21 sinux-oracle
  3961  Incr      9,889  2,246,582,808 Cancel   02-Nov-05 10:22 cgatex-full
====

Device status:
Device "/d/0/bacula" is not open.
====

The last job is the most important - it's the mail server... :-(

It looks like that job hasn't failed but got cancelled - that status should, as far as I know, only happen as a direct result of user intervention.

If I leave this console open until the next morning and try to enter any
command after bacula-dir has crashed, the console dies too, being unable
to connect to the Director.

Ok, the DIR dies during the night.

You can do the following:
Either run the director with debug output enabled and capture the output. You'd call it with something like "./bacula-dir -v -d 200 -c /etc/bacula/bacula-dir >>/var/log/bacula-dir.output". Adjust paths and debug level to your needs... a debug level of 100 gives a good overview of the program flow, 400 results in lots and lots of details, and 900 gives you more than you will need to locate the problem, I guess. After the DIR crashes, you should investigate the last lines of the output and perhaps post them here. That may help to locate the problem.
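
The debug run above could be wrapped roughly like this. A sketch only: the function names are made up, the binary path is the one from the ps listing earlier in this mail, and the configuration and output files are left as arguments because they depend on the installation. The only addition to the command line above is "2>&1", so that anything written to standard error ends up in the same file.

```shell
#!/bin/sh
# Hypothetical wrappers around the debug run described above.
# Binary path as shown in the ps listing; override it if yours differs.
BACULA_DIR_BIN=${BACULA_DIR_BIN:-/usr/local/bacula/sbin/bacula-dir}

# start_dir_debug CONFIG OUTPUT [LEVEL]
# Start the Director with debug output appended to OUTPUT (default level 200).
start_dir_debug() {
    config=$1
    output=$2
    level=${3:-200}
    "$BACULA_DIR_BIN" -v -d "$level" -c "$config" >>"$output" 2>&1
}

# last_debug_lines OUTPUT [COUNT]
# Show the last COUNT lines (default 50) of the captured output.
# "tail -n" is the POSIX form; on Solaris /usr/xpg4/bin/tail may be needed.
last_debug_lines() {
    tail -n "${2:-50}" "$1"
}
```

For example, start_dir_debug with your Director configuration file and /var/log/bacula-dir.output in the evening, then last_debug_lines /var/log/bacula-dir.output the next morning once the Director has died.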

The other possibility is to run the DIR under the debugger - there are some instructions in the manual. It would be best if you know a little about how to work with gdb, though.

Finally, and I suspect that this would be something you'd end up with anyway, you could upgrade to the current release version 1.38. This version does fix some bugs, introduces some features, and requires only minor - if any - configuration changes. It does require a catalog upgrade, so you will want to read the instructions carefully :-)

I suspect that, if you have found a bug in Bacula, you will be forced to upgrade because it's unlikely that Kern will fix an older version.

I tried to search using quotes from the logs and messages I was getting,
but I haven't found anything that would match my problem. My
colleagues couldn't help me - they haven't seen any of this before.

I can certainly restart Bacula (with the startup script
/etc/rc3.d/S50bacula with target restart, for example) but the
problem persists - I see exactly what I just wrote here.

--------------------------------------------------------------

Smart guys that know what to do - please help!

Maybe I should quote some extra logs or some configs or something
else... I really want to ask a good question so a good answer could
be given.

Well, start with the debug log and probably the debugger. That should help to understand what happens when Bacula crashes. Or upgrade to 1.38 and see if that fixes your problem (which might easily happen). The upgrade itself is not a problem as long as you know how your installed version was built (options to configure) and have the necessary toolchain and libraries installed. The catalog upgrade can be a problem as you cannot easily revert to an older version...

Thanks for your attention. I really need help to make my problem clear
and solve it. Any good advice will move things from bad to good.

Good luck to everybody!

Well, and good luck with fixing your problems. Keep us informed, or post some more detailed information and I'm confident that it can be fixed.

Arno

--
IT-Service Lehmann                    [EMAIL PROTECTED]
Arno Lehmann                  http://www.its-lehmann.de

