Re: [Bacula-users] Watchdog timer killed long-running backup

Alan Davis Tue, 06 Mar 2007 07:32:08 -0800

On Tue, 6 Mar 2007 09:39:35 +0100
  Kern Sibbald <[EMAIL PROTECTED]> wrote:
> On Monday 05 March 2007 23:57, Alan Davis wrote:
>> I understand the sanity check - but the job wasn't idle - the FD and SD
>> were both working and data was being written to tapes as expected for 6
>> days.
>>
>> Would the director not know that the job was running and just assume
>> that no job could take longer than the hard-coded timeout?
> 
> I don't know the answer to that question -- I suggest you look at the code.
> It should be looking at the socket use counts, but perhaps it does not.
> 
> My personal opinion is that any job that runs 6 days is totally insane.  You
> have about 0.000001% chance of ever being able to restore from it, and/or use
> it as a basis for additional Incremental/Differential backups.  Also the data
> on that backup (IMO) is not valid unless the machine was idle for those 6
> days.


I realize that I'm pushing the limits in a number of ways. 
My environment isn't completely unique, but it's not a 
standard business datacenter: I have multiple 30TB 
datastores.

The "industry" doesn't seem to have any good, cheap 
alternatives to large-scale backups and archiving. For 
example, Quantum's new DXi series of disk-based backup 
appliances top out at 11TB. I'd need 3 of them just to do 
the initial full backup of one of my datastores.

The goal of this first backup was to establish a 
tape-based archive. Future backups will be based on lists 
of files generated by a db query from our release process 
rather than using the incremental/differential file 
selection mechanism.
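
As a rough illustration of the file-list approach described above, here is a minimal Python sketch that turns a database query into a plain-text list of paths, one per line. The `release_files` table, its columns, and the sample paths are hypothetical stand-ins for the release database, which is not described in this thread. SQLite is used only so the sketch is self-contained. My understanding is that a Bacula FileSet can read its file list from such a file with an entry of the form `File = "</path/to/list"`, but check that against the FileSet documentation for your version.

```python
import os
import sqlite3
import tempfile


def write_file_list(conn, release_id, list_path):
    """Write the paths for one release to list_path, one per line.

    Returns the number of paths written. The table and column names
    are assumptions made for this sketch.
    """
    cur = conn.execute(
        "SELECT path FROM release_files WHERE release_id = ? ORDER BY path",
        (release_id,),
    )
    paths = [row[0] for row in cur]
    with open(list_path, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(path + "\n")
    return len(paths)


def make_demo_db():
    """Build an in-memory database holding a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE release_files (release_id INTEGER, path TEXT)")
    conn.executemany(
        "INSERT INTO release_files VALUES (?, ?)",
        [
            (1, "/data/store1/b.dat"),
            (1, "/data/store1/a.dat"),
            (2, "/data/store1/c.dat"),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    list_path = os.path.join(tempfile.mkdtemp(), "release-1.list")
    count = write_file_list(conn, 1, list_path)
    print(count, "paths written to", list_path)
```

Sorting the paths keeps the list stable between runs, which makes it easier to compare the lists produced for successive releases.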

I could break the 30TB up into several individual backups, 
but that presents its own issues for recovery and 
restoration.

I'll move further questions about the code to the 
developers' list.

Thanks for the quick responses to my questions. I get 
better support here than I've ever had from Veritas or 
Legato.

> 
> IMO, you need to re-think how you are doing backups.  If that doesn't appeal
> to you, you can always increase the timeout, but again IMO, you are just
> heading for trouble later.
> 
>>
>> The message seemed to indicate that the director was trying to talk to
>> the FD but couldn't, or was expecting a response to the mount that it
>> never got.
>>
>>
>> ----
>> Alan Davis
>> Senior Architect
>> Ruckus Network, Inc.
>> 703.464.6578 (o)
>> 410.365.7175 (m)
>> [EMAIL PROTECTED]
>> alancdavis AIM
>>
>> > -----Original Message-----
>> > From: Kern Sibbald [mailto:[EMAIL PROTECTED]
>> > Sent: Monday, March 05, 2007 2:55 PM
>> > To: bacula-users@lists.sourceforge.net
>> > Cc: Alan Davis
>> > Subject: Re: [Bacula-users] Watchdog timer killed long-running backup
>> >
>> > This is not a bug, but rather an insanity check.  If you want to have idle
>> > jobs remain in the system longer, take a look at src/lib/watchdog.c --
>> > someplace in that file there should be a tag that sets the timeout, which
>> > you can make longer as you wish.
>> >
>> > On Monday 05 March 2007 20:35, Alan Davis wrote:
>> > > I was running a very large archival backup and about 20 hours into the
>> > > backup I ran out of tapes that had the recycle flag set. I updated the
>> > > flags and purged the first tape. The system then loaded the next tape
>> > > and continued the backup. The SD (or FD), however, never signaled the
>> > > DIR that the job had resumed and it stayed in "waiting for appendable
>> > > Volume" (JS_WaitMedia) for 518415 secs (6 days) and then the DIR killed
>> > > the job with the messages:
>> > >
>> > > 04-Mar 17:17 gannon-dir: LiveArchiveJob.2007-02-26_17.16.43 Error:
>> > > Watchdog sending kill after 518415 secs to thread stalled reading File
>> > > daemon.
>> > > 04-Mar 17:17 gannon-dir: LiveArchiveJob.2007-02-26_17.16.43 Fatal error:
>> > > Network error with FD during Backup: ERR=Interrupted system call
>> > > 04-Mar 17:17 gannon-dir: LiveArchiveJob.2007-02-26_17.16.43 Fatal error:
>> > > No Job status returned from FD.
>> > >
>> > > The SD, FD and DIR are all running on the same node so network problems
>> > > between them did not cause the timeout.
>> > >
>> > > The wait status seems to come from the SD and is reported by the DIR,
>> > > but the kill message from the DIR indicates that not being able to
>> > > communicate with the FD was the reason it killed the job.
>> > >
>> > > I've looked at some of the code and the best candidate that I've found
>> > > so far for where a problem might cause this is in
>> > > filed/heartbeat.c:sd_heartbeat_thread or somewhere in the acquire/mount
>> > > code that a message isn't being sent back to the DIR.
>> > >
>> > > Due to the long runtime of the backup it's not practical for me to try
>> > > to duplicate the problem exactly. I will try to create a reproducer with
>> > > a smaller backup set once I have the archive backup completed.
>> > >
>> > > Any insight on the possible cause(s) would be greatly appreciated.
>> > >
>> > >
>> > > ----
>> > > Alan Davis
>> > > Senior Architect
>> > > Ruckus Network, Inc.
>> > > 703.464.6578 (o)
>> > > 410.365.7175 (m)
>> > > [EMAIL PROTECTED]
>> > > alancdavis AIM


-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
