<[EMAIL PROTECTED]> aka Arno Lehmann wrote on Sun, 02 Mar 2008 12:50:17 +0100 in m2n.bacula.devel:
|> 2. This is the thing that I have been worrying the most about. I
|> have been following various theories about what might happen
|> there, yet to no avail. The last of my theories was that it might
|> have to do with the migrations, but currently I tend to dismiss
|> this theory also. In fact, I am still clueless.
|> What happens is that the Director puts all jobs (and all newly
|> started jobs) into either "waiting on max Storage jobs" or
|> "waiting execution", while there is no job running on any client
|> and no job running on the SD. It just does nothing and has to
|> be restarted.
|
|That definitely qualifies as a bug... have you tried looking at the
|debug output, once the DIR is in this state?

This was a good hint. The debug output shows this:

>BxDir: jcr.c:603-0 OnEntry JobStatus=s set=s
>BxDir: jcr.c:623-0 OnExit JobStatus=s set=s
>BxDir: jobq.c:701-0 Wstore=Files
>BxDir: jobq.c:723-0 Fail wncj=-2

I have also seen rncj=-2 and rncj=3.

Looking into jobq.c, I find that rncj is never supposed to take any
value except 0 and 1 (maximum one read job per device). OTOH, I find
that rncj is not a counter of its own - it is just the
NumConcurrentJobs of a Storage device, the same field that wncj
refers to.

So this seems not to be a migration issue; it seems to be a problem
with multi-drive autoloaders. According to the manual, since Bacula
version 1.whatever an autoloader has to be defined as a single device
in the DIR. So, if this autoloader has multiple drives, it is quite
possible that these drives are used for reading AND writing at the
same time. And this seems to break the rncj/wncj logic.

My current most likely interpretation runs like this:
Suppose we have one restore running: rncj=1.
Then two backups start on the same Storage: wncj=rncj=3.
Then the restore terminates and sets rncj=0.
So, when the two backup jobs terminate, the counter goes to -2 - and
this is where the show ends.

I am now trying the following as a fix to see if it helps.

rgds,
PMc

--- src/dird/jobq.c.orig	Mon Dec 10 18:54:41 2007
+++ src/dird/jobq.c	Sun Mar  9 00:27:02 2008
@@ -478,7 +478,8 @@
        */
       if (jcr->acquired_resource_locks) {
          if (jcr->rstore) {
-            jcr->rstore->NumConcurrentJobs = 0;
+            if (jcr->rstore->NumConcurrentJobs > 0)
+               jcr->rstore->NumConcurrentJobs--;
             Dmsg1(200, "Dec rncj=%d\n", jcr->rstore->NumConcurrentJobs);
          }
          if (jcr->wstore) {
@@ -738,7 +739,8 @@
             Dmsg1(200, "Dec wncj=%d\n", jcr->wstore->NumConcurrentJobs);
          }
          if (jcr->rstore) {
-            jcr->rstore->NumConcurrentJobs = 0;
+            if (jcr->rstore->NumConcurrentJobs > 0)
+               jcr->rstore->NumConcurrentJobs--;
             Dmsg1(200, "Dec rncj=%d\n", jcr->rstore->NumConcurrentJobs);
          }
          set_jcr_job_status(jcr, JS_WaitClientRes);
@@ -753,7 +755,8 @@
             Dmsg1(200, "Dec wncj=%d\n", jcr->wstore->NumConcurrentJobs);
          }
          if (jcr->rstore) {
-            jcr->rstore->NumConcurrentJobs = 0;
+            if (jcr->rstore->NumConcurrentJobs > 0)
+               jcr->rstore->NumConcurrentJobs--;
             Dmsg1(200, "Dec rncj=%d\n", jcr->rstore->NumConcurrentJobs);
          }
          jcr->client->NumConcurrentJobs--;
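
To make the arithmetic of the suspected scenario concrete, here is a
minimal stand-alone sketch. It is a toy model and not Bacula code: the
struct, the function names and the enum are hypothetical, and it
assumes that a write job releases the Storage with a plain decrement
and that the read and write side share one counter, as suspected
above.

```c
/* Toy model of the suspected counter problem; NOT Bacula code.
 * Assumption: rstore and wstore point at the same Storage resource,
 * so read and write jobs share a single counter. */

struct storage {
    int num_concurrent_jobs;   /* stands in for NumConcurrentJobs */
};

/* How a finishing read job releases the Storage (hypothetical names). */
enum release_mode {
    RELEASE_RESET,        /* old behaviour: counter = 0             */
    RELEASE_GUARDED_DEC   /* patched behaviour: decrement if > 0    */
};

/* Any job that starts on the Storage bumps the shared counter. */
static void acquire(struct storage *s)
{
    s->num_concurrent_jobs++;
}

/* Assumed behaviour of a finishing write job: plain decrement. */
static void release_write(struct storage *s)
{
    s->num_concurrent_jobs--;
}

static void release_read(struct storage *s, enum release_mode mode)
{
    if (mode == RELEASE_RESET) {
        s->num_concurrent_jobs = 0;
    } else if (s->num_concurrent_jobs > 0) {
        s->num_concurrent_jobs--;
    }
}

/* One restore and two backups overlap; the restore finishes first.
 * Returns the value of the shared counter after all jobs have ended. */
int simulate(enum release_mode mode)
{
    struct storage autochanger = { 0 };

    acquire(&autochanger);             /* restore starts  */
    acquire(&autochanger);             /* backup 1 starts */
    acquire(&autochanger);             /* backup 2 starts */

    release_read(&autochanger, mode);  /* restore ends    */
    release_write(&autochanger);       /* backup 1 ends   */
    release_write(&autochanger);       /* backup 2 ends   */

    return autochanger.num_concurrent_jobs;
}
```

In this model simulate(RELEASE_RESET) ends at -2, matching the
"Fail wncj=-2" line in the debug output, while
simulate(RELEASE_GUARDED_DEC) ends at 0.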