On 03/21/12 06:27 PM, Stephen Thompson wrote:
> On 03/21/2012 09:46 AM, Marco van Wieringen wrote:
>> Stephen Thompson <stephen<at>seismo.berkeley.edu> writes:
>>> This seems similar, but in my case, I'm not waiting for a tape to be
>>> loaded, the proper tape is already in the drive, but being written to by
>>> another job.  I see how it's possible the code is treating both cases
>>> the same, that the drive is unavailable (for whatever reason), therefore
>>> the starting job must wait.
>>>
>>> But there's a logical distinction, where if my starting job had started
>>> when the job that's despooling (and blocking the drive) was spooling, my
>>> job would have started and be happily spooling along while the other job
>>> despools.  There's no contention for an appropriate volume in this case
>>> and it seems like (logically at least) my job should start.
>>>
>> Taking a quick peek at the code:
>>
>> src/stored/append.c:do_append_data
>>
>> calls acquire_device_for_append, which as far as I know wants
>> a tape drive it can write to. Are your jobs in state running,
>> or only scheduled, e.g. the ones that are waiting to spool?
>> I ask because after acquire_device_for_append succeeds, the
>> state of the job is set to RUNNING.
>>
>
> Interesting.  Yes, my "jobs that are not spooling" are in a "running" 
> state, which means they have moved past the drive acquire, which is 
> where I thought they might have been blocking.  Apparently not.
>
>
OK, that is at least something: the acquire is not the problem. Great.
I'm also reading this particular piece of the code for the first time.
I know the dir, fd and catalog code by heart now, but the sd is something
I haven't done much on yet, other than extending it for some development
of my own. So the things I say are just my first guess.

>> The whole spooling is set up after that, and then it
>> writes the session header, and after that the
>> fd is told it can send its data.
>>
>> I guess you are stuck there. It may not be obvious, but maybe
>> we could only reserve the drive and start the spooling,
>> but then we need to write the session header the moment we
>> start despooling, after we have actually acquired the storage
>> which we reserved earlier. It's not obvious whether this might work,
>> and as the SD is kind of interesting I'm not sure this
>> is going to work.
>>
>
>
> Do you mean "spooling" when you say "despooling" in this last paragraph?
>
Nope. I was thinking the acquire blocked, but it seems it does not.
If it did, we could work around that by reserving the drive instead of
acquiring it. But then, as soon as you start despooling, you have to do
the work you skipped before. Since the blocking does not seem to be in
the acquire, I wonder whether the job gets to the point where the sd
tells the fd to send its data, because after that I would expect
spooling to take place.
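
To make the reserve-instead-of-acquire idea concrete, here is a minimal
sketch. It is not Bacula code: Drive, Job, reserve_drive, spool_data and
despool are hypothetical names invented for illustration, and the real
reservation and acquire logic in the SD is far more involved. The only
point is the ordering: the session header is written at despool time,
once the drive is really owned, so spooling can start while another job
holds the drive.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a tape drive; not a Bacula type.
struct Drive {
    bool busy = false;               // true while another job is writing
    std::vector<std::string> tape;   // records written to the volume, in order
};

// Hypothetical stand-in for a job inside the storage daemon.
struct Job {
    std::string name;
    bool reserved = false;           // drive is earmarked, not yet owned
    bool header_written = false;     // session header is on tape
    std::vector<std::string> spool;  // data spooled to disk
};

// Step 1: reserve only. This never blocks, even if the drive is busy.
void reserve_drive(Job &job) { job.reserved = true; }

// Step 2: spool data to disk. No drive access is needed.
void spool_data(Job &job, const std::string &chunk) {
    job.spool.push_back(chunk);
}

// Step 3: despool. Returns false if the drive is still busy, in which
// case nothing is written and the caller has to retry later.
bool despool(Job &job, Drive &drive) {
    assert(job.reserved);            // despooling without a reservation is a bug
    if (drive.busy) {
        return false;
    }
    drive.busy = true;               // the real acquire happens only now
    if (!job.header_written) {       // the work that was skipped at job start
        drive.tape.push_back("SOS:" + job.name);
        job.header_written = true;
    }
    for (const auto &chunk : job.spool) {
        drive.tape.push_back(chunk);
    }
    job.spool.clear();
    drive.busy = false;              // release the drive for the next job
    return true;
}
```

In this model a job can call reserve_drive and spool_data while
drive.busy is true; only despool has to wait for the drive.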

> I'm not following you, though I don't have an understanding of each 
> step in the process.  Are you saying that something (session header) 
> is written to tape when a job starts, but before it starts spooling?  
> That would certainly explain the job being blocked.  If that something 
> is written to the database, I don't see why it would be blocking.

This is the exact code:

    if (!acquire_device_for_append(dcr)) {
       jcr->setJobStatus(JS_ErrorTerminated);
       return false;
    }

    jcr->sendJobStatus(JS_Running);

    if (dev->VolCatInfo.VolCatName[0] == 0) {
       Pmsg0(000, _("NULL Volume name. This shouldn't happen!!!\n"));
    }
    Dmsg1(50, "Begin append device=%s\n", dev->print_name());

    begin_data_spool(dcr);
    begin_attribute_spool(jcr);

    Dmsg0(100, "Just after acquire_device_for_append\n");
    if (dev->VolCatInfo.VolCatName[0] == 0) {
       Pmsg0(000, _("NULL Volume name. This shouldn't happen!!!\n"));
    }
    /*
     * Write Begin Session Record
     */
    if (!write_session_label(dcr, SOS_LABEL)) {
       Jmsg1(jcr, M_FATAL, 0, _("Write session label failed. ERR=%s\n"),
          dev->bstrerror());
       jcr->setJobStatus(JS_ErrorTerminated);
       ok = false;
    }
    if (dev->VolCatInfo.VolCatName[0] == 0) {
       Pmsg0(000, _("NULL Volume name. This shouldn't happen!!!\n"));
    }

    /* Tell File daemon to send data */
    if (!fd->fsend(OK_data)) {
       berrno be;
       Jmsg1(jcr, M_FATAL, 0, _("Network send error to FD. ERR=%s\n"),
             be.bstrerror(fd->b_errno));
       ok = false;
    }

write_session_label seems only to fill the DCR with the start of the
backup, so nothing is flushed at that point, despite what the name
seems to indicate.
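
As a toy model of what "fill but not flush" would mean, assuming the
label is only appended to an in-memory block: Block, Device, flush_block
and write_record below are hypothetical names, not the SD's real types
or functions, and whether write_session_label really behaves this way
still has to be confirmed by reading the code or running under debug.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory block that collects records before a device write.
struct Block {
    std::size_t capacity;            // maximum number of bytes the block holds
    std::string buf;                 // records buffered so far
};

// Hypothetical device; 'written' holds the blocks that really reached it.
struct Device {
    std::vector<std::string> written;
};

// Push the buffered bytes to the device and empty the block.
void flush_block(Block &b, Device &dev) {
    if (!b.buf.empty()) {
        dev.written.push_back(b.buf);
        b.buf.clear();
    }
}

// Append a record to the block. The device is touched only when the
// record does not fit next to what is already buffered.
// Returns true if a device write happened.
bool write_record(Block &b, Device &dev, const std::string &rec) {
    assert(rec.size() <= b.capacity);   // a record must fit in an empty block
    bool touched_device = false;
    if (!b.buf.empty() && b.buf.size() + rec.size() > b.capacity) {
        flush_block(b, dev);
        touched_device = true;
    }
    b.buf += rec;
    return touched_device;
}
```

If the label write works like write_record here, then writing a small
label into an empty block would not need the drive at all, and the
blocking would have to come from somewhere else.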

>
> When my system is free today, I'm going to try to collect more 
> information under controlled circumstances.
>
Running the sd and fd under debugging will at least show why it's
blocking. Anything else is just guessing based on the code, of which,
as I said, I have no deep knowledge.

Marco

_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel
