On Thursday 31 December 2009 14:06:26 Martin Simmons wrote:
> >>>>> On Wed, 30 Dec 2009 09:44:49 +0100, Jesper Krogh said:
> >
> > Martin Simmons wrote:
> > >>>>>> On Tue, 29 Dec 2009 11:05:09 +0100, Jesper Krogh said:
> > >>
> > >> Kern Sibbald wrote:
> > >>> The Kaboom chapter of the manual tells you how to run the Director
> > >>> under the debugger.  You can also attach to the Director while it is
> > >>> running, using:
> > >>>
> > >>>   cd <bacula-binary-directory>
> > >>>   gdb bacula-dir <pid-of-director>
> > >>
> > >> Just under a month later, the problem is still present: it takes
> > >> hours to get from "done" to the point where the actual restore
> > >> starts. I've managed to get a backtrace:
> > >>
> > >> Thread 2 (Thread 0x42767950 (LWP 10832)):
> > >> #0  0x000000000040a4a8 in add_findex (bsr=0x6dd468, JobId=32927, findex=132808) at bsr.c:554
> > >> #1  0x0000000000432965 in restore_cmd (ua=0x6dcb08, cmd=<value optimized out>) at ua_restore.c:1094
> > >> #2  0x0000000000425e56 in do_a_command (ua=0x6dcb08, cmd=0x6d5b10 "1") at ua_cmds.c:180
> > >> #3  0x0000000000438781 in handle_UA_client_request (arg=<value optimized out>) at ua_server.c:147
> > >> #4  0x000000000046cb8b in workq_server (arg=<value optimized out>) at workq.c:357
> > >> #5  0x00007f233e9553f7 in start_thread () from /lib/libpthread.so.0
> > >> #6  0x00007f233db1db4d in clone () from /lib/libc.so.6
> > >> #7  0x0000000000000000 in ?? ()
> > >>
> > >> Repeating this with "continue/interrupt" gives the same trace but
> > >> with different findex= values.
> > >>
> > >> The restore block looks like this:
> > >>
> > >> +--------+-------+-----------+-----------------+---------------------+------------+
> > >> | JobId  | Level | JobFiles  | JobBytes        | StartTime           | VolumeName |
> > >> +--------+-------+-----------+-----------------+---------------------+------------+
> > >> | 32,927 | F     | 3,183,314 | 684,965,311,013 | 2009-12-12 17:31:41 | 000779L3   |
> > >> | 32,927 | F     | 3,183,314 | 684,965,311,013 | 2009-12-12 17:31:41 | 000789L3   |
> > >> | 32,927 | F     | 3,183,314 | 684,965,311,013 | 2009-12-12 17:31:41 | 000804L3   |
> > >> | 32,927 | F     | 3,183,314 | 684,965,311,013 | 2009-12-12 17:31:41 | 001805L3   |
> > >> | 32,927 | F     | 3,183,314 | 684,965,311,013 | 2009-12-12 17:31:41 | 001806L3   |
> > >> | 32,927 | F     | 3,183,314 | 684,965,311,013 | 2009-12-12 17:31:41 | 001807L3   |
> > >> | 33,446 | D     |   136,256 |  50,695,957,124 | 2009-12-28 08:01:50 | 004048L4   |
> > >> | 33,473 | I     |     1,224 |  16,023,974,683 | 2009-12-28 14:41:19 | 004059L4   |
> > >> | 33,501 | I     |    11,188 |  24,448,676,227 | 2009-12-29 01:40:23 | 004059L4   |
> > >> +--------+-------+-----------+-----------------+---------------------+------------+
> > >>
> > >> I'm on 2.4.3 and the bsr.c:554 is
> > >>
> > >>    /* Walk down fi chain and find where to insert insert new FileIndex */
> > >>    for ( ; fi; fi=fi->next) {
> > >>       if (findex == (fi->findex2 + 1)) {  /* extend up */
> > >>          RBSR_FINDEX *nfi;
> > >>          fi->findex2 = findex;
> > >>
> > >> If I get some more time I'll try to add debug information to find out
> > >> where it's actually looping. Suggestions are certainly welcome.
> > >
> > > It might be a variant of this problem:
> > >
> > > http://article.gmane.org/gmane.comp.bacula.user/54164/match=add%5ffindex
> >
> > It looks quite a lot like the same problem. But I diffed the bsr.c of
> > the latest version against the 2.4.3 one and there are no changes.
> > Since an upgrade is not reversible, I would prefer not to be "forced"
> > into it, but to do it at a time when I have enough time for testing.
> >
> > Can you point to the changes that are supposed to deal with the problem?
>
> AFAIK, bsr.c didn't change.  The fix was in the code that builds the tree,
> which now sorts the items by FileIndex.  The function db_get_file_list was
> added and is called by build_directory_tree in ua_restore.c.

Yes, and in addition (I forget at which point) we also changed the in-memory
tree from linked lists to red-black binary trees.
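
To illustrate why sorting by FileIndex matters for a loop like the one quoted
from bsr.c, here is a minimal sketch in C. It is not Bacula's code: struct
range, add_index and total_visits are hypothetical names, and the list only
mimics the "extend up" idea. It counts how many chain nodes are visited when
indexes arrive in ascending order versus an interleaved order.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a FileIndex range; not Bacula's RBSR_FINDEX. */
struct range {
    int first;              /* lowest FileIndex covered by this node  */
    int last;               /* highest FileIndex covered by this node */
    struct range *next;
};

/* Add findex to the chain and return the number of nodes visited. */
long add_index(struct range **head, int findex)
{
    long visited = 0;
    struct range *r, *prev = NULL;

    for (r = *head; r; prev = r, r = r->next) {
        visited++;
        if (findex >= r->first && findex <= r->last)
            return visited;               /* already covered */
        if (findex == r->last + 1) {      /* extend up */
            r->last = findex;
            return visited;
        }
        if (findex == r->first - 1) {     /* extend down */
            r->first = findex;
            return visited;
        }
    }
    /* No adjacent range found: append a new single-entry node. */
    r = malloc(sizeof(*r));
    assert(r != NULL);
    r->first = r->last = findex;
    r->next = NULL;
    if (prev)
        prev->next = r;
    else
        *head = r;
    return visited;
}

/* Total nodes visited when adding indexes 1..n.
 * sorted != 0: ascending order; sorted == 0: all evens, then all odds. */
long total_visits(int n, int sorted)
{
    struct range *head = NULL, *next;
    long total = 0;
    int i;

    if (sorted) {
        for (i = 1; i <= n; i++)
            total += add_index(&head, i);
    } else {
        for (i = 2; i <= n; i += 2)       /* each even starts a new range */
            total += add_index(&head, i);
        for (i = 1; i <= n; i += 2)       /* odds then fill the gaps */
            total += add_index(&head, i);
    }
    while (head) {                        /* free the chain */
        next = head->next;
        free(head);
        head = next;
    }
    return total;
}
```

With ascending input the chain stays a single node, so the work grows
linearly with the number of files; with the interleaved order the chain grows
with the input and the total work grows roughly quadratically.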

In general, unless you have a huge number of files in the set being prepared,
building the bsr should not take long.  Sorting out the JobMedia index
records *is* very compute-intensive, but in general there should not be too
many of them.  If you find that there are a lot of JobMedia records, then
perhaps someone has poorly configured the "Maximum File Size" directive in
the Storage daemon Device resource.  It should be set to a minimum of 1G,
and if you have either an LTO-4 drive or a huge number of files in the
FileSet, then it probably should be set to 5G.
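
As a rough back-of-the-envelope check, the sketch below estimates how many
index records a job might produce for a given Maximum File Size. It rests on
two assumptions that are not stated above: that roughly one JobMedia record is
written per tape file, and that 1G means 2^30 bytes. The function name
estimate_jobmedia_records is hypothetical and is not part of Bacula.

```c
#include <assert.h>
#include <stdint.h>

/* Estimate the number of index records for a job, ASSUMING one record is
 * written each time max_file_size bytes have gone to tape.  Real counts
 * will differ (volume changes, interleaved jobs, and so on). */
uint64_t estimate_jobmedia_records(uint64_t job_bytes, uint64_t max_file_size)
{
    assert(max_file_size > 0);
    return (job_bytes + max_file_size - 1) / max_file_size;   /* round up */
}
```

Under those assumptions, the 684,965,311,013-byte Full job from the restore
table would come out at about 638 records with a 1G setting and about 128
with 5G.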

The manual has a bit on configuring the Maximum File Size ...

Regards,

Kern


_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel
