Kern Sibbald wrote:
On Tuesday 22 March 2005 22:00, Jeff McCune wrote:
Basically, the dbcheck program was operating as it should.  However, the operation I asked it to perform ran for over *three days* without any sign of completing soon.  After performing the proposed optimization, the operation still ran for 26+ hours, which effectively makes it useless.  An operation that holds a lock on the catalog for over 26 hours is not acceptable in a production backup system.

A couple of points here. Normally, one should not have to run the dbcheck program. It is *not* something to run every day, but something to run when your database is screwed up. If both Bacula and your underlying database program function correctly, then you should never have to run dbcheck. In 5 years, I think I have run it twice on my db -- both times because of a Bacula bug which is now well behind us.

I agree that dbcheck should be treated as a last-resort utility for trying to recover catalog information. With that said, we still find ourselves in a situation where it doesn't complete in a reasonable amount of time for a "large" database. "Large" is vague, but would you agree that the documentation should note that some of these operations run on very different time scales as the catalog grows?

I'm planning on digging pretty deep into this, as I'm highly motivated to make Bacula work well for us. That said, my principal fear right now is that I'll push the limits of one operation, say backup and retention, and then go to do a restore and find that the current implementation can't complete it in a reasonable amount of time.

So, to put my money where my mouth is, I'll start looking at the time complexity of the major operations myself.

The other point is that I have made no attempt to optimize the longest running part of the dbcheck code. Some smart DB guy could undoubtedly speed it up by at least a factor of 10 and possibly a factor of 100. The pruning algorithm in Bacula core code has the potential to run for equally long times, but it is much more sophisticated (i.e. lots more lines of code), and so it breaks the task up into smaller chunks, uses *lots* of memory, and if there is too much work to do it only prunes a part each time it is called.

Probably better to just note in the documentation that dbcheck is unoptimized and may not run reasonably on "large" catalogs.
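
In the meantime, one thing I plan to try before the next dbcheck run is adding temporary indexes, since my understanding is that the orphaned Filename/Path checks do repeated lookups into File on columns that the stock schema doesn't index on their own. This is just a sketch against the MySQL catalog schema as I have it here; the index names are my own and I haven't measured the effect yet:

  -- Temporary indexes to speed up dbcheck's orphaned Filename/Path
  -- checks.  Index names are mine; column names are from the stock
  -- MySQL catalog schema and may differ on other versions.
  CREATE INDEX tmp_file_filenameid ON File (FilenameId);
  CREATE INDEX tmp_file_pathid     ON File (PathId);

  -- ... run dbcheck here ...

  -- Drop them afterwards so they don't slow down normal inserts.
  DROP INDEX tmp_file_filenameid ON File;
  DROP INDEX tmp_file_pathid     ON File;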

If I want to keep 2 full backups in the catalog, I should expect *at least* 50 million file records at any given time.

Am I naive to try to cram this much information into one MySQL database? Should I be splitting this up across multiple catalogs?

I cannot personally answer these questions, but I think you can by running some tests. One of the great strengths of Bacula is its use of an SQL database -- something no other backup program does, at least to my knowledge. At the same time, SQL DB engines are not known for their speed, so there are bound to be some growing pains until we figure out how to configure and run such gigantic DBs. In the short run you might think seriously about how you can reduce your retention periods.

I will be reducing my retention periods substantially for the short term.

Regarding managing the File catalog, have you considered using MySQL MERGE tables, i.e. the MERGE storage engine?

I personally don't have experience with it, but a professional colleague mentioned the idea to me as I explained the problems I was running into with the large File table and maintaining the index as large chunks of the table were pruned and reinserted.

http://dev.mysql.com/doc/mysql/en/merge-storage-engine.html

It seems that this way of representing the File catalog is well suited to the task, since we typically insert and prune all rows related to a job, a volume, etc. as a logical group. The current method handles the operation row by row, which seems to carry significant overhead at this volume of data.

Imagine grouping all file records for a job into a single underlying table and then having a logical table that represents the current view of the File table. Pruning would then be relatively inexpensive, since you could simply unlink the underlying table rather than delete every row associated with that job.

This might also help with locking issues, as each table would be locked separately from the group.

Uniqueness might pose a problem, as a MERGE table cannot enforce uniqueness over the set of underlying tables.
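
To make the idea concrete, here's a rough SQL sketch of what I have in mind. The table names (File_job1001, File_all) and the trimmed-down column list are mine, not Bacula's, and I haven't tried wiring anything like this into the director -- it's only meant to show the shape of the pruning operation:

  -- Hypothetical per-job File tables plus a MERGE table over them.
  -- This is a sketch, not the current Bacula schema; the underlying
  -- tables must be identical MyISAM tables.
  CREATE TABLE File_job1001 (
    FileId     INTEGER UNSIGNED NOT NULL,
    FileIndex  INTEGER UNSIGNED NOT NULL DEFAULT 0,
    JobId      INTEGER UNSIGNED NOT NULL,
    PathId     INTEGER UNSIGNED NOT NULL,
    FilenameId INTEGER UNSIGNED NOT NULL,
    LStat      TINYBLOB NOT NULL,
    INDEX (JobId),
    INDEX (PathId),
    INDEX (FilenameId)
  ) ENGINE=MyISAM;

  CREATE TABLE File_job1002 LIKE File_job1001;

  -- The logical table the director would query in place of File.
  -- Note: a key declared here cannot enforce uniqueness across the
  -- underlying tables, which is the limitation mentioned above.
  CREATE TABLE File_all (
    FileId     INTEGER UNSIGNED NOT NULL,
    FileIndex  INTEGER UNSIGNED NOT NULL DEFAULT 0,
    JobId      INTEGER UNSIGNED NOT NULL,
    PathId     INTEGER UNSIGNED NOT NULL,
    FilenameId INTEGER UNSIGNED NOT NULL,
    LStat      TINYBLOB NOT NULL,
    INDEX (JobId),
    INDEX (PathId),
    INDEX (FilenameId)
  ) ENGINE=MERGE UNION=(File_job1001, File_job1002) INSERT_METHOD=LAST;

  -- Pruning job 1001 becomes a metadata change plus a table drop,
  -- instead of deleting millions of rows and updating the indexes:
  ALTER TABLE File_all UNION=(File_job1002);
  DROP TABLE File_job1001;

FileId assignment would also need some thought, since I've left off AUTO_INCREMENT; each underlying MyISAM table would otherwise keep its own counter.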

Should I investigate other optimizations to deal with this volume of information
inside the catalog?


Yes, as well as optimizations to dbcheck if you really feel the need to run it.


I don't have a "need" to run it regularly, but I'd like to be *able* to run it if the need arises. =) So I'll look at this as well.

It may be that Bacula isn't yet ready to manage a catalog of this volume, which is perfectly fine.

Perhaps. I have always worried about this, but apparently a good number of users have found workarounds -- one even backs up a million or more files in each job!

The other end of this, which you have not mentioned, is the time it takes the restore command to build the in-memory tree -- that is something that really needs more work.

I agree... Again, we could get into a situation where a backup job completes in a reasonable period of time, but a restore would not.
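
From what I can tell, building the tree means pulling every file record for the selected jobs through a join shaped something like the following -- I'm guessing at the exact form Bacula uses, and the JobIds are placeholders -- so the work grows with the number of File rows in those jobs, regardless of how few files you actually want back:

  -- A guess at the shape of the query behind restore tree building;
  -- the real query may differ.  JobIds 101 and 102 stand in for the
  -- jobs selected for restore.
  SELECT Path.Path, Filename.Name, File.FileIndex, File.JobId, File.LStat
    FROM File, Filename, Path
   WHERE File.JobId IN (101, 102)
     AND Filename.FilenameId = File.FilenameId
     AND Path.PathId = File.PathId;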

What are Bacula's limitations in terms of long-running operations versus catalog size? Are they related linearly, exponentially, etc.?

If someone knows the answer to this, he knows a lot more than I do; I would like to hear the answer and to document it.

I suppose I can dust off my CIS textbooks and revisit time complexity of various algorithms. =)
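
Before that, though, a quicker first step is probably to ask MySQL how it plans to execute the expensive statements. Something like this (the WHERE clause is just a stand-in for whatever prune or dbcheck is actually doing) shows whether an index is used or the whole File table gets scanned:

  -- EXPLAIN works only on SELECT in the MySQL versions we're running,
  -- so rewrite a slow DELETE as the equivalent SELECT first.
  EXPLAIN SELECT FileId
    FROM File
   WHERE JobId = 101;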

I think you probably did what a lot of users did not do -- you read the manual -- and unfortunately fell into the Multiple Connections problem, and I am sorry for that. The documentation did say that the directive had not been tested; in any case, I have now removed it from the manual, along with the ability to enable it in the code.


Actually, Multiple Connections is news to me. I've looked through my Subversion commits and I've never used that option.

I'm not entirely surprised that I ended up with a corrupt database though, for various reasons.

Regards,
--
Jeff McCune
OSU Department of Mathematics System Support
(614) 292-4962
gpg --keyserver pgp.mit.edu --recv-key BAF3211A
