I'm running into a few issues with enterprise-scale backups and have a wishlist of major/minor tweaks I'd like to see.
We've recently obtained a new 7-drive Neo8000 autochanger (which "works great" out of the box), but it has exposed some Bacula issues relating to simultaneous jobs and to limiting resource hogging on clients, the server, and clustered filesystem arrays.

Wish 1: It would be nice to allow a Pool or Job a choice between more than "one or all" of the available drives (that's effectively the choice "Prefer Mounted Volumes" gives).

Wish 2: (Related) It would be useful to have maximum concurrent job limits per Pool.

Wish 3: It would also be good to have Bacula take notice of the per-drive limits on an autochanger. Right now, if you choose the autochanger as the backup device, its limit is the only control, and that can result in one drive running 10 jobs (even if the individual drive entry has a Maximum Concurrent Jobs setting lower than this) while the others sit idle, because the Director is waiting on the Storage resource's job limit. This can result in time-consuming fileserver backups preventing desktop backups from taking place, etc.

Wish 4: It would be good to be able to define a group of clients and then set a maximum concurrent job limit for that group (a config sketch of what I mean follows below, after Wish 5).

Why: Linux NFS server code (and that of most other OSes) is fundamentally broken and unsafe for clustering, because it ignores filesystem locks set by other processes. As a result it's unsafe (risk of data corruption) to allow any activity on a clustered filesystem from any node OTHER than the one acting as the NFS server. (Which raises the question of why bother with clustered filesystems at all; the answer is that they're useful in a high-availability environment because the NFS service can be transferred to another node in seconds.)

(The same problem also raises a risk of data corruption on any Linux system acting as an NFS fileserver, clustered or not! The only safe way to export multiple protocols simultaneously is to export them from an NFS client and not to run any process on the NFS fileserver which directly manipulates the NFS-exported filesystem.)

On top of the above problems, with GFS (and most other clustered filesystems) a read/write lock must be propagated across the cluster for every file being opened on each cluster node (NFS ignores this!), which can drive network load up dramatically on an incremental backup as well as hitting actual backup rates quite hard. One node is usually notionally the master for any filesystem, and in general the master is decided dynamically: it's whichever node is actually making the most lock requests.

To accommodate this problem I've had to define a virtual Bacula client per NFS service. That client follows the filesystem's location, but it breaks the restrictions I was previously able to enforce using per-client job limits. This is a major problem, because most of our filesystems live on a couple of 40TB nearline storage arrays, and as the number of simultaneous backups increases their performance falls away rapidly. The current situation allows backups to badly affect NFS server performance, which users have noticed and are complaining loudly about. I really need to restrict the number of simultaneous backups coming out of any given array, and the only feasible way seems to be to group clients and then impose a simultaneous job limit across them.

Wish 5: Better optimisation/caching of directory lists (is this possible?). Most of us are aware that the more entries there are in a directory, the slower it is to load. Users are not aware of this, and they resent being told to keep things in hierarchical layouts instead of one large flat space.
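To make Wishes 3 and 4 concrete, here's roughly the kind of configuration I'm talking about. Names, devices, addresses and limits are illustrative, not copied from my real config, and the ClientGroup resource at the end is purely hypothetical syntax: it does NOT exist in Bacula today.

    # bacula-sd.conf -- the 7-drive changer (only one drive shown)
    Autochanger {
      Name = Neo8000
      Device = Drive-0, Drive-1          # ... through Drive-6
      Changer Device = /dev/sg10
      Changer Command = "/etc/bacula/scripts/mtx-changer %c %o %S %a %d"
    }

    Device {
      Name = Drive-0
      Drive Index = 0
      Media Type = LTO-4
      Archive Device = /dev/nst0
      Autochanger = yes
      AutomaticMount = yes
      # Wish 3: this per-drive limit is effectively ignored when jobs
      # are queued against the Autochanger device as a whole
      Maximum Concurrent Jobs = 2
    }

    # bacula-dir.conf -- today this is the only limit that actually
    # bites, so all 10 jobs can pile onto one drive while the rest idle
    Storage {
      Name = Neo8000
      Address = sd.example.org
      SDPort = 9103
      Password = "xxx"
      Device = Neo8000
      Media Type = LTO-4
      Autochanger = yes
      Maximum Concurrent Jobs = 10
    }

    # bacula-dir.conf -- HYPOTHETICAL resource for Wish 4, not real
    # syntax: cap simultaneous jobs across all the virtual clients that
    # live on one nearline array, whichever node they currently follow
    ClientGroup {
      Name = nearline-array-1
      Client = nfssvc-a, nfssvc-b, nfssvc-c
      Maximum Concurrent Jobs = 2
    }

Something along those lines would let me throttle per-array regardless of which virtual client a given job runs against.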
On the Wish 5 front: GFS and GFS2 behave incredibly badly if there are a lot of files in a directory. I've seen them take 5-6 minutes to open a directory with 10,000 entries and up to 30 minutes to open one with 100,000 files in it (this not only affects the process doing the opening, it also slows the entire filesystem down for all users). When Bacula hits a directory with a lot of files in it on an incremental backup, things get even slower. :-(

Feedback and ideas are welcome. Telling me not to use GFS doesn't tell me anything I don't already know; I'm stuck with it for the moment and am avoiding any more cluster deployments until Red Hat makes it work properly. (I have a 400TB deployment to handle in 2 weeks, which will be XFS or ext4 for the meantime.)

Alan