On 2018-07-27 12:23, Stefan G. Weichinger wrote:
On 2018-07-27 17:02, Jean-Francois Malouin wrote:
You should also consider playing with dumporder.
I have it set to 'TTTTTTTT' and that makes the longest (time-wise)
dumps go first so that the fast ones get pushed to the end.
In one config I have:
dumporder "TTTTTTTT"
flush-threshold-dumped 100
flush-threshold-scheduled 100
taperflush 100
autoflush yes
so that all the dumps will wait until the longest ones are done.
It also won't go until it can fill one volume (100%). You can
obviously go further than that if you have enough holding disk.
Or at least it's my understanding...
(the ML was down for a while, which is the reason for my delayed
response; it should work now)
I checked "dumporder" in that config, it was "BTBT...", I changed it to
"TTT..." now for a test.
Although I am not 100% convinced that this will do the trick ;-)
We will see.
I never fully understood that parameter and its influence so far; to me
it's a bit "unintuitive".
Perhaps I can help with that.
Part of what Amanda's scheduling does is figure out the size that each
backup will be on each run (based on the estimate process), how much
bandwidth it will need while dumping (based on the bandwidth settings
for that particular dump type), and the amount of time it will take
(predicted based on the size, prior timing data, and possibly the
bandwidth). That information is then used together with the 'dumporder'
setting to control how each dumper chooses what dump to do next when it
finishes dumping. Each letter in the value corresponds to exactly one
dumper, and controls only that dumper's selection.
The size-based selection is generally the easiest to explain: it just
says to pick the largest (for 'S') or smallest (for 's') dump out of the
set and run that next.
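In rough Python terms (just my own illustration, not Amanda's code; the
hosts, numbers, and the "size" field are made up), 'S' and 's' boil down
to something like:

# estimated sizes of the dumps still waiting to run
pending = [{"host": "a", "size": 50}, {"host": "b", "size": 5}]
biggest  = max(pending, key=lambda d: d["size"])  # what an 'S' dumper picks
smallest = min(pending, key=lambda d: d["size"])  # what an 's' dumper picks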
The bandwidth-based selection is only relevant if you have bandwidth
settings configured. Without them, it treats all dumps as equal, and
picks the next dump based solely on the order that amanda has them
sorted (which, IIRC, matches the order found in the disk list). With
them, it uses a similar selection method to the size-based selection,
just looking at bandwidth instead of size.
The time-based selection is where things get tricky, but they get tricky
because of how complicated it is to predict how long a dump will take,
not because the selection is complicated (it works just like size-based
selection, just looking at estimated runtime instead of size). Pretty
much, the timing data is extrapolated by looking at previous dumps of
the DLE, correlating size and actual run-time. I'm not sure what
fitting method it uses for the extrapolation (my first guess would be
simple linear extrapolation, because that's easy and should work most of
the time), and I'm also not sure what, if any, impact bandwidth has on
the calculation.
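Just to make that concrete, here's the kind of extrapolation I mean as a
Python sketch. This is purely my own illustration of "simple linear
extrapolation", not Amanda's actual code, and the numbers are invented:

# Guess a dump's runtime from (size, runtime) pairs recorded for
# previous dumps of the same DLE, assuming runtime scales roughly
# linearly with size.  Illustration only.
def predict_runtime(history, new_size):
    if not history:
        return None  # nothing to extrapolate from
    total_size = sum(size for size, _ in history)
    total_time = sum(secs for _, secs in history)
    if total_size == 0:
        return None
    return new_size * (total_time / total_size)

# Three prior dumps at roughly 10 MB/s; a 6 GB dump comes out
# at about 600 seconds.
print(predict_runtime([(1_000_000_000, 100),
                       (2_000_000_000, 210),
                       (500_000_000, 45)], 6_000_000_000))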
So, in short you have:
* 'S' and 's': Simple deterministic selection based on the predicted
size of the dump.
* 'B' and 'b': Simple deterministic selection based on bandwidth
settings if they are defined, otherwise trivial FIFO selection.
* 'T' and 't': Not quite deterministic selection based on predicted
execution time of the dump process.
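Putting that list into a little Python sketch (again, my own
illustration of the idea, not Amanda's code, and the field names are
made up):

def pick_next(letter, pending):
    # pending: the dumps still waiting, each with the estimates
    # described above, e.g. {"size": ..., "bandwidth": ..., "time": ...}
    key = {"S": "size", "s": "size",
           "B": "bandwidth", "b": "bandwidth",
           "T": "time", "t": "time"}[letter]
    # Uppercase picks the largest value, lowercase the smallest.
    # (Without bandwidth settings every "bandwidth" value would be the
    # same, so 'B'/'b' just fall back to the order of the list.)
    choose = max if letter.isupper() else min
    return choose(pending, key=lambda d: d[key])

Each dumper simply calls this with its own letter from the dumporder
string whenever it finishes a dump and needs another one.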
So, for a couple of examples:
* The default setting 'BTBTBTBT': This will have half the dumpers select
dumps that will take the largest amount of time, and the other half
select the ones that will take the largest amount of bandwidth. This
works reasonably well if you have bandwidth settings configured and a
wide variance in dump size.
* What you're looking at testing, 'TTTTTTTT': This is a trivial case of
all dumpers selecting the dumps that will take the longest time. If
you're dumping almost all similar hosts, this will be essentially
equivalent to just selecting the largest. If you're dumping a wide
variety of different hosts, it will be equivalent to selecting the
largest on the first dump, but after that will select based on which
system takes the longest.
* What I use on my own systems, 'SSss' (I only run four dumpers, not
eight): This is a reasonably simple option that gives a good balance
between getting dumps done as quickly as possible, and not wasting time
waiting on the big ones. Two of the dumpers select whatever dump is the
largest, so that some of the big ones get started right away, while the
other two select the smallest dumps, so that those get backed up
immediately. I've done some really simple testing that indicates this
actually gets all the dumps done faster on average than the default, at
least when all your systems can dump data at about the same rate.
* What we use where I work, 'TTSSSSss': This is one where things get a
bit complicated. There are three different ways things get selected
here. First, two of the eight dumpers will select dumps that are going
to take the longest amount of time. Then, you have four that will pull
the largest ones, and two that will pull the smallest. This gets really
good behavior where I work because we have a handful of decade-old
systems that we need to keep backed up and which take _forever_ to back
up, but most of our other systems are new and don't take too long. On the
first dump, this is equivalent to 'SSSSSSss', but after that, the slow
systems get priority to run while everything else is dumping even though
they are not the largest or smallest dumps, so the backup process
doesn't stall out waiting on them to finish at the end.
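And to tie that back to the pick_next sketch from earlier (made-up
numbers, just to show why the slow old boxes jump the queue for the 'T'
dumpers):

pending = [
    {"name": "old-box",  "size": 20, "bandwidth": 1, "time": 300},
    {"name": "new-big",  "size": 90, "bandwidth": 1, "time": 120},
    {"name": "new-tiny", "size":  2, "bandwidth": 1, "time":   5},
]
for letter in "TSs":
    print(letter, pick_next(letter, pending)["name"])
# T old-box   <- longest predicted runtime, even though it's mid-sized
# S new-big   <- largest
# s new-tiny  <- smallest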