Thanks for the feedback!
On Fri, Jan 09, 2015 at 10:59:02AM +0100, Klaus Aehlig wrote:
> > +This behaviour is clearly wrong, but the problem doesn't arise often in the
> > +current setup, because instances currently only have a single storage type.
>
> I guess the main reason why it works in practice is that admins tend to
> group instances on node groups by needs. Otherwise, a single storage type
> per instance wouldn't suffice if we freely mix and match, e.g., file
> storage and DRBD.
>
That is another reason. Should I add it?
> > +Proposed changes
> > +================
> > +
> > +Definitions
> > +-----------
> > +
> > +* All disks have exactly one *desired storage unit*, which determines
> > +  where and how the disk can be stored. If the disk is transferred, the
> > +  desired storage unit remains unchanged. The desired storage unit
> > +  includes specifics like the volume group in the case of LVM-based
> > +  storage.
> > +* A *storage unit* is a specific storage location on a specific node.
> > +  Storage units have exactly one desired storage unit they can contain.
> > +  A storage unit further has a name, a total capacity, and a free
> > +  capacity.
> > +* For the purposes of this document, a *disk* has a desired storage unit
> > +  and a size.
> > +* A *disk can be moved* to a node if there is at least one storage unit
> > +  on that node which can contain the desired storage unit of the disk
> > +  and whose free capacity is at least the size of the disk.
> > +* An *instance can be moved* to a node if all its disks can be moved
> > +  there one-by-one.
> > +
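To make the definitions concrete, here is a minimal sketch (Python, all names
hypothetical) of the fit checks as I read them:

```python
# Sketch of the fit checks from the definitions above; all names are
# hypothetical and only illustrate the intended semantics.

def disk_fits_node(disk, storage_units):
    """A disk can be moved to a node if at least one storage unit there
    can contain the disk's desired storage unit and has enough free space."""
    return any(unit["desired"] == disk["desired"] and unit["free"] >= disk["size"]
               for unit in storage_units)

def instance_fits_node(disks, storage_units):
    """An instance can be moved to a node if all its disks fit one-by-one."""
    return all(disk_fits_node(disk, storage_units) for disk in disks)
```
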
> > +LUXI extension
> > +--------------
> > +
> > +The LUXI protocol is extended to include:
>
> please specify for which entity this information is included.
>
+ in the ``node``
> > +
> > +* ``storage``: a list of objects (storage units) with
> > + # Storage unit, containing in order:
> > + # storage type
> > + # storage key (e.g. volume group name)
> > + # extra parameters (e.g. flag for exclusive storage) as a list.
> > + # Amount free in MiB
> > + # Amount total in MiB
> > +
> > +.. code-block:: javascript
> > +
> > +  {
> > +    "storage": [
> > +      { "stype": ["drbd8", "xenvg", [false]]
> > +      , "free": 2000
> > +      , "total": 4000
> > +      },
> > +      { "stype": ["file", "/path/to/storage1", []]
> > +      , "free": 5000
> > +      , "total": 10000
> > +      },
> > +      { "stype": ["file", "/path/to/storage2", []]
> > +      , "free": 1000
> > +      , "total": 20000
> > +      },
> > +      { "stype": ["lvm-vg", "xenssdvg", [false]]
> > +      , "free": 1024
> > +      , "total": 1024
> > +      }
> > +    ]
> > +  }
> > +
> > +is an instance with an LVM volume group mirrored over DRBD, two file storage
>
> "instance" or "node"? Note that, since "instance" has a special meaning in
> Ganeti, if you use that word in its normal English meaning you have to
> specify what it is an instance of.
>
s/instance/node/
> > +directories, one half full, one mostly full, and a non-mirrored volume group.
> > +
> > +The storage type ``drbd8`` needs to be added in order to differentiate
> > +between mirrored storage and non-mirrored storage.
> > +
> > +IAllocator protocol extension
> > +-----------------------------
> > +
> > +The same field is optionally present in the IAllocator protocol:
> > +
> > +* a new ``storage`` column is added, which is a semicolon-separated list
> > +  of comma-separated fields, in the order
> > + #. ``stype``
> > + #. ``free``
> > + #. ``total``
> > +
> > +For example:
> > +
> > + drbd8,2000,4000;file,5000,10000;file,1000,20000;lvm-vg,1024,1024
>
> That looks more like the text format. The IAllocator protocol is a JSON
> protocol.
>
True. The LUXI and IAllocator protocols are extended in the same way,
however, so I'll change the LUXI section to say 'LUXI and IAllocator
protocol'. This section is the 'Text protocol extension'.
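For clarity, a minimal sketch (hypothetical helper, not in the patch) of how
that text-format column would be parsed:

```python
def parse_storage_column(column):
    """Parse the semicolon-separated ``storage`` column of the text format
    into (stype, free, total) triples. Hypothetical helper, illustration only."""
    units = []
    for entry in column.split(";"):
        stype, free, total = entry.split(",")
        units.append((stype, int(free), int(total)))
    return units
```
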
> > +
> > +Interpretation
> > +--------------
> > +
> > +hbal and hail will use this information only if it is available; if the
> > +data file doesn't contain the ``storage`` field, the old algorithm is
> > +used.
> > +
> > +If the node information contains the ``storage`` field, hbal and hail
> > +will assume that only the space compatible with the disk's requirements
> > +is available. For an instance to fit a node, all its disks need to fit
> > +there separately. For a disk to fit a node, a storage unit of the type
> > +of the disk needs to have enough free space to contain it.
>
> Please also specify what is supposed to happen if the new ``storage``
> field promises more space than the current total sum fields.
>
+ The case of the total free space being smaller than the free space reported by
+ the ``storage`` field can occur with shared storage. To accommodate this
+ case, we ignore the total free space if the ``storage`` field is present.
> > +
> > +Balancing
> > +---------
> > +
> > +In order to determine a storage location for an instance, we collect
> > +metrics analogous to the current total node free space metric -- namely
> > +the standard deviation statistic of the free space per storage unit.
> > +
> > +The full storage metric for a given desired storage unit is a weighted
> > +sum of the standard deviation metrics of the storage units. The weights
> > +of the storage units are proportional to the total capacity of each
> > +storage unit and sum up to the weight of space in the old
> > +implementation (1.0).
>
> This has the effect that the most scarce resource is valued the least. Is
> this on purpose? I'm thinking of a situation with lots of storage on
> conventional spinning disks and a limited amount of solid state disks.
>
This section is confusing; I'll reword it if we keep the metric like
this. It talks about placing a specific desired storage unit, so for
example the space metric for DRBD with volume group xendrbd would be the
weighted metric of all DRBD(vg=xendrbd) storage units of the nodes,
weighted by the size of the DRBD(vg=xendrbd) storage unit on each node.
  node1:  9/10
  node2: 50/100
  node3: 70/100
  avg = (9+50+70)/(10+100+100) = 129/210
then
  m_storage =  10/210 * ( 9/10  - 129/210)^2
            + 100/210 * (50/100 - 129/210)^2
            + 100/210 * (70/100 - 129/210)^2
The reason I wrote it like this is that balancing the two big nodes
against each other should take priority over balancing the small one,
because this will allow us to place more instances more quickly.
Do you disagree?
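In code, the calculation would look roughly like this (a sketch only, not the
actual implementation; `units` is a list of (free, total) pairs for one
desired storage unit):

```python
def storage_metric(units):
    """Weighted variance of the relative free space of the storage units
    for one desired storage unit, with weights proportional to each
    unit's total capacity. Sketch only, hypothetical name."""
    total_capacity = sum(total for _, total in units)
    # average relative free space across all capacity of this storage unit
    avg = sum(free for free, _ in units) / total_capacity
    return sum((total / total_capacity) * (free / total - avg) ** 2
               for free, total in units)
```

With the numbers above, `storage_metric([(9, 10), (50, 100), (70, 100)])` is
about 0.0136.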
> > +This is necessary to
> > +
> > +#. Keep the metric compatible.
> > +#. Avoid that the metric of a node with many storage units is dominated
> > +   by them.
> > +
> > +Note that the metric is independent of the storage type to be placed,
> > +but the other types don't change the ranking of the possible
> > +placements.
>
> I fail to understand that sentence. Can you please elaborate on it?
Again, I should word this better. The question of the storage metric as
I see it is two-fold:
1. how do the metrics for different storage units interact, and
2. how should the metric change with the placement of an instance?
For the first question, it should ideally be the case that if an
instance on storage unit A is placed on the cluster, the ranking of the
nodes considered for a placement of an instance on storage unit B does
not change in order.
The second question depends on the first; for example, if we just
aggregated the total free space as we currently do, we'd violate that
principle. The principle does hold, for example, for an aggregation of
the metrics of several storage units of the form
  f(m_1, m_2, ..., m_n) = w_1*g(m_1) + ... + w_n*g(m_n)
So much for the "other types don't change the ranking of the possible
placements" part.
As for the "independent of the storage unit" part: the full metric is
independent of the storage unit, because we want to evaluate all storage
units at the same time, but the metric for a placement would depend on
the desired storage unit. This sentence should say that we can use the
global metric instead of calculating the storage-unit-specific metric,
because the order is preserved as long as we only look at modifications
of the global metric.
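A toy illustration of the order-preservation claim (all numbers made up; any
fixed g and weights behave the same way here):

```python
# With f(m_1, ..., m_n) = w_1*g(m_1) + ... + w_n*g(m_n), changing only the
# component for storage unit A shifts f of every candidate by the same
# amount, so candidates that differ only in unit B's metric keep their order.

def g(m):
    # any fixed function of a single component; identity for the demo
    return m

def f(metrics, weights):
    return sum(w * g(m) for w, m in zip(weights, metrics))

weights = [0.3, 0.7]            # [w_A, w_B], made up
# Candidate placements for an instance on unit B differ only in m_B.
before1, before2 = [0.2, 0.10], [0.2, 0.15]
assert f(before1, weights) < f(before2, weights)
# A placement on unit A changes m_A identically for both candidates;
# the ranking is unchanged.
after1, after2 = [0.5, 0.10], [0.5, 0.15]
assert f(after1, weights) < f(after2, weights)
```
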
--
Google Germany GmbH
Dienerstr. 12
80331 München
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Graham Law, Christine Elizabeth Flores