On Fri, Feb 7, 2014 at 2:55 PM, Hrvoje Ribicic <[email protected]> wrote:

> On Thu, Feb 6, 2014 at 9:47 AM, Thomas Thrainer <[email protected]> wrote:
>
>>
>> On Tue, Feb 4, 2014 at 6:05 PM, Hrvoje Ribicic <[email protected]> wrote:
>>
>>> This patch adds a design document exploring zeroing and lock reduction
>>> as options for the improved performance and parallelism of
>>> cross-cluster instance moves.
>>>
>>> Signed-off-by: Hrvoje Ribicic <[email protected]>
>>> ---
>>>  doc/design-move-instance-improvements.rst |  182
>>> +++++++++++++++++++++++++++++
>>>  1 file changed, 182 insertions(+)
>>>  create mode 100644 doc/design-move-instance-improvements.rst
>>>
>>> diff --git a/doc/design-move-instance-improvements.rst
>>> b/doc/design-move-instance-improvements.rst
>>> new file mode 100644
>>> index 0000000..22b4bf5
>>> --- /dev/null
>>> +++ b/doc/design-move-instance-improvements.rst
>>> @@ -0,0 +1,182 @@
>>> +========================================
>>> +Cross-cluster instance move improvements
>>> +========================================
>>> +
>>> +.. contents:: :depth: 3
>>> +
>>> +To move instances across clusters, Ganeti provides the move-instance
>>> +tool. It uses the RAPI to create new instances in the destination
>>> +cluster, ready to import data from instances in the source cluster.
>>> +
>>> +The tool works correctly and reliably but, depending on bandwidth and
>>> +priority, an instance disk of considerable size requires a long time
>>> +to transfer. This is inconvenient at best, and can be remedied either
>>> +by reducing the length of the transfers, or by allowing more
>>> +operations to run in parallel with instance moves.
>>> +
>>> +The former can be achieved by zeroing the empty space on instance
>>> +disks and compressing them prior to transfer, the latter by reducing
>>> +the amount of locking performed during an instance move. As the two
>>> +approaches tackle different aspects of the problem, they do not
>>> +exclude each other and will be presented independently.
>>> +
>>> +Zeroing instance disks
>>> +======================
>>> +
>>> +Support for disk compression during instance moves was partially
>>> +present before, but it was cleaned up and explicitly exposed as the
>>> +--compress option only as of Ganeti 2.10. While compression lowers
>>> +the amount of data sent, further gains can be achieved by taking
>>> +advantage of the structure of the disk - namely, by sending only the
>>> +used disk sectors.
>>> +
>>> +There is no direct way to achieve this, as it would require the
>>> +move-instance tool to be aware of the structure of the file system.
>>> +Mounting the filesystem on the node is not an option, primarily due
>>> +to security issues. A disk primed to take advantage of a disk driver
>>> +exploit could allow an attacker to breach instance isolation and gain
>>> +control of a Ganeti node.
>>> +
>>> +An indirect way to achieve this performance gain is to zero out the
>>> +empty hard disk space. Sequences of zeroes can be compressed and thus
>>> +transferred very efficiently, all without the host knowing that they
>>> +represent empty space. Note that this approach can be dangerous for
>>> +sparse disks, as zeroing one causes it to balloon to its full size.
>>> +As Ganeti does not seem to make special concessions for moving sparse
>>> +disks, the only difference should be the disk space utilization on
>>> +the current node.
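As a quick illustration of why this pays off (my own sketch, not part of
the patch): even at the fastest compression level, a zeroed block
shrinks to a tiny fraction of its size.

    # Sketch: how well a block of zeroes compresses (illustration only).
    import zlib

    block = b"\0" * (4 * 1024 * 1024)      # 4 MiB of zeroed disk space
    compressed = zlib.compress(block, 1)   # level 1 = fastest setting
    print("%d -> %d bytes (%.4f%%)" % (len(block), len(compressed),
                                       100.0 * len(compressed) / len(block)))
    # Prints a ratio around 0.1% of the original size.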
>>> +
>>> +Zeroing approaches
>>> +++++++++++++++++++
>>> +
>>> +Zeroing is a feasible approach, but the node itself cannot perform
>>> +it, as it cannot safely mount the disk. Only virtualization-based
>>> +options remain, and of those, using Ganeti's own virtualization
>>> +capabilities makes the most sense. There are two ways of doing this -
>>> +creating a new helper instance, temporary or persistent, or reusing
>>> +the target instance.
>>> +
>>> +Both approaches have their disadvantages. Creating a new helper
>>> +instance requires managing its lifecycle, taking special care to make
>>> +sure no helper instance is left over after a failed operation. Even
>>> +if this were taken care of, disks are not yet separate entities in
>>> +Ganeti, making the temporary transfer of disks between instances hard
>>> +to implement and even harder to make robust. Reusing the target
>>> +instance can be done by modifying the OS running on it to perform the
>>> +zeroing itself when notified via the new instance communication
>>> +mechanism, but this approach is neither generic nor particularly
>>> +safe. There is no guarantee that the zeroing operation will not
>>> +interfere with the normal operation of the instance, nor that it will
>>> +be completed if a user-initiated shutdown occurs.
>>> +
>>> +A better solution can be found by combining the two approaches -
>>> +reusing the virtualized environment, but with a specifically crafted
>>> +OS image. With the instance shut down, as it should be in preparation
>>> +for the move, it can be extended with an additional disk holding the
>>> +OS image. By prepending this disk and changing some instance
>>> +parameters, the instance can boot from it. The OS can be configured
>>> +to perform the zeroing on startup, attempting to mount any partitions
>>> +with a filesystem present, and creating and then deleting a
>>> +zero-filled file on them. After the zeroing is complete, the OS
>>> +should shut down, and the master should note the shutdown and restore
>>> +the instance to its previous state.
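To make the expected behaviour of such an image concrete, a rough sketch
of its boot-time job could look as follows (my illustration only - the
device list, mountpoint and file name are all made up, and a real image
would also have to exclude its own boot disk):

    # Sketch of a zeroing image's startup job (illustration only).
    # Mount every partition carrying a filesystem, fill the free space
    # with a zeroed file, delete it again, then power off.
    import glob, os, subprocess

    def zero_free_space(mountpoint):
        path = os.path.join(mountpoint, "zero.fill")
        chunk = b"\0" * (4 * 1024 * 1024)
        with open(path, "wb", 0) as f:   # unbuffered writes
            try:
                while True:
                    f.write(chunk)       # raises once the disk is full
            except OSError:              # ENOSPC: free space is zeroed
                os.fsync(f.fileno())
        os.remove(path)

    for dev in glob.glob("/dev/vd*") + glob.glob("/dev/sd*"):
        # Devices without a mountable filesystem simply fail here and
        # are skipped.
        if subprocess.call(["mount", dev, "/mnt"]) == 0:
            zero_free_space("/mnt")
            subprocess.call(["umount", "/mnt"])

    subprocess.call(["poweroff"])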
>>> +
>>> +Note that the requirements above are very similar to the notion of a
>>> +helper VM suggested in the OS install design document. Some
>>> +potentially unsafe actions are performed within a virtualized
>>> +environment, acting on disks that belong or will belong to the
>>> +instance. The mechanisms used will thus be developed with both
>>> +approaches in mind.
>>> +
>>> +Implementation
>>> +++++++++++++++
>>> +
>>> +There are two components to this solution - the Ganeti changes
>>> +needed to boot the OS, and the OS image used for the zeroing. Due to
>>> +the variety of filesystems and architectures that instances can use,
>>> +no single ready-to-run disk image can satisfy the needs of all Ganeti
>>> +users. Instead, the instance-debootstrap scripts can be used to
>>> +generate a zeroing-capable OS image. This might not be ideal, as
>>> +there are lightweight distributions that take up less space and boot
>>> +up more quickly, but generating those with the right set of drivers
>>> +for the virtualization platform of choice is not easy. Thus we do not
>>> +provide a script for this purpose; instead, the user is free to
>>> +provide any OS image which performs the necessary steps: zero out all
>>> +virtualization-provided devices on startup, then shut down
>>> +immediately. The cluster-wide parameter controlling the image to be
>>> +used would be called zeroing-image.
>>> +
>>> +The modifications needed to the Ganeti code are minor. The zeroing
>>> +functionality should be implemented as an extension of the instance
>>> +export, and exposed as the --zero-free-space option. Prior to
>>> +beginning the export, the instance configuration is temporarily
>>> +extended with a new read-only disk of sufficient size to host the
>>> +zeroing image, and with the changes necessary for the image to be
>>> +used as the boot drive. The temporary nature of the configuration
>>> +changes requires that they are not propagated to other nodes. While
>>> +this would normally not be feasible with an instance using a disk
>>> +template offering multi-node redundancy, experiments with the code
>>> +have shown that the restriction on mixing disk templates can be
>>> +bypassed to temporarily allow a plain disk-template disk to host the
>>> +zeroing image. The image is dumped to the disk, and the instance is
>>> +started up.
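In code terms, I picture the flow roughly like this (a sketch under
made-up helper names, none of which exist in Ganeti):

    # Sketch of the temporary-disk flow around the export (hypothetical
    # helper names, not actual Ganeti functions).
    def zero_instance_disks(instance, zeroing_image, timeout):
        # Attach a temporary plain disk as the boot device and dump the
        # zeroing image onto it; nothing here reaches other nodes'
        # configurations.
        temp_disk = attach_temporary_boot_disk(instance,
                                               size_of(zeroing_image))
        try:
            dump_image(zeroing_image, temp_disk)
            start_instance(instance)
            # Wait for the self-initiated shutdown signalling
            # completion, or force a shutdown once the user-defined
            # timeout expires.
            if not wait_for_shutdown(instance, timeout):
                force_shutdown(instance)
        finally:
            # Always revert, even if the zeroing failed midway.
            detach_and_destroy(temp_disk)
            restore_original_config(instance)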
>>> +
>>> +Once the instance is started up, the zeroing will proceed until
>>> +completion, at which point a self-initiated shutdown will occur. The
>>> +instance-shutdown detection capabilities of 2.11 should prevent the
>>> +watcher from restarting the instance once this happens, allowing the
>>> +host to take the shutdown as a sign that the zeroing was completed.
>>> +Either way, the host waits until the instance is shut down, or until
>>> +a user-defined timeout has been reached, at which point the instance
>>> +is forcibly shut down.
>>>
>>
>> This timeout should be dependent on the size of the disks of the
>> instance. Zeroing 300GB can take some time, and such instances could
>> happily exist next to 10GB ones...
>>
>>
>
> A valid point, but I am a bit skeptical that the user can provide a
> good guess for the size factor, and shutting down too early has
> consequences, as discussed in the document.
>
> The point of the timeout would be to kill the VM after enough time has
> passed that the user is sure that something has gone wrong, and wishes to
> end the attempt. This is the only way to do it, as the current version of
> Ganeti cannot end running jobs. There are plans for this to change, but for
> the time being, some mechanism has to be provided.
>
> Additionally, a fixed timeout is necessary - the zeroing image can be
> user-provided, and there's no way of telling how long startup will take, as
> this may include setting up whatever mechanisms are needed for the instance
> communication.
>
> That said, I am not against having two timeout parameters - the fixed one
> and one size factor. I would just suggest that the default is zero for the
> size factor and a very conservative value for the fixed one.
> With instance communication in place, the size factor should be ignored in
> favor of the real-time reports.
>

Ok, makes sense. Just keep in mind that there are instances with >8TB of
disk around, and choosing a conservative fixed timeout which also works for
those might be a bit difficult.
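Some back-of-the-envelope numbers for a size-dependent timeout of the
form fixed + factor * disk_size, assuming the zeroing VM writes zeroes
at roughly 100 MB/s (an assumption, not a measurement):

    10 GB disk  -> ~100 s of zeroing
    300 GB disk -> ~50 min
    8 TB disk   -> ~22 h

A single fixed timeout generous enough for the 8 TB case would leave a
stuck 10 GB instance running for almost a day before being killed, so
the size factor seems worth having even if it defaults to zero.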


>
>
>>> +
>>> +Better progress monitoring can be implemented with the instance-host
>>> +communication channel proposed by the OS install design document. The
>>> +first version will most likely use only the shutdown detection, and
>>> +will be improved to account for the available communication channel
>>> +at a later time.
>>> +
>>> +After the shutdown, the temporary disk is destroyed and the instance
>>> +configuration is reverted to its original state. The same action is
>>> +taken if any error is encountered during the zeroing process. If the
>>> +zeroing is interrupted while the zero-filled file is being written,
>>> +there is little that can be done to recover. One precautionary
>>> +measure is to place the file in the /tmp directory on Unix systems,
>>> +if one exists and can be identified as such. Even if tmpfs is mounted
>>> +there, it is the most likely location to be cleaned up in case of
>>> +failure.
>>>
>>
>> If tmpfs is mounted there, it would hide the zero-file from the user,
>> thus making it harder to recover manually from such a problem. Also,
>> if the filesystem is not the root filesystem of the guest but is
>> usually mounted under e.g. /home, there wouldn't be a /tmp
>> directory... Anyway, both approaches have advantages and
>> disadvantages, so I would personally go for the easier one.
>>
>>
> Not to mention that this might be used to move another type of OS, and
> there putting things into /tmp might be considered an obfuscation :)
> No /tmp it is.
>
>
>> Another note: the OS image could/should also zero all swap partitions
>> completely in order to save some more space.
>>
>
>  Ack, will include it in the doc.
>
>
>>
>> Something I'm missing in this part of the design is a discussion of
>> compression methods (maybe with a lot of zeros something really fast
>> can be used) and/or a (semi-)automated way of figuring out whether
>> zeroing+compression is faster than just sending all the data. I agree
>> that this is a bit out of scope for now, but the user should at least
>> have the option to enable or disable zeroing. As future work,
>> move-instance could get a rough measurement of the throughput between
>> the clusters and could then decide, based on the size of the instance
>> disks and some heuristics, whether zeroing makes sense.
>>
>
> I am completely in favor of allowing the user to enable or disable zeroing
> through the --zero-free-space option, but I should probably make that more
> clear in the design document.
>
> I think that the heuristic would be troublesome because the transfer speed
> is dependent on:
>
> - compression / decompression speed - algorithm dependent
> - encryption / decryption speed - algorithm dependent
> - bandwidth
>
> and the overall duration is then also affected by how well the data
> compresses. We could choose the best combination, but for that we would
> need to supply or perform measurements of the performance of the
> compression and encryption algorithms, and of the free-space ratio for
> which these were recorded.
>
> I'd much rather leave the choice of compression algorithm to the user, and
> provide a decent default based on what we use.
> Anyone who performs enough instance moves to care about performance will
> probably be in a position to perform some benchmarks and set the best
> parameters on their own.
>
> Maybe a --perform-zeroing-under-free-space-ratio parameter, unset by
> default, would be a good compromise? Or some sort of hook present for the
> ExportInstance opcode/LU?
>

I would list a couple of those ideas as further work and see what's
actually asked for. A completely automatic approach which measures the
link/parameters first and then performs some magic would be cool, but
really hard to implement in a robust way... BTW, a
--perform-zeroing-under-free-space-ratio option might be hard to
implement, because only the zeroing VM knows about the free space on the
disk, and this VM could be user-supplied.
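For the further-work list, the decision rule itself is simple enough to
sketch (made-up helper and parameter names; the hard part is getting
trustworthy inputs, not the arithmetic):

    # Sketch of a "does zeroing pay off" heuristic (illustration only).
    def zeroing_worthwhile(disk_size, free_ratio, link_bps, zero_write_bps):
        # The free space compresses to almost nothing, so roughly
        # free_ratio of the disk need not be transferred at all.
        transfer_saved = (disk_size * free_ratio) / link_bps
        # Cost: the helper VM has to write that much in zeroes first.
        zeroing_cost = (disk_size * free_ratio) / zero_write_bps
        return transfer_saved > zeroing_cost

    # Example: 300 GB disk, 60% free, 50 MB/s link, 200 MB/s writes.
    GB, MB = 10 ** 9, 10 ** 6
    print(zeroing_worthwhile(300 * GB, 0.6, 50 * MB, 200 * MB))  # True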


>
>
>> Another thing missing is the discussion of encryption algorithms. The
>> method to encrypt the data sent from one cluster to the other can be
>> configured and plays quite a big role throughput-wise. We could give users
>> the choice to use another (possibly weaker) encryption if they want more
>> speed and/or review the choice we've made.
>>
>
> I focused on the cross-cluster case in this document, and there I'd be
> surprised if encryption trumped bandwidth as the limiting factor. For
> intra-cluster moves, certainly, and I'd guess some users would
> appreciate the "none" option as well. Will add this to the document.
>

I guess we should just leave the choice to the user. If the data on the
VM is not sensitive, no encryption might be good enough as long as the
data resides in one data center (but not in the same cluster).


>
>>
>>> +
>>> +Lock reduction
>>> +==============
>>> +
>>> +An instance move as executed by the move-instance tool consists of
>>> +several preparatory RAPI calls, leading up to two long-lasting
>>> +opcodes: OpInstanceCreate and OpBackupExport. While OpBackupExport
>>> +locks only the instance, the locks of OpInstanceCreate require more
>>> +attention.
>>> +
>>> +When executed, this opcode attempts to lock all nodes on which the
>>> +instance may be created, and to obtain shared locks on the groups
>>> +they belong to. In the case that an IAllocator is used, this means
>>> +all nodes must be locked. Any operation that requires a node lock to
>>> +be present can delay the move operation, and there is no shortage of
>>> +these.
>>> +
>>> +The concept of opportunistic locking has been introduced to remedy
>>> +exactly this situation, allowing the IAllocator to grab as many node
>>> +locks as possible. Depending on how many nodes were available, the
>>> +operation either proceeds as expected, or fails, noting that it is
>>> +temporarily infeasible. The failure case is unacceptable for the
>>> +move-instance tool, which is expected to fail only if the move is
>>> +impossible. To reap the benefits of opportunistic locking yet satisfy
>>> +this constraint, the move-instance tool can be extended with the
>>> +--opportunistic-tries and --opportunistic-try-delay options. A number
>>> +of opportunistic instance creations are attempted, with a delay
>>> +between attempts. Should they all fail, a normal, blocking instance
>>> +creation is requested.
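The retry loop itself would be straightforward; something along these
lines inside move-instance (a sketch - the RAPI call and job polling are
simplified, and wait_for_job is a made-up helper):

    # Sketch of the proposed opportunistic retry loop (simplified).
    import time

    def create_with_tries(rapi, params, tries, try_delay):
        for attempt in range(tries):
            if attempt > 0:
                time.sleep(try_delay)
            job = rapi.CreateInstance(opportunistic_locking=True, **params)
            if wait_for_job(rapi, job):
                return job
            # Temporarily infeasible: too few free node locks; retry.
        # All opportunistic attempts failed; fall back to a normal,
        # blocking creation which waits for the locks instead.
        return rapi.CreateInstance(opportunistic_locking=False, **params)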
>>> +
>>> +While it may seem excessive to grab so many node locks, the early
>>> +release mechanism is used to make the situation less dire, releasing
>>> +all nodes that were not chosen as candidates for allocation. This is
>>> +taken to the extreme as all the acquired locks are released prior to
>>> +the start of the transfer, apart from the newly-acquired lock on the
>>> +new instance. This works because all operations that alter the node
>>> +in a way which could affect the transfer either:
>>> +
>>> +* are prevented by the instance lock or instance presence, e.g.
>>> +  gnt-node remove, gnt-node evacuate,
>>> +
>>> +* do not interrupt the transfer, e.g. a PV on the node can be set as
>>> +  unallocatable, and the transfer still proceeds as expected, or
>>> +
>>> +* do not care, e.g. gnt-node powercycle explicitly ignores all locks.
>>> +
>>> +This is an invariant to be kept in mind for future development, but
>>> +at the current time, no additional locks are needed.
>>> +
>>> +Introduction of changes
>>> +=======================
>>> +
>>> +Both the instance zeroing and the lock reduction will be implemented
>>> +as part of Ganeti 2.12, in the way described in the previous
>>> +chapters. They will be implemented as separate changes: first the
>>> +lock reduction, and then the instance zeroing, as its implementation
>>> +overlaps with and benefits from the changes needed for the OS
>>> +installation improvements.
>>> --
>>> 1.7.10.4
>>>
>>>
>> Would it make sense to share this design doc with the SREs as well? I
>> know that climent@ filed the bug about instance moves, but he's not
>> working on it any more. So ganeti-sre@ or ganeti-team@ might be
>> appropriate.
>>
>> Cheers,
>> Thomas
>>
>
>


-- 
Thomas Thrainer | Software Engineer | [email protected] |

Google Germany GmbH
Dienerstr. 12
80331 München

Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Graham Law, Christine Elizabeth Flores
