Re: [RFC master] Add move instance improvements design document

Hrvoje Ribicic Tue, 25 Feb 2014 01:55:18 -0800

To address many of the comments here, I made significant changes to the
document, and sent it out as a separate mail which should have a very
similar title.



On Tue, Feb 11, 2014 at 8:48 AM, Thomas Thrainer <[email protected]>wrote:

>
>
>
> On Fri, Feb 7, 2014 at 3:24 PM, Hrvoje Ribicic <[email protected]> wrote:
>
>> On Fri, Feb 7, 2014 at 10:17 AM, Petr Pudlák <[email protected]> wrote:
>>
>>>
>>>
>>>
>>> On Thu, Feb 6, 2014 at 9:47 AM, Thomas Thrainer <[email protected]>wrote:
>>>
>>>>
>>>>
>>>>
>>>> On Tue, Feb 4, 2014 at 6:05 PM, Hrvoje Ribicic <[email protected]> wrote:
>>>>
>>>>> This patch adds a design document exploring zeroing and lock reduction
>>>>> as options for the improved performance and parallelism of
>>>>> cross-cluster instance moves.
>>>>>
>>>>> Signed-off-by: Hrvoje Ribicic <[email protected]>
>>>>> ---
>>>>>  doc/design-move-instance-improvements.rst |  182
>>>>> +++++++++++++++++++++++++++++
>>>>>  1 file changed, 182 insertions(+)
>>>>>  create mode 100644 doc/design-move-instance-improvements.rst
>>>>>
>>>>> diff --git a/doc/design-move-instance-improvements.rst
>>>>> b/doc/design-move-instance-improvements.rst
>>>>> new file mode 100644
>>>>> index 0000000..22b4bf5
>>>>> --- /dev/null
>>>>> +++ b/doc/design-move-instance-improvements.rst
>>>>> @@ -0,0 +1,182 @@
>>>>> +========================================
>>>>> +Cross-cluster instance move improvements
>>>>> +========================================
>>>>> +
>>>>> +.. contents:: :depth: 3
>>>>> +
>>>>> +To move instances across clusters, Ganeti provides the move-instance
>>>>> tool. It
>>>>> +uses the RAPI to create new instances in the destination cluster,
>>>>> ready to
>>>>> +import data from instances in the source cluster.
>>>>> +
>>>>> +The tool works correctly and reliably, but depending on bandwidth and
>>>>> priority,
>>>>> +an instance disk of considerable size requires a long time to
>>>>> transfer. This is
>>>>> +inconvenient at best, and can be remedied by either reducing the
>>>>> length of the
>>>>> +transfers, or allowing more operations to run in parallel with
>>>>> instance moves.
>>>>> +
>>>>> +The former can be achieved through the zeroing of empty space on
>>>>> instance disks
>>>>> +and compressing them prior to transfer, and the latter by reducing
>>>>> the amount of
>>>>> +locking happening during an instance move. As the approaches aim to
>>>>> tackle two
>>>>> +different aspects of the problem, they do not exclude each other and
>>>>> will be
>>>>> +presented independently.
>>>>> +
>>>>> +Zeroing instance disks
>>>>> +======================
>>>>> +
>>>>> +Support for disk compression during instance moves was partially
>>>>> present before,
>>>>> +but cleaned up and explicitly added as the --compress option only as
>>>>> of Ganeti
>>>>> +2.10. While compression lowers the amount of data sent, further gains
>>>>> can be
>>>>> +achieved by taking advantage of the structure of the disk - namely,
>>>>> sending only
>>>>> +used disk sectors.
>>>>> +
>>>>> +There is no direct way to achieve this, as it would require that the
>>>>> +move-instance tool is aware of the structure of the file system.
>>>>> Mounting the
>>>>> +filesystem is not an option, primarily due to security issues. A disk
>>>>> primed to
>>>>> +take advantage of a disk driver exploit could cause an attacker to
>>>>> breach
>>>>> +instance isolation and gain control of a Ganeti node.
>>>>> +
>>>>> +An indirect way for this performance gain to be achieved is the
>>>>> zeroing of the
>>>>> +empty hard disk space. Sequences of zeroes can be compressed and thus
>>>>> +transferred very efficiently, all without the host knowing that these
>>>>> are empty
>>>>> +space. This approach can also be dangerous if a sparse disk is zeroed
>>>>> in this
>>>>> +way, causing ballooning. As Ganeti does not seem to make special
>>>>> concessions for
>>>>> +moving sparse disks, the only difference should be the disk space
>>>>> utilization
>>>>> +on the current node.
>>>>> +
>>>>> +Zeroing approaches
>>>>> +++++++++++++++++++
>>>>> +
>>>>> +Zeroing is a feasible approach, but the node cannot perform it as it
>>>>> cannot
>>>>> +mount the disk. Only virtualization-based options remain, and of
>>>>> those, using
>>>>> +Ganeti's own virtualization capabilities makes the most sense. There
>>>>> are two
>>>>> +ways of doing this - creating a new helper instance, temporary or
>>>>> persistent, or
>>>>> +reusing the target instance.
>>>>> +
>>>>> +Both approaches have their disadvantages. Creating a new helper
>>>>> instance
>>>>> +requires managing its lifecycle, taking special care to make sure no
>>>>> helper
>>>>> +instance remains left over due to a failed operation. Even if this
>>>>> were to be
>>>>> +taken care of, disks are not yet separate entities in Ganeti, making
>>>>> the
>>>>> +temporary transfer of disks between instances hard to implement and
>>>>> even harder
>>>>> +to make robust. The reuse can be done by modifying the OS running on
>>>>> the
>>>>> +instance to perform the zeroing itself when notified via the new
>>>>> instance
>>>>> +communication mechanism, but this approach is neither generic, nor
>>>>> particularly
>>>>> +safe. There is no guarantee that the zeroing operation will not
>>>>> interfere with
>>>>> +the normal operation of the instance, nor that it will be completed
>>>>> if a
>>>>> +user-initiated shutdown occurs.
>>>>> +
>>>>> +A better solution can be found by combining the two approaches -
>>>>> re-using the
>>>>> +virtualized environment, but with a specifically crafted OS image.
>>>>> With the
>>>>> +instance shut down as it should be in preparation for the move, it
>>>>> can be
>>>>> +extended with an additional disk with the OS image on it. By
>>>>> prepending the
>>>>> +disk and changing some instance parameters, the instance can boot
>>>>> from it. The
>>>>> +OS can be configured to perform the zeroing on startup, attempting to
>>>>> mount any
>>>>> +partitions with a filesystem present, and creating and deleting a
>>>>> zero-filled
>>>>> +file on them. After the zeroing is complete, the OS should shut down,
>>>>> and the
>>>>> +master should note the shutdown and restore the instance to its
>>>>> previous state.
>>>>> +
>>>>> +Note that the requirements above are very similar to the notion of a
>>>>> helper VM
>>>>> +suggested in the OS install document. Some potentially unsafe actions
>>>>> are
>>>>> +performed within a virtualized environment, acting on disks that
>>>>> belong or will
>>>>> +belong to the instance. The mechanisms used will thus be developed
>>>>> with both
>>>>> +approaches in mind.
>>>>> +
>>>>> +Implementation
>>>>> +++++++++++++++
>>>>> +
>>>>> +There are two components to this solution - the Ganeti changes needed
>>>>> to boot
>>>>> +the OS, and the OS image used for the zeroing. Due to the variety of
>>>>> filesystems
>>>>> +and architectures that instances can use, no single ready-to-run disk
>>>>> image can
>>>>> +satisfy the needs of all the Ganeti users. Instead, the
>>>>> instance-debootstrap
>>>>> +scripts can be used to generate a zeroing-capable OS image. This
>>>>> might not be
>>>>> +ideal, as there are lightweight distributions that take up less space
>>>>> and boot
>>>>> +up more quickly. Generating those with the right set of drivers for
>>>>> the
>>>>> +virtualization platform of choice is not easy. Thus we do not provide
>>>>> a script
>>>>> +for this purpose, but the user is free to provide any OS image which
>>>>> performs
>>>>> +the necessary steps: zero out all virtualization-provided devices on
>>>>> startup,
>>>>> +shutdown immediately. The cluster-wide parameter controlling the
>>>>> image to be
>>>>> +used would be called zeroing-image.
>>>>> +
>>>>> +The modifications to Ganeti code needed are minor. The zeroing
>>>>> functionality
>>>>> +should be implemented as an extension of the instance export, and
>>>>> exposed as the
>>>>> +--zero-free-space option. Prior to beginning the export, the instance
>>>>> +configuration is temporarily extended with a new read-only disk of
>>>>> sufficient
>>>>> +size to host the zeroing image, and the changes necessary for the
>>>>> image to be
>>>>> +used as the boot drive. The temporary nature of the configuration
>>>>> changes
>>>>> +requires that they are not propagated to other nodes. While this
>>>>> would normally
>>>>> +not be feasible with an instance using a disk template offering
>>>>> multi-node
>>>>> +redundancy, experiments with the code have shown that the restriction
>>>>> on
>>>>> +diverse disk templates can be bypassed to temporarily allow a plain
>>>>> +disk-template disk to host the zeroing image. The image is dumped to
>>>>> the disk,
>>>>> +and the instance is started up.
>>>>> +
>>>>> +Once the instance is started up, the zeroing will proceed until
>>>>> completion, when
>>>>> +a self-initiated shutdown will occur. The instance-shutdown detection
>>>>> +capabilities of 2.11 should prevent the watcher from restarting the
>>>>> instance
>>>>> +once this happens, allowing the host to take it as a sign the zeroing
>>>>> was
>>>>> +completed. Either way, the host waits until the instance is shut
>>>>> down, or a
>>>>> +user-defined timeout has been reached and the instance is forcibly
>>>>> shut down.
>>>>>
>>>>
>>>> This timeout should be dependent on the size of the disks of the
>>>> instance. Zeroing 300GB can take some time, and such instances could
>>>> happily exist next to 10GB ones...
>>>>
>>>>
>>>>> +
>>>>> +Better progress monitoring can be implemented with the instance-host
>>>>> +communication channel proposed by the OS install design document. The
>>>>> first
>>>>> +version will most likely use only the shutdown detection, and will be
>>>>> improved
>>>>> +to account for the available communication channel at a later time.
>>>>> +
>>>>> +After the shutdown, the temporary disk is destroyed and the instance
>>>>> +configuration is reverted to its original state. The very same action
>>>>> is done if
>>>>> +any error is encountered during the zeroing process. In the case that
>>>>> the
>>>>> +zeroing is interrupted while the zero-filled file is being written,
>>>>> there is
>>>>> +little that can be done to recover. One precautionary measure is to
>>>>> place the
>>>>> +file in the /tmp directory on Unix systems, if one exists and can be
>>>>> identified
>>>>> +as such. Even if TmpFS is mounted there, it is the most likely
>>>>> location to be
>>>>> +cleaned up in case of failure.
>>>>>
>>>>
>>>> If TmpFS is mounted there, it would hide the zero-file from the user
>>>> and making it thus harder to recover manually from such a problem. Also, if
>>>> the filesystem is not the root filesystem of the guest but usually mounted
>>>> under e.g. /home, there wouldn't be a /tmp directory... Anyway, both
>>>> approaches have advantages and disadvantages, so I would personally go for
>>>> the easier one.
>>>>
>>>
>>> I wouldn't be so sure that about /tmp being cleaned up, if it's a
>>> mount-point for TmpFS or another separate partition. I guess an OS first
>>> mounts partitions and only then cleans up /tmp.
>>>
>>>
>> Ack - that part will be removed.
>>
>>
>>>
>>>> Another note: the OS image could/should also zero all swap partition
>>>> completely in order to save some more space.
>>>>
>>>>
>>>> Something I'm missing in this part of the design is a discussion of
>>>> compression-methods (maybe with a lot of zeros something really fast can be
>>>> used)
>>>>
>>>
>>> I've had a good experience with lzop: http://en.wthe gains are
>>> dependent on the gains are dependent on 
>>> ikipedia.org/wiki/Lzop<http://en.wikipedia.org/wiki/Lzop>
>>>  It's _very_ fast compared to other compression tools, so definitely it
>>> wouldn't be a bottleneck, and for blocks of zeroes it would work just as
>>> well as any other algorithm. I tried to compress 1GB of zeroes, it took
>>> 0.5s and got compressed into 4.5MB:
>>>
>>>  dd bs=1MB count=1024 if=/dev/zero | lzop | wc --bytes
>>> 1024+0 records in
>>> 1024+0 records out
>>> 1024000000 bytes (1.0 GB) copied, 2.50511 s, 409 MB/s
>>> 4668035
>>>
>>>
>>
>> I actually do not know if 1GB of zeroes is a good benchmark for a
>> compression tool in this case. With an ext? filesystem, the empty space is
>> likely to be very fragmented, with pockets of zeroed free space scattered
>> amongst files. My hunch is also that speed rules as the ratio will be just
>> about the same for all compression tools, but I would like to do some
>> testing on a more realistic-looking drive first. The choice of compression
>> tool used would certainly be added as an option.
>>
>
> That was the result of my tests as well - slower compression algorithms
> didn't produce much smaller results but took way longer. So lzop might be a
> good option if it's available.
>
>
>>
>>
>>>  and/or a (semi-) automated way of figuring out if zeroing+compression
>>>> is faster than just sending the whole data. I agree that this is a bit out
>>>> of scope for now, but the user should at least have the option to enable or
>>>> disable zeroing. For future work, move-instance could get a rough
>>>> measurement of the throughput between the clusters and could then decide
>>>> based on the size of the instance disks and some heuristics if zeroing
>>>> makes sense.
>>>>
>>>> Another thing missing is the discussion of encryption algorithms. The
>>>> method to encrypt the data sent from one cluster to the other can be
>>>> configured and plays quite a big role throughput-wise. We could give users
>>>> the choice to use another (possibly weaker) encryption if they want more
>>>> speed and/or review the choice we've made.
>>>>
>>>
>>> It'd be interesting to make some tests and measure the impact of various
>>> encryption algorithms. I remember using Blowfish with SSH to reduce CPU
>>> load and speed up transfers, but perhaps nowadays with faster CPUs and
>>> optimizations in encryption algorithms the difference isn't so large.
>>>
>>
>> I guess that for cross-cluster transfers, the limiting factor is the
>> bandwidth and not the speed of encryption, but I might be wrong. Testing it
>> is :)
>>
>
> As far as I can remember, encryption was dead-slow in my benchmarks and
> actually was the limiting factor (except when compression with a full disk
> was used, which resulted in really bad compression performance). Without
> encryption moves could be performed almost twice as quickly in some cases.
>
>
>>
>>
>>>
>>>
>>>>
>>>>
>>>>> +
>>>>> +Lock reduction
>>>>> +==============
>>>>> +
>>>>> +An instance move as executed by the move-instance tool consists of
>>>>> several
>>>>> +preparatory RAPI calls, leading up to two long-lasting opcodes:
>>>>> OpCreateInstance
>>>>> +and OpBackupExport. While OpBackupExport locks only the instance, the
>>>>> locks of
>>>>> +OpCreateInstance require more attention.
>>>>> +
>>>>> +When executed, this opcode attempts to lock all nodes on which the
>>>>> instance may
>>>>> +be created and obtain shared locks on the groups they belong to. In
>>>>> the case
>>>>> +that an IAllocator is used, this means all nodes must be locked. Any
>>>>> operation
>>>>> +that requires a node lock to be present can delay the move operation,
>>>>> and there
>>>>> +is no shortage of these.
>>>>> +
>>>>> +The concept of opportunistic locking has been introduced to remedy
>>>>> exactly this
>>>>> +situation, allowing the IAllocator to grab as many node locks as
>>>>> possible.
>>>>> +Depending on how many nodes were available, the operation either
>>>>> proceeds as
>>>>> +expected, or fails noting that it is temporarily infeasible. The
>>>>> failure case
>>>>> +is unacceptable for the move-instance tool, which is expected to fail
>>>>> only if
>>>>> +the move is impossible. To yield the benefits of opportunistic
>>>>> locking yet
>>>>> +satisfy this constraint, the move-instance tool can be extended with
>>>>> the
>>>>> +--opportunistic-tries and --opportunistic-try-delay options. A number
>>>>> of
>>>>> +opportunistic instance creations are attempted, with a delay between
>>>>> attempts.
>>>>>
>>>>
>>> Definitely the delays should be randomized to avoid inadvertently
>>> synchronized simultaneous attempts by multiple jobs.
>>>
>>
>> Ack.
>>
>>
>>>
>>>
>>>>  +Should they all fail, a normal and blocking instance creation is
>>>>> requested.
>>>>>
>>>>
>>> I don't fully understand this. Does it mean that if opportunistic
>>> locking using an IAllocator fails, it'd fall back to just trying to pick up
>>> any node (or any two nodes) available?
>>>
>>
>> No, it'd fall back to a non-opportunistic use of an IAllocator, blocking
>> the execution of the move until all the node locks on the target cluster
>> can be acquired. Will rewrite.
>>
>>>
>>>
>>>>  +
>>>>> +While it may seem excessive to grab so many node locks, the early
>>>>> release
>>>>> +mechanism is used to make the situation less dire, releasing all
>>>>> nodes that were
>>>>> +not chosen as candidates for allocation. This is taken to the extreme
>>>>> as all the
>>>>> +locks acquired are released prior to the start of the transfer,
>>>>> barring the
>>>>> +newly-acquired lock over the new instance. This works because all
>>>>> operations
>>>>> +that alter the node in a way which could affect the transfer:
>>>>> +
>>>>> +* are prevented by the instance lock or instance presence, e.g.
>>>>> gnt-node remove,
>>>>> +  gnt-node evacuate,
>>>>> +
>>>>> +* do not interrupt the transfer, e.g. a PV on the node can be set as
>>>>> +  unallocatable, and the transfer still proceeds as expected,
>>>>> +
>>>>> +* do not care, e.g. a gnt-node powercycle explicitly ignores all
>>>>> locks.
>>>>> +
>>>>> +This is an invariant to be kept in mind for future development, but
>>>>> at the
>>>>> +current time, no additional locks are needed.
>>>>>
>>>>
>>> I'm a bit confused about what is the conclusion of this section. Does it
>>> propose any lock changes (reduction)? Or just proposes adding retries for
>>> instance creation if opportunistic locking fails?
>>>
>>> There is no general reduction in lock types acquired, nor can locks be
>> released earlier. Opportunistic locking may result in earlier execution of
>> operations, but it is just a matter of using it as the feature is already
>> present.
>> I will rewrite this to improve clarity.
>>
>>
>>> Perhaps we should rather aim for improving opportunistic locking in
>>> general, allowing these parameters for all LUs that use opportunistic
>>> locking. There are other LUs that use opportunistic locking as well.
>>>
>>
>> That is a good point, but the scope of this change would be much greater
>> than the one proposed in this design document. When retrying, the
>> move-instance tool can simply issue another creation job, identical to the
>> previous one. Adding the option to the LU itself would mean introducing a
>> mechanism for the automatic retrying of LUs. While this can and probably
>> should be done, it is a much greater refactoring of the jobs in Ganeti and
>> should be undertaken separately.
>>
>>
>>>
>>>  +
>>>>> +Introduction of changes
>>>>> +=======================
>>>>> +
>>>>> +Both the instance zeroing and the lock reduction will be implemented
>>>>> as a part
>>>>> +of Ganeti 2.12, in the way described in the previous chapters. They
>>>>> will be
>>>>> +implemented as separate changes, first the lock reduction, and then
>>>>> the instance
>>>>> +zeroing due to the implementation overlapping and benefitting from
>>>>> the changes
>>>>> +needed for the OS installation improvements.
>>>>> --
>>>>> 1.7.10.4
>>>>>
>>>>>
>>>> Would it make sense to share this design doc as well with the SRE's? I
>>>> know that climent@ filed the bug about instance moves, but he's not
>>>> working on it any more. So ganeti-sre@ or ganeti-team@ might be
>>>> appropriate.
>>>>
>>>> Cheers,
>>>> Thomas
>>>>
>>>>
>>>> --
>>>> Thomas Thrainer | Software Engineer | [email protected] |
>>>>
>>>> Google Germany GmbH
>>>> Dienerstr. 12
>>>> 80331 München
>>>>
>>>> Registergericht und -nummer: Hamburg, HRB 86891
>>>> Sitz der Gesellschaft: Hamburg
>>>> Geschäftsführer: Graham Law, Christine Elizabeth Flores
>>>>
>>>
>>>
>>
>
>
> --
> Thomas Thrainer | Software Engineer | [email protected] |
>
> Google Germany GmbH
> Dienerstr. 12
> 80331 München
>
> Registergericht und -nummer: Hamburg, HRB 86891
> Sitz der Gesellschaft: Hamburg
> Geschäftsführer: Graham Law, Christine Elizabeth Flores
>

Re: [RFC master] Add move instance improvements design document

Reply via email to