On Fri, Feb 7, 2014 at 3:24 PM, Hrvoje Ribicic <[email protected]> wrote:
> On Fri, Feb 7, 2014 at 10:17 AM, Petr Pudlák <[email protected]> wrote:
>>
>> On Thu, Feb 6, 2014 at 9:47 AM, Thomas Thrainer <[email protected]> wrote:
>>>
>>> On Tue, Feb 4, 2014 at 6:05 PM, Hrvoje Ribicic <[email protected]> wrote:
>>>>
>>>> This patch adds a design document exploring zeroing and lock reduction as options for improving the performance and parallelism of cross-cluster instance moves.
>>>>
>>>> Signed-off-by: Hrvoje Ribicic <[email protected]>
>>>> ---
>>>>  doc/design-move-instance-improvements.rst | 182 +++++++++++++++++++++++++++++
>>>>  1 file changed, 182 insertions(+)
>>>>  create mode 100644 doc/design-move-instance-improvements.rst
>>>>
>>>> diff --git a/doc/design-move-instance-improvements.rst b/doc/design-move-instance-improvements.rst
>>>> new file mode 100644
>>>> index 0000000..22b4bf5
>>>> --- /dev/null
>>>> +++ b/doc/design-move-instance-improvements.rst
>>>> @@ -0,0 +1,182 @@
>>>> +========================================
>>>> +Cross-cluster instance move improvements
>>>> +========================================
>>>> +
>>>> +.. contents:: :depth: 3
>>>> +
>>>> +To move instances across clusters, Ganeti provides the move-instance tool. It uses the RAPI to create new instances in the destination cluster, ready to import data from instances in the source cluster.
>>>> +
>>>> +The tool works correctly and reliably, but depending on bandwidth and priority, an instance disk of considerable size requires a long time to transfer. This is inconvenient at best, and can be remedied either by reducing the length of the transfers, or by allowing more operations to run in parallel with instance moves.
>>>> +
>>>> +The former can be achieved by zeroing the empty space on instance disks and compressing them prior to transfer, the latter by reducing the amount of locking happening during an instance move. As the approaches aim to tackle two different aspects of the problem, they do not exclude each other and will be presented independently.
>>>> +
>>>> +Zeroing instance disks
>>>> +======================
>>>> +
>>>> +Support for disk compression during instance moves was partially present before, but cleaned up and explicitly added as the --compress option only as of Ganeti 2.10. While compression lowers the amount of data sent, further gains can be achieved by taking advantage of the structure of the disk - namely, by sending only used disk sectors.
>>>> +
>>>> +There is no direct way to achieve this, as it would require that the move-instance tool is aware of the structure of the file system. Mounting the filesystem is not an option, primarily due to security issues. A disk primed to take advantage of a disk driver exploit could allow an attacker to breach instance isolation and gain control of a Ganeti node.
>>>> +
>>>> +An indirect way to achieve this performance gain is the zeroing of the empty hard disk space. Sequences of zeroes can be compressed and thus transferred very efficiently, all without the host knowing that they are empty space. This approach can also be dangerous if a sparse disk is zeroed in this way, causing ballooning. As Ganeti does not seem to make special concessions for moving sparse disks, the only difference should be the disk space utilization on the current node.
>>>> +
>>>> +Zeroing approaches
>>>> +++++++++++++++++++
>>>> +
>>>> +Zeroing is a feasible approach, but the node cannot perform it, as it cannot mount the disk. Only virtualization-based options remain, and of those, using Ganeti's own virtualization capabilities makes the most sense. There are two ways of doing this - creating a new helper instance, temporary or persistent, or reusing the target instance.
>>>> +
>>>> +Both approaches have their disadvantages. Creating a new helper instance requires managing its lifecycle, taking special care to make sure no helper instance remains left over due to a failed operation. Even if this were taken care of, disks are not yet separate entities in Ganeti, making the temporary transfer of disks between instances hard to implement and even harder to make robust. The reuse can be done by modifying the OS running on the instance to perform the zeroing itself when notified via the new instance communication mechanism, but this approach is neither generic nor particularly safe. There is no guarantee that the zeroing operation will not interfere with the normal operation of the instance, nor that it will be completed if a user-initiated shutdown occurs.
>>>> +
>>>> +A better solution can be found by combining the two approaches - reusing the virtualized environment, but with a specifically crafted OS image. With the instance shut down, as it should be in preparation for the move, it can be extended with an additional disk holding the OS image. By prepending the disk and changing some instance parameters, the instance can boot from it. The OS can be configured to perform the zeroing on startup, attempting to mount any partitions with a filesystem present, and creating and deleting a zero-filled file on them. After the zeroing is complete, the OS should shut down, and the master should note the shutdown and restore the instance to its previous state.
>>>> +
>>>> +Note that the requirements above are very similar to the notion of a helper VM suggested in the OS install document. Some potentially unsafe actions are performed within a virtualized environment, acting on disks that belong or will belong to the instance. The mechanisms used will thus be developed with both approaches in mind.
>>>> +
>>>> +Implementation
>>>> +++++++++++++++
>>>> +
>>>> +There are two components to this solution - the Ganeti changes needed to boot the OS, and the OS image used for the zeroing. Due to the variety of filesystems and architectures that instances can use, no single ready-to-run disk image can satisfy the needs of all Ganeti users. Instead, the instance-debootstrap scripts can be used to generate a zeroing-capable OS image. This might not be ideal, as there are lightweight distributions that take up less space and boot up more quickly, but generating those with the right set of drivers for the virtualization platform of choice is not easy. Thus we do not provide a script for this purpose; the user is free to provide any OS image which performs the necessary steps: zero out all virtualization-provided devices on startup, then shut down immediately. The cluster-wide parameter controlling the image to be used would be called zeroing-image.
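
Just to make the expectations on such an image more concrete, here is a rough sketch of what a zeroing script inside the image might do. This is purely illustrative and not part of the patch; it assumes a Linux guest with the usual mount/umount/poweroff tools available, and all names in it are hypothetical.

#!/usr/bin/env python
# Hypothetical zeroing script run on boot inside the zeroing image.
# It fills the free space of every mountable partition with zeroes,
# removes the file again, and powers the VM off. A real script would
# skip its own root/boot device and could also zero swap partitions,
# as suggested elsewhere in this thread.
import glob
import os
import subprocess
import tempfile

CHUNK = 4 * 1024 * 1024  # write zeroes in 4 MiB chunks


def zero_free_space(mountpoint):
    # Create a zero-filled file until the filesystem is full, then delete it.
    fd, path = tempfile.mkstemp(dir=mountpoint, prefix="zero-fill-")
    f = os.fdopen(fd, "wb", 0)  # unbuffered, so ENOSPC surfaces on write
    try:
        chunk = b"\0" * CHUNK
        while True:
            f.write(chunk)
    except (IOError, OSError):
        # Disk full - the free space has now been overwritten with zeroes.
        pass
    finally:
        f.close()
        os.unlink(path)


def main():
    # /dev/vd* and /dev/sd* cover the usual virtio/SCSI guest device names.
    for dev in sorted(glob.glob("/dev/vd*") + glob.glob("/dev/sd*")):
        mountpoint = tempfile.mkdtemp()
        # Try to mount; devices without a usable filesystem are skipped.
        if subprocess.call(["mount", dev, mountpoint]) != 0:
            continue
        try:
            zero_free_space(mountpoint)
        finally:
            subprocess.call(["umount", mountpoint])
    subprocess.call(["poweroff"])


if __name__ == "__main__":
    main()
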
>>>> +
>>>> +The modifications needed to the Ganeti code are minor. The zeroing functionality should be implemented as an extension of the instance export, and exposed as the --zero-free-space option. Prior to beginning the export, the instance configuration is temporarily extended with a new read-only disk of sufficient size to host the zeroing image, and with the changes necessary for the image to be used as the boot drive. The temporary nature of the configuration changes requires that they are not propagated to other nodes. While this would normally not be feasible with an instance using a disk template offering multi-node redundancy, experiments with the code have shown that the restriction on diverse disk templates can be bypassed to temporarily allow a plain disk-template disk to host the zeroing image. The image is dumped to the disk, and the instance is started up.
>>>> +
>>>> +Once the instance is started up, the zeroing will proceed until completion, when a self-initiated shutdown will occur. The instance-shutdown detection capabilities of 2.11 should prevent the watcher from restarting the instance once this happens, allowing the host to take it as a sign that the zeroing was completed. Either way, the host waits until the instance is shut down, or until a user-defined timeout has been reached and the instance is forcibly shut down.
>>>
>>> This timeout should be dependent on the size of the disks of the instance. Zeroing 300GB can take some time, and such instances could happily exist next to 10GB ones...
>>>
>>>> +
>>>> +Better progress monitoring can be implemented with the instance-host communication channel proposed by the OS install design document. The first version will most likely use only the shutdown detection, and will be improved to account for the available communication channel at a later time.
>>>> +
>>>> +After the shutdown, the temporary disk is destroyed and the instance configuration is reverted to its original state. The very same action is taken if any error is encountered during the zeroing process. In the case that the zeroing is interrupted while the zero-filled file is being written, there is little that can be done to recover. One precautionary measure is to place the file in the /tmp directory on Unix systems, if one exists and can be identified as such. Even if TmpFS is mounted there, it is the most likely location to be cleaned up in case of failure.
>>>
>>> If TmpFS is mounted there, it would hide the zero-file from the user, making it harder to recover manually from such a problem. Also, if the filesystem is not the root filesystem of the guest but usually mounted under e.g. /home, there wouldn't be a /tmp directory... Anyway, both approaches have advantages and disadvantages, so I would personally go for the easier one.
>>
>> I wouldn't be so sure about /tmp being cleaned up if it's a mount-point for TmpFS or another separate partition. I guess an OS first mounts partitions and only then cleans up /tmp.
>>
>
> Ack - that part will be removed.
>
>>>
>>> Another note: the OS image could/should also zero all swap partitions completely in order to save some more space.
>>>
>>> Something I'm missing in this part of the design is a discussion of compression methods (maybe with a lot of zeros something really fast can be used)
>>>
>>
>> I've had a good experience with lzop: http://en.wikipedia.org/wiki/Lzop
>> It's _very_ fast compared to other compression tools, so it definitely wouldn't be a bottleneck, and for blocks of zeroes it would work just as well as any other algorithm. I tried to compress 1GB of zeroes, it took 0.5s and got compressed into 4.5MB:
>>
>> dd bs=1MB count=1024 if=/dev/zero | lzop | wc --bytes
>> 1024+0 records in
>> 1024+0 records out
>> 1024000000 bytes (1.0 GB) copied, 2.50511 s, 409 MB/s
>> 4668035
>>
>
> I actually do not know if 1GB of zeroes is a good benchmark for a compression tool in this case. With an ext? filesystem, the empty space is likely to be very fragmented, with pockets of zeroed free space scattered amongst files. My hunch is also that speed rules, as the ratio will be just about the same for all compression tools, but I would like to do some testing on a more realistic-looking drive first. The choice of compression tool used would certainly be added as an option.
>

That was the result of my tests as well - slower compression algorithms didn't produce much smaller results but took way longer. So lzop might be a good option if it's available.

>>>
>>> and/or a (semi-)automated way of figuring out if zeroing+compression is faster than just sending the whole data. I agree that this is a bit out of scope for now, but the user should at least have the option to enable or disable zeroing. For future work, move-instance could get a rough measurement of the throughput between the clusters and could then decide, based on the size of the instance disks and some heuristics, if zeroing makes sense.
>>>
>>> Another thing missing is the discussion of encryption algorithms. The method used to encrypt the data sent from one cluster to the other can be configured and plays quite a big role throughput-wise. We could give users the choice to use another (possibly weaker) encryption if they want more speed and/or review the choice we've made.
>>>
>>
>> It'd be interesting to make some tests and measure the impact of various encryption algorithms. I remember using Blowfish with SSH to reduce CPU load and speed up transfers, but perhaps nowadays with faster CPUs and optimizations in encryption algorithms the difference isn't so large.
>>
>
> I guess that for cross-cluster transfers, the limiting factor is the bandwidth and not the speed of encryption, but I might be wrong. Testing it is :)
>

As far as I can remember, encryption was dead-slow in my benchmarks and actually was the limiting factor (except when compression with a full disk was used, which resulted in really bad compression performance). Without encryption, moves could be performed almost twice as quickly in some cases.
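
Coming back to the compression question: the "speed rules, the ratio is about the same for zeroes" hunch is easy to sanity-check even without lzop, e.g. with the compressors from the Python standard library. This is only an illustration (Python 3; lzop itself has no stdlib binding), not a proposal for what the tool should use:

# Compare stdlib compressors on all-zero data versus incompressible
# (random) data, printing throughput and compression ratio for each.
import bz2
import lzma
import os
import time
import zlib

SIZE = 16 * 1024 * 1024  # 16 MiB keeps the slow compressors bearable

samples = {
    "zeroes": b"\0" * SIZE,
    "random": os.urandom(SIZE),
}

compressors = {
    "zlib -1": lambda data: zlib.compress(data, 1),
    "zlib -9": lambda data: zlib.compress(data, 9),
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}

for sample_name, data in samples.items():
    for comp_name, compress in compressors.items():
        start = time.time()
        out = compress(data)
        elapsed = time.time() - start
        print("%-8s %-8s %8.2f MB/s  ratio %.5f"
              % (sample_name, comp_name, SIZE / elapsed / 1e6,
                 len(out) / SIZE))
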
>>>> +
>>>> +Lock reduction
>>>> +==============
>>>> +
>>>> +An instance move as executed by the move-instance tool consists of several preparatory RAPI calls, leading up to two long-lasting opcodes: OpCreateInstance and OpBackupExport. While OpBackupExport locks only the instance, the locks of OpCreateInstance require more attention.
>>>> +
>>>> +When executed, this opcode attempts to lock all nodes on which the instance may be created and obtain shared locks on the groups they belong to. In the case that an IAllocator is used, this means all nodes must be locked. Any operation that requires a node lock to be present can delay the move operation, and there is no shortage of these.
>>>> +
>>>> +The concept of opportunistic locking has been introduced to remedy exactly this situation, allowing the IAllocator to grab as many node locks as possible. Depending on how many nodes were available, the operation either proceeds as expected, or fails noting that it is temporarily infeasible. The failure case is unacceptable for the move-instance tool, which is expected to fail only if the move is impossible. To yield the benefits of opportunistic locking yet satisfy this constraint, the move-instance tool can be extended with the --opportunistic-tries and --opportunistic-try-delay options. A number of opportunistic instance creations are attempted, with a delay between attempts.
>>
>> Definitely the delays should be randomized to avoid inadvertently synchronized simultaneous attempts by multiple jobs.
>
> Ack.
>
>>>> +Should they all fail, a normal and blocking instance creation is requested.
>>
>> I don't fully understand this. Does it mean that if opportunistic locking using an IAllocator fails, it'd fall back to just trying to pick up any node (or any two nodes) available?
>
> No, it'd fall back to a non-opportunistic use of an IAllocator, blocking the execution of the move until all the node locks on the target cluster can be acquired. Will rewrite.
>
>>>> +
>>>> +While it may seem excessive to grab so many node locks, the early release mechanism is used to make the situation less dire, releasing all nodes that were not chosen as candidates for allocation. This is taken to the extreme as all the locks acquired are released prior to the start of the transfer, barring the newly-acquired lock over the new instance. This works because all operations that alter the node in a way which could affect the transfer:
>>>> +
>>>> +* are prevented by the instance lock or instance presence, e.g. gnt-node remove, gnt-node evacuate,
>>>> +
>>>> +* do not interrupt the transfer, e.g. a PV on the node can be set as unallocatable, and the transfer still proceeds as expected,
>>>> +
>>>> +* do not care, e.g. a gnt-node powercycle explicitly ignores all locks.
>>>> +
>>>> +This is an invariant to be kept in mind for future development, but at the current time, no additional locks are needed.
>>
>> I'm a bit confused about what is the conclusion of this section. Does it propose any lock changes (reduction)? Or just proposes adding retries for instance creation if opportunistic locking fails?
>>
>
> There is no general reduction in lock types acquired, nor can locks be released earlier. Opportunistic locking may result in earlier execution of operations, but it is just a matter of using it as the feature is already present. I will rewrite this to improve clarity.
>
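
To make the retry semantics concrete, the client-side logic in move-instance could look roughly like the sketch below. This is hypothetical, not the actual Ganeti code: create_instance() merely stands in for the RAPI instance-creation call, and the delay is jittered as suggested above.

# Hypothetical sketch of the opportunistic retry loop in move-instance.
import random
import time


class TemporarilyInfeasible(Exception):
    """Raised when opportunistic locking could not acquire enough nodes."""


def create_with_retries(create_instance, tries=5, try_delay=30.0):
    # First attempt a number of opportunistic creations...
    for _ in range(tries):
        try:
            return create_instance(opportunistic_locking=True)
        except TemporarilyInfeasible:
            # Randomize the delay so that several concurrent moves do not
            # retry in lockstep against the same set of node locks.
            time.sleep(try_delay * random.uniform(0.5, 1.5))
    # ...and only then fall back to a normal, blocking creation that waits
    # until all required node locks can be acquired.
    return create_instance(opportunistic_locking=False)
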
>>
>> Perhaps we should rather aim for improving opportunistic locking in general, allowing these parameters for all LUs that use opportunistic locking. There are other LUs that use opportunistic locking as well.
>>
>
> That is a good point, but the scope of this change would be much greater than the one proposed in this design document. When retrying, the move-instance tool can simply issue another creation job, identical to the previous one. Adding the option to the LU itself would mean introducing a mechanism for the automatic retrying of LUs. While this can and probably should be done, it is a much greater refactoring of the jobs in Ganeti and should be undertaken separately.
>
>>>> +
>>>> +Introduction of changes
>>>> +=======================
>>>> +
>>>> +Both the instance zeroing and the lock reduction will be implemented as a part of Ganeti 2.12, in the way described in the previous chapters. They will be implemented as separate changes, first the lock reduction and then the instance zeroing, due to the latter's implementation overlapping with and benefitting from the changes needed for the OS installation improvements.
>>>> --
>>>> 1.7.10.4
>>>
>>> Would it make sense to share this design doc as well with the SRE's? I know that climent@ filed the bug about instance moves, but he's not working on it any more. So ganeti-sre@ or ganeti-team@ might be appropriate.
>>>
>>> Cheers,
>>> Thomas

--
Thomas Thrainer | Software Engineer | [email protected]

Google Germany GmbH
Dienerstr. 12
80331 München

Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Graham Law, Christine Elizabeth Flores
