On Fri, Feb 7, 2014 at 3:24 PM, Hrvoje Ribicic <[email protected]> wrote:
> On Fri, Feb 7, 2014 at 10:17 AM, Petr Pudlák <[email protected]> wrote:
>>
>> On Thu, Feb 6, 2014 at 9:47 AM, Thomas Thrainer <[email protected]> wrote:
>>>
>>> On Tue, Feb 4, 2014 at 6:05 PM, Hrvoje Ribicic <[email protected]> wrote:
>>>>
>>>> This patch adds a design document exploring zeroing and lock reduction as options for improving the performance and parallelism of cross-cluster instance moves.
>>>>
>>>> Signed-off-by: Hrvoje Ribicic <[email protected]>
>>>> ---
>>>>  doc/design-move-instance-improvements.rst | 182 +++++++++++++++++++++++++++++
>>>>  1 file changed, 182 insertions(+)
>>>>  create mode 100644 doc/design-move-instance-improvements.rst
>>>>
>>>> diff --git a/doc/design-move-instance-improvements.rst b/doc/design-move-instance-improvements.rst
>>>> new file mode 100644
>>>> index 0000000..22b4bf5
>>>> --- /dev/null
>>>> +++ b/doc/design-move-instance-improvements.rst
>>>> @@ -0,0 +1,182 @@
>>>> +========================================
>>>> +Cross-cluster instance move improvements
>>>> +========================================
>>>> +
>>>> +.. contents:: :depth: 3
>>>> +
>>>> +To move instances across clusters, Ganeti provides the move-instance tool. It uses the RAPI to create new instances in the destination cluster, ready to import data from instances in the source cluster.
>>>> +
>>>> +The tool works correctly and reliably, but depending on bandwidth and priority, an instance disk of considerable size requires a long time to transfer. This is inconvenient at best, and can be remedied either by reducing the length of the transfers, or by allowing more operations to run in parallel with instance moves.
>>>> +
>>>> +The former can be achieved by zeroing the empty space on instance disks and compressing them prior to transfer, the latter by reducing the amount of locking happening during an instance move. As the approaches aim to tackle two different aspects of the problem, they do not exclude each other and will be presented independently.
>>>> +
>>>> +Zeroing instance disks
>>>> +======================
>>>> +
>>>> +Support for disk compression during instance moves was partially present before, but cleaned up and explicitly added as the --compress option only as of Ganeti 2.10. While compression lowers the amount of data sent, further gains can be achieved by taking advantage of the structure of the disk - namely, by sending only used disk sectors.
>>>> +
>>>> +There is no direct way to achieve this, as it would require that the move-instance tool is aware of the structure of the file system. Mounting the filesystem is not an option, primarily due to security issues. A disk primed to take advantage of a disk driver exploit could allow an attacker to breach instance isolation and gain control of a Ganeti node.
>>>> +
>>>> +An indirect way to achieve this performance gain is the zeroing of the empty hard disk space. Sequences of zeroes can be compressed and thus transferred very efficiently, all without the host knowing that they are empty space. This approach can also be dangerous if a sparse disk is zeroed in this way, causing ballooning. As Ganeti does not seem to make special concessions for moving sparse disks, the only difference should be the disk space utilization on the current node.
>>>> +
>>>> +Zeroing approaches
>>>> +++++++++++++++++++
>>>> +
>>>> +Zeroing is a feasible approach, but the node cannot perform it, as it cannot mount the disk. Only virtualization-based options remain, and of those, using Ganeti's own virtualization capabilities makes the most sense. There are two ways of doing this - creating a new helper instance, temporary or persistent, or reusing the target instance.
>>>> +
>>>> +Both approaches have their disadvantages. Creating a new helper instance requires managing its lifecycle, taking special care to make sure no helper instance remains left over due to a failed operation. Even if this were taken care of, disks are not yet separate entities in Ganeti, making the temporary transfer of disks between instances hard to implement and even harder to make robust. The reuse can be done by modifying the OS running on the instance to perform the zeroing itself when notified via the new instance communication mechanism, but this approach is neither generic nor particularly safe. There is no guarantee that the zeroing operation will not interfere with the normal operation of the instance, nor that it will be completed if a user-initiated shutdown occurs.
>>>> +
>>>> +A better solution can be found by combining the two approaches - reusing the virtualized environment, but with a specifically crafted OS image. With the instance shut down, as it should be in preparation for the move, it can be extended with an additional disk holding the OS image. By prepending the disk and changing some instance parameters, the instance can boot from it. The OS can be configured to perform the zeroing on startup, attempting to mount any partitions with a filesystem present, and creating and deleting a zero-filled file on them. After the zeroing is complete, the OS should shut down, and the master should note the shutdown and restore the instance to its previous state.
>>>> +
>>>> +Note that the requirements above are very similar to the notion of a helper VM suggested in the OS install document. Some potentially unsafe actions are performed within a virtualized environment, acting on disks that belong or will belong to the instance. The mechanisms used will thus be developed with both approaches in mind.
>>>> +
>>>> +Implementation
>>>> +++++++++++++++
>>>> +
>>>> +There are two components to this solution - the Ganeti changes needed to boot the OS, and the OS image used for the zeroing. Due to the variety of filesystems and architectures that instances can use, no single ready-to-run disk image can satisfy the needs of all Ganeti users. Instead, the instance-debootstrap scripts can be used to generate a zeroing-capable OS image. This might not be ideal, as there are lightweight distributions that take up less space and boot up more quickly, but generating those with the right set of drivers for the virtualization platform of choice is not easy. Thus we do not provide a script for this purpose; the user is free to provide any OS image which performs the necessary steps: zero out all virtualization-provided devices on startup, then shut down immediately. The cluster-wide parameter controlling the image to be used would be called zeroing-image.
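
Just to make the expectations on such an image more concrete, here is a rough sketch of what a zeroing script inside the image might do. This is purely illustrative and not part of the patch; it assumes a Linux guest with the usual mount/umount/poweroff tools available, and all names in it are hypothetical.

#!/usr/bin/env python
# Hypothetical zeroing script run on boot inside the zeroing image.
# It fills the free space of every mountable partition with zeroes,
# removes the file again, and powers the VM off. A real script would
# skip its own root/boot device and could also zero swap partitions,
# as suggested elsewhere in this thread.
import glob
import os
import subprocess
import tempfile

CHUNK = 4 * 1024 * 1024  # write zeroes in 4 MiB chunks


def zero_free_space(mountpoint):
    # Create a zero-filled file until the filesystem is full, then delete it.
    fd, path = tempfile.mkstemp(dir=mountpoint, prefix="zero-fill-")
    f = os.fdopen(fd, "wb", 0)  # unbuffered, so ENOSPC surfaces on write
    try:
        chunk = b"\0" * CHUNK
        while True:
            f.write(chunk)
    except (IOError, OSError):
        # Disk full - the free space has now been overwritten with zeroes.
        pass
    finally:
        f.close()
        os.unlink(path)


def main():
    # /dev/vd* and /dev/sd* cover the usual virtio/SCSI guest device names.
    for dev in sorted(glob.glob("/dev/vd*") + glob.glob("/dev/sd*")):
        mountpoint = tempfile.mkdtemp()
        # Try to mount; devices without a usable filesystem are skipped.
        if subprocess.call(["mount", dev, mountpoint]) != 0:
            continue
        try:
            zero_free_space(mountpoint)
        finally:
            subprocess.call(["umount", mountpoint])
    subprocess.call(["poweroff"])


if __name__ == "__main__":
    main()
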
>>>> +
>>>> +The modifications needed to the Ganeti code are minor. The zeroing functionality should be implemented as an extension of the instance export, and exposed as the --zero-free-space option. Prior to beginning the export, the instance configuration is temporarily extended with a new read-only disk of sufficient size to host the zeroing image, and with the changes necessary for the image to be used as the boot drive. The temporary nature of the configuration changes requires that they are not propagated to other nodes. While this would normally not be feasible with an instance using a disk template offering multi-node redundancy, experiments with the code have shown that the restriction on diverse disk templates can be bypassed to temporarily allow a plain disk-template disk to host the zeroing image. The image is dumped to the disk, and the instance is started up.
>>>> +
>>>> +Once the instance is started up, the zeroing will proceed until completion, when a self-initiated shutdown will occur. The instance-shutdown detection capabilities of 2.11 should prevent the watcher from restarting the instance once this happens, allowing the host to take it as a sign that the zeroing was completed. Either way, the host waits until the instance is shut down, or until a user-defined timeout has been reached and the instance is forcibly shut down.
>>>
>>> This timeout should be dependent on the size of the disks of the instance. Zeroing 300GB can take some time, and such instances could happily exist next to 10GB ones...
>>>
>>>> +
>>>> +Better progress monitoring can be implemented with the instance-host communication channel proposed by the OS install design document. The first version will most likely use only the shutdown detection, and will be improved to account for the available communication channel at a later time.
>>>> +
>>>> +After the shutdown, the temporary disk is destroyed and the instance configuration is reverted to its original state. The very same action is taken if any error is encountered during the zeroing process. In the case that the zeroing is interrupted while the zero-filled file is being written, there is little that can be done to recover. One precautionary measure is to place the file in the /tmp directory on Unix systems, if one exists and can be identified as such. Even if TmpFS is mounted there, it is the most likely location to be cleaned up in case of failure.
>>>
>>> If TmpFS is mounted there, it would hide the zero-file from the user, making it harder to recover manually from such a problem. Also, if the filesystem is not the root filesystem of the guest but usually mounted under e.g. /home, there wouldn't be a /tmp directory... Anyway, both approaches have advantages and disadvantages, so I would personally go for the easier one.
>>
>> I wouldn't be so sure about /tmp being cleaned up if it's a mount-point for TmpFS or another separate partition. I guess an OS first mounts partitions and only then cleans up /tmp.
>>
>
> Ack - that part will be removed.
>
>>>
>>> Another note: the OS image could/should also zero all swap partitions completely in order to save some more space.
>>>
>>> Something I'm missing in this part of the design is a discussion of compression methods (maybe with a lot of zeros something really fast can be used)
>>>
>>
>> I've had a good experience with lzop: http://en.wikipedia.org/wiki/Lzop
>> It's _very_ fast compared to other compression tools, so it definitely wouldn't be a bottleneck, and for blocks of zeroes it would work just as well as any other algorithm. I tried to compress 1GB of zeroes, it took 0.5s and got compressed into 4.5MB:
>>
>> dd bs=1MB count=1024 if=/dev/zero | lzop | wc --bytes
>> 1024+0 records in
>> 1024+0 records out
>> 1024000000 bytes (1.0 GB) copied, 2.50511 s, 409 MB/s
>> 4668035
>>
>
> I actually do not know if 1GB of zeroes is a good benchmark for a compression tool in this case. With an ext? filesystem, the empty space is likely to be very fragmented, with pockets of zeroed free space scattered amongst files. My hunch is also that speed rules, as the ratio will be just about the same for all compression tools, but I would like to do some testing on a more realistic-looking drive first. The choice of compression tool used would certainly be added as an option.
>

That was the result of my tests as well - slower compression algorithms didn't produce much smaller results but took way longer. So lzop might be a good option if it's available.

>>>
>>> and/or a (semi-)automated way of figuring out if zeroing+compression is faster than just sending the whole data. I agree that this is a bit out of scope for now, but the user should at least have the option to enable or disable zeroing. For future work, move-instance could get a rough measurement of the throughput between the clusters and could then decide, based on the size of the instance disks and some heuristics, if zeroing makes sense.
>>>
>>> Another thing missing is the discussion of encryption algorithms. The method used to encrypt the data sent from one cluster to the other can be configured and plays quite a big role throughput-wise. We could give users the choice to use another (possibly weaker) encryption if they want more speed and/or review the choice we've made.
>>>
>>
>> It'd be interesting to make some tests and measure the impact of various encryption algorithms. I remember using Blowfish with SSH to reduce CPU load and speed up transfers, but perhaps nowadays with faster CPUs and optimizations in encryption algorithms the difference isn't so large.
>>
>
> I guess that for cross-cluster transfers, the limiting factor is the bandwidth and not the speed of encryption, but I might be wrong. Testing it is :)
>

As far as I can remember, encryption was dead-slow in my benchmarks and actually was the limiting factor (except when compression with a full disk was used, which resulted in really bad compression performance). Without encryption, moves could be performed almost twice as quickly in some cases.
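
Coming back to the compression question: the "speed rules, the ratio is about the same for zeroes" hunch is easy to sanity-check even without lzop, e.g. with the compressors from the Python standard library. This is only an illustration (Python 3; lzop itself has no stdlib binding), not a proposal for what the tool should use:

# Compare stdlib compressors on all-zero data versus incompressible
# (random) data, printing throughput and compression ratio for each.
import bz2
import lzma
import os
import time
import zlib

SIZE = 16 * 1024 * 1024  # 16 MiB keeps the slow compressors bearable

samples = {
    "zeroes": b"\0" * SIZE,
    "random": os.urandom(SIZE),
}

compressors = {
    "zlib -1": lambda data: zlib.compress(data, 1),
    "zlib -9": lambda data: zlib.compress(data, 9),
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}

for sample_name, data in samples.items():
    for comp_name, compress in compressors.items():
        start = time.time()
        out = compress(data)
        elapsed = time.time() - start
        print("%-8s %-8s %8.2f MB/s  ratio %.5f"
              % (sample_name, comp_name, SIZE / elapsed / 1e6,
                 len(out) / SIZE))
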
>>>> +
>>>> +Lock reduction
>>>> +==============
>>>> +
>>>> +An instance move as executed by the move-instance tool consists of several preparatory RAPI calls, leading up to two long-lasting opcodes: OpCreateInstance and OpBackupExport. While OpBackupExport locks only the instance, the locks of OpCreateInstance require more attention.
>>>> +
>>>> +When executed, this opcode attempts to lock all nodes on which the instance may be created and obtain shared locks on the groups they belong to. In the case that an IAllocator is used, this means all nodes must be locked. Any operation that requires a node lock to be present can delay the move operation, and there is no shortage of these.
>>>> +
>>>> +The concept of opportunistic locking has been introduced to remedy exactly this situation, allowing the IAllocator to grab as many node locks as possible. Depending on how many nodes were available, the operation either proceeds as expected, or fails noting that it is temporarily infeasible. The failure case is unacceptable for the move-instance tool, which is expected to fail only if the move is impossible. To yield the benefits of opportunistic locking yet satisfy this constraint, the move-instance tool can be extended with the --opportunistic-tries and --opportunistic-try-delay options. A number of opportunistic instance creations are attempted, with a delay between attempts.
>>
>> Definitely the delays should be randomized to avoid inadvertently synchronized simultaneous attempts by multiple jobs.
>
> Ack.
>
>>>> +Should they all fail, a normal and blocking instance creation is requested.
>>
>> I don't fully understand this. Does it mean that if opportunistic locking using an IAllocator fails, it'd fall back to just trying to pick up any node (or any two nodes) available?
>
> No, it'd fall back to a non-opportunistic use of an IAllocator, blocking the execution of the move until all the node locks on the target cluster can be acquired. Will rewrite.
>
>>>> +
>>>> +While it may seem excessive to grab so many node locks, the early release mechanism is used to make the situation less dire, releasing all nodes that were not chosen as candidates for allocation. This is taken to the extreme as all the locks acquired are released prior to the start of the transfer, barring the newly-acquired lock over the new instance. This works because all operations that alter the node in a way which could affect the transfer:
>>>> +
>>>> +* are prevented by the instance lock or instance presence, e.g. gnt-node remove, gnt-node evacuate,
>>>> +
>>>> +* do not interrupt the transfer, e.g. a PV on the node can be set as unallocatable, and the transfer still proceeds as expected,
>>>> +
>>>> +* do not care, e.g. a gnt-node powercycle explicitly ignores all locks.
>>>> +
>>>> +This is an invariant to be kept in mind for future development, but at the current time, no additional locks are needed.
>>
>> I'm a bit confused about what is the conclusion of this section. Does it propose any lock changes (reduction)? Or just proposes adding retries for instance creation if opportunistic locking fails?
>>
>
> There is no general reduction in lock types acquired, nor can locks be released earlier. Opportunistic locking may result in earlier execution of operations, but it is just a matter of using it as the feature is already present. I will rewrite this to improve clarity.
>
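
To make the retry semantics concrete, the client-side logic in move-instance could look roughly like the sketch below. This is hypothetical, not the actual Ganeti code: create_instance() merely stands in for the RAPI instance-creation call, and the delay is jittered as suggested above.

# Hypothetical sketch of the opportunistic retry loop in move-instance.
import random
import time


class TemporarilyInfeasible(Exception):
    """Raised when opportunistic locking could not acquire enough nodes."""


def create_with_retries(create_instance, tries=5, try_delay=30.0):
    # First attempt a number of opportunistic creations...
    for _ in range(tries):
        try:
            return create_instance(opportunistic_locking=True)
        except TemporarilyInfeasible:
            # Randomize the delay so that several concurrent moves do not
            # retry in lockstep against the same set of node locks.
            time.sleep(try_delay * random.uniform(0.5, 1.5))
    # ...and only then fall back to a normal, blocking creation that waits
    # until all required node locks can be acquired.
    return create_instance(opportunistic_locking=False)
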
>>
>> Perhaps we should rather aim for improving opportunistic locking in general, allowing these parameters for all LUs that use opportunistic locking. There are other LUs that use opportunistic locking as well.
>>
>
> That is a good point, but the scope of this change would be much greater than the one proposed in this design document. When retrying, the move-instance tool can simply issue another creation job, identical to the previous one. Adding the option to the LU itself would mean introducing a mechanism for the automatic retrying of LUs. While this can and probably should be done, it is a much greater refactoring of the jobs in Ganeti and should be undertaken separately.
>
>>>> +
>>>> +Introduction of changes
>>>> +=======================
>>>> +
>>>> +Both the instance zeroing and the lock reduction will be implemented as a part of Ganeti 2.12, in the way described in the previous chapters. They will be implemented as separate changes, first the lock reduction and then the instance zeroing, due to the latter's implementation overlapping with and benefitting from the changes needed for the OS installation improvements.
>>>> --
>>>> 1.7.10.4
>>>
>>> Would it make sense to share this design doc as well with the SRE's? I know that climent@ filed the bug about instance moves, but he's not working on it any more. So ganeti-sre@ or ganeti-team@ might be appropriate.
>>>
>>> Cheers,
>>> Thomas

--
Thomas Thrainer | Software Engineer | [email protected]

Google Germany GmbH
Dienerstr. 12
80331 München

Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Graham Law, Christine Elizabeth Flores
