To address many of the comments here, I made significant changes to the document, and sent it out as a separate mail which should have a very similar title.
On Tue, Feb 11, 2014 at 8:48 AM, Thomas Thrainer <[email protected]>wrote: > > > > On Fri, Feb 7, 2014 at 3:24 PM, Hrvoje Ribicic <[email protected]> wrote: > >> On Fri, Feb 7, 2014 at 10:17 AM, Petr Pudlák <[email protected]> wrote: >> >>> >>> >>> >>> On Thu, Feb 6, 2014 at 9:47 AM, Thomas Thrainer <[email protected]>wrote: >>> >>>> >>>> >>>> >>>> On Tue, Feb 4, 2014 at 6:05 PM, Hrvoje Ribicic <[email protected]> wrote: >>>> >>>>> This patch adds a design document exploring zeroing and lock reduction >>>>> as options for the improved performance and parallelism of >>>>> cross-cluster instance moves. >>>>> >>>>> Signed-off-by: Hrvoje Ribicic <[email protected]> >>>>> --- >>>>> doc/design-move-instance-improvements.rst | 182 >>>>> +++++++++++++++++++++++++++++ >>>>> 1 file changed, 182 insertions(+) >>>>> create mode 100644 doc/design-move-instance-improvements.rst >>>>> >>>>> diff --git a/doc/design-move-instance-improvements.rst >>>>> b/doc/design-move-instance-improvements.rst >>>>> new file mode 100644 >>>>> index 0000000..22b4bf5 >>>>> --- /dev/null >>>>> +++ b/doc/design-move-instance-improvements.rst >>>>> @@ -0,0 +1,182 @@ >>>>> +======================================== >>>>> +Cross-cluster instance move improvements >>>>> +======================================== >>>>> + >>>>> +.. contents:: :depth: 3 >>>>> + >>>>> +To move instances across clusters, Ganeti provides the move-instance >>>>> tool. It >>>>> +uses the RAPI to create new instances in the destination cluster, >>>>> ready to >>>>> +import data from instances in the source cluster. >>>>> + >>>>> +The tool works correctly and reliably, but depending on bandwidth and >>>>> priority, >>>>> +an instance disk of considerable size requires a long time to >>>>> transfer. This is >>>>> +inconvenient at best, and can be remedied by either reducing the >>>>> length of the >>>>> +transfers, or allowing more operations to run in parallel with >>>>> instance moves. >>>>> + >>>>> +The former can be achieved through the zeroing of empty space on >>>>> instance disks >>>>> +and compressing them prior to transfer, and the latter by reducing >>>>> the amount of >>>>> +locking happening during an instance move. As the approaches aim to >>>>> tackle two >>>>> +different aspects of the problem, they do not exclude each other and >>>>> will be >>>>> +presented independently. >>>>> + >>>>> +Zeroing instance disks >>>>> +====================== >>>>> + >>>>> +Support for disk compression during instance moves was partially >>>>> present before, >>>>> +but cleaned up and explicitly added as the --compress option only as >>>>> of Ganeti >>>>> +2.10. While compression lowers the amount of data sent, further gains >>>>> can be >>>>> +achieved by taking advantage of the structure of the disk - namely, >>>>> sending only >>>>> +used disk sectors. >>>>> + >>>>> +There is no direct way to achieve this, as it would require that the >>>>> +move-instance tool is aware of the structure of the file system. >>>>> Mounting the >>>>> +filesystem is not an option, primarily due to security issues. A disk >>>>> primed to >>>>> +take advantage of a disk driver exploit could cause an attacker to >>>>> breach >>>>> +instance isolation and gain control of a Ganeti node. >>>>> + >>>>> +An indirect way for this performance gain to be achieved is the >>>>> zeroing of the >>>>> +empty hard disk space. Sequences of zeroes can be compressed and thus >>>>> +transferred very efficiently, all without the host knowing that these >>>>> are empty >>>>> +space. 
This approach can also be dangerous if a sparse disk is zeroed >>>>> in this >>>>> +way, causing ballooning. As Ganeti does not seem to make special >>>>> concessions for >>>>> +moving sparse disks, the only difference should be the disk space >>>>> utilization >>>>> +on the current node. >>>>> + >>>>> +Zeroing approaches >>>>> +++++++++++++++++++ >>>>> + >>>>> +Zeroing is a feasible approach, but the node cannot perform it as it >>>>> cannot >>>>> +mount the disk. Only virtualization-based options remain, and of >>>>> those, using >>>>> +Ganeti's own virtualization capabilities makes the most sense. There >>>>> are two >>>>> +ways of doing this - creating a new helper instance, temporary or >>>>> persistent, or >>>>> +reusing the target instance. >>>>> + >>>>> +Both approaches have their disadvantages. Creating a new helper >>>>> instance >>>>> +requires managing its lifecycle, taking special care to make sure no >>>>> helper >>>>> +instance remains left over due to a failed operation. Even if this >>>>> were to be >>>>> +taken care of, disks are not yet separate entities in Ganeti, making >>>>> the >>>>> +temporary transfer of disks between instances hard to implement and >>>>> even harder >>>>> +to make robust. The reuse can be done by modifying the OS running on >>>>> the >>>>> +instance to perform the zeroing itself when notified via the new >>>>> instance >>>>> +communication mechanism, but this approach is neither generic, nor >>>>> particularly >>>>> +safe. There is no guarantee that the zeroing operation will not >>>>> interfere with >>>>> +the normal operation of the instance, nor that it will be completed >>>>> if a >>>>> +user-initiated shutdown occurs. >>>>> + >>>>> +A better solution can be found by combining the two approaches - >>>>> re-using the >>>>> +virtualized environment, but with a specifically crafted OS image. >>>>> With the >>>>> +instance shut down as it should be in preparation for the move, it >>>>> can be >>>>> +extended with an additional disk with the OS image on it. By >>>>> prepending the >>>>> +disk and changing some instance parameters, the instance can boot >>>>> from it. The >>>>> +OS can be configured to perform the zeroing on startup, attempting to >>>>> mount any >>>>> +partitions with a filesystem present, and creating and deleting a >>>>> zero-filled >>>>> +file on them. After the zeroing is complete, the OS should shut down, >>>>> and the >>>>> +master should note the shutdown and restore the instance to its >>>>> previous state. >>>>> + >>>>> +Note that the requirements above are very similar to the notion of a >>>>> helper VM >>>>> +suggested in the OS install document. Some potentially unsafe actions >>>>> are >>>>> +performed within a virtualized environment, acting on disks that >>>>> belong or will >>>>> +belong to the instance. The mechanisms used will thus be developed >>>>> with both >>>>> +approaches in mind. >>>>> + >>>>> +Implementation >>>>> +++++++++++++++ >>>>> + >>>>> +There are two components to this solution - the Ganeti changes needed >>>>> to boot >>>>> +the OS, and the OS image used for the zeroing. Due to the variety of >>>>> filesystems >>>>> +and architectures that instances can use, no single ready-to-run disk >>>>> image can >>>>> +satisfy the needs of all the Ganeti users. Instead, the >>>>> instance-debootstrap >>>>> +scripts can be used to generate a zeroing-capable OS image. This >>>>> might not be >>>>> +ideal, as there are lightweight distributions that take up less space >>>>> and boot >>>>> +up more quickly. 
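(Purely as an illustration of the zeroing step described above, and not part of the proposed design: the init script of such a zeroing image could boil down to something like the sketch below. The device glob, the mount heuristic and the use of poweroff are assumptions on my side; a real image would have to match the device naming of the virtualization platform in use and skip the disk holding the zeroing image itself.)

#!/bin/sh
# Sketch only: zero the free space of every mountable filesystem found on
# the virtualization-provided disks, then shut down so that the host can
# pick up the completed zeroing via instance-shutdown detection.
for dev in /dev/vd* /dev/sd*; do
    [ -b "$dev" ] || continue          # skip anything that is not a block device
    mnt=$(mktemp -d)
    if mount "$dev" "$mnt" 2>/dev/null; then
        # Fill the free space with zeroes, then delete the file again;
        # dd stops on its own once the filesystem is full.
        dd if=/dev/zero of="$mnt/zero.fill" bs=1M 2>/dev/null
        rm -f "$mnt/zero.fill"
        sync
        umount "$mnt"
    fi
    rmdir "$mnt"
done
poweroff -f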
Generating those with the right set of drivers for >>>>> the >>>>> +virtualization platform of choice is not easy. Thus we do not provide >>>>> a script >>>>> +for this purpose, but the user is free to provide any OS image which >>>>> performs >>>>> +the necessary steps: zero out all virtualization-provided devices on >>>>> startup, >>>>> +shutdown immediately. The cluster-wide parameter controlling the >>>>> image to be >>>>> +used would be called zeroing-image. >>>>> + >>>>> +The modifications to Ganeti code needed are minor. The zeroing >>>>> functionality >>>>> +should be implemented as an extension of the instance export, and >>>>> exposed as the >>>>> +--zero-free-space option. Prior to beginning the export, the instance >>>>> +configuration is temporarily extended with a new read-only disk of >>>>> sufficient >>>>> +size to host the zeroing image, and the changes necessary for the >>>>> image to be >>>>> +used as the boot drive. The temporary nature of the configuration >>>>> changes >>>>> +requires that they are not propagated to other nodes. While this >>>>> would normally >>>>> +not be feasible with an instance using a disk template offering >>>>> multi-node >>>>> +redundancy, experiments with the code have shown that the restriction >>>>> on >>>>> +diverse disk templates can be bypassed to temporarily allow a plain >>>>> +disk-template disk to host the zeroing image. The image is dumped to >>>>> the disk, >>>>> +and the instance is started up. >>>>> + >>>>> +Once the instance is started up, the zeroing will proceed until >>>>> completion, when >>>>> +a self-initiated shutdown will occur. The instance-shutdown detection >>>>> +capabilities of 2.11 should prevent the watcher from restarting the >>>>> instance >>>>> +once this happens, allowing the host to take it as a sign the zeroing >>>>> was >>>>> +completed. Either way, the host waits until the instance is shut >>>>> down, or a >>>>> +user-defined timeout has been reached and the instance is forcibly >>>>> shut down. >>>>> >>>> >>>> This timeout should be dependent on the size of the disks of the >>>> instance. Zeroing 300GB can take some time, and such instances could >>>> happily exist next to 10GB ones... >>>> >>>> >>>>> + >>>>> +Better progress monitoring can be implemented with the instance-host >>>>> +communication channel proposed by the OS install design document. The >>>>> first >>>>> +version will most likely use only the shutdown detection, and will be >>>>> improved >>>>> +to account for the available communication channel at a later time. >>>>> + >>>>> +After the shutdown, the temporary disk is destroyed and the instance >>>>> +configuration is reverted to its original state. The very same action >>>>> is done if >>>>> +any error is encountered during the zeroing process. In the case that >>>>> the >>>>> +zeroing is interrupted while the zero-filled file is being written, >>>>> there is >>>>> +little that can be done to recover. One precautionary measure is to >>>>> place the >>>>> +file in the /tmp directory on Unix systems, if one exists and can be >>>>> identified >>>>> +as such. Even if TmpFS is mounted there, it is the most likely >>>>> location to be >>>>> +cleaned up in case of failure. >>>>> >>>> >>>> If TmpFS is mounted there, it would hide the zero-file from the user >>>> and making it thus harder to recover manually from such a problem. Also, if >>>> the filesystem is not the root filesystem of the guest but usually mounted >>>> under e.g. /home, there wouldn't be a /tmp directory... 
Anyway, both >>>> approaches have advantages and disadvantages, so I would personally go for >>>> the easier one. >>>> >>> >>> I wouldn't be so sure about /tmp being cleaned up, if it's a >>> mount-point for TmpFS or another separate partition. I guess an OS first >>> mounts partitions and only then cleans up /tmp. >>> >>> >> Ack - that part will be removed. >> >> >>> >>>> Another note: the OS image could/should also zero all swap partitions >>>> completely in order to save some more space. >>>> >>>> >>>> Something I'm missing in this part of the design is a discussion of >>>> compression-methods (maybe with a lot of zeros something really fast can be >>>> used) >>> >>> I've had a good experience with lzop: http://en.wikipedia.org/wiki/Lzop >>> It's _very_ fast compared to other compression tools, so definitely it >>> wouldn't be a bottleneck, and for blocks of zeroes it would work just as >>> well as any other algorithm. I tried to compress 1GB of zeroes, it took >>> 0.5s and got compressed into 4.5MB: >>> >>> dd bs=1MB count=1024 if=/dev/zero | lzop | wc --bytes >>> 1024+0 records in >>> 1024+0 records out >>> 1024000000 bytes (1.0 GB) copied, 2.50511 s, 409 MB/s >>> 4668035 >>> >>> >> >> I actually do not know if 1GB of zeroes is a good benchmark for a >> compression tool in this case. With an ext? filesystem, the empty space is >> likely to be very fragmented, with pockets of zeroed free space scattered >> amongst files. My hunch is also that speed rules as the ratio will be just >> about the same for all compression tools, but I would like to do some >> testing on a more realistic-looking drive first. The choice of compression >> tool used would certainly be added as an option. >> > > That was the result of my tests as well - slower compression algorithms > didn't produce much smaller results but took way longer. So lzop might be a > good option if it's available. > > >> >> >>> and/or a (semi-) automated way of figuring out if zeroing+compression >>>> is faster than just sending the whole data. I agree that this is a bit out >>>> of scope for now, but the user should at least have the option to enable or >>>> disable zeroing. For future work, move-instance could get a rough >>>> measurement of the throughput between the clusters and could then decide >>>> based on the size of the instance disks and some heuristics if zeroing >>>> makes sense. >>>> >>>> Another thing missing is the discussion of encryption algorithms. The >>>> method to encrypt the data sent from one cluster to the other can be >>>> configured and plays quite a big role throughput-wise. We could give users >>>> the choice to use another (possibly weaker) encryption if they want more >>>> speed and/or review the choice we've made. >>>> >>> >>> It'd be interesting to make some tests and measure the impact of various >>> encryption algorithms. I remember using Blowfish with SSH to reduce CPU >>> load and speed up transfers, but perhaps nowadays with faster CPUs and >>> optimizations in encryption algorithms the difference isn't so large. >>> >> >> I guess that for cross-cluster transfers, the limiting factor is the >> bandwidth and not the speed of encryption, but I might be wrong. 
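To get a rough feeling for this without involving Ganeti at all, the same kind of test as the lzop one above can be run against a cipher - the use of openssl and the cipher names below are just my assumptions for illustration, not necessarily what the import/export daemons actually use:

dd bs=1MB count=1024 if=/dev/zero | openssl enc -e -aes-256-cbc -pass pass:dummy > /dev/null
dd bs=1MB count=1024 if=/dev/zero | openssl enc -e -bf-cbc -pass pass:dummy > /dev/null

The throughput dd reports is then roughly how fast a single core can feed each cipher; if that figure sits well above the available inter-cluster bandwidth, encryption is unlikely to be the bottleneck, otherwise the cipher choice is worth revisiting.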
Testing it >> is :) >> > > As far as I can remember, encryption was dead-slow in my benchmarks and > actually was the limiting factor (except when compression with a full disk > was used, which resulted in really bad compression performance). Without > encryption moves could be performed almost twice as quickly in some cases. > > >> >> >>> >>> >>>> >>>> >>>>> + >>>>> +Lock reduction >>>>> +============== >>>>> + >>>>> +An instance move as executed by the move-instance tool consists of >>>>> several >>>>> +preparatory RAPI calls, leading up to two long-lasting opcodes: >>>>> OpCreateInstance >>>>> +and OpBackupExport. While OpBackupExport locks only the instance, the >>>>> locks of >>>>> +OpCreateInstance require more attention. >>>>> + >>>>> +When executed, this opcode attempts to lock all nodes on which the >>>>> instance may >>>>> +be created and obtain shared locks on the groups they belong to. In >>>>> the case >>>>> +that an IAllocator is used, this means all nodes must be locked. Any >>>>> operation >>>>> +that requires a node lock to be present can delay the move operation, >>>>> and there >>>>> +is no shortage of these. >>>>> + >>>>> +The concept of opportunistic locking has been introduced to remedy >>>>> exactly this >>>>> +situation, allowing the IAllocator to grab as many node locks as >>>>> possible. >>>>> +Depending on how many nodes were available, the operation either >>>>> proceeds as >>>>> +expected, or fails noting that it is temporarily infeasible. The >>>>> failure case >>>>> +is unacceptable for the move-instance tool, which is expected to fail >>>>> only if >>>>> +the move is impossible. To yield the benefits of opportunistic >>>>> locking yet >>>>> +satisfy this constraint, the move-instance tool can be extended with >>>>> the >>>>> +--opportunistic-tries and --opportunistic-try-delay options. A number >>>>> of >>>>> +opportunistic instance creations are attempted, with a delay between >>>>> attempts. >>>>> >>>> >>> Definitely the delays should be randomized to avoid inadvertently >>> synchronized simultaneous attempts by multiple jobs. >>> >> >> Ack. >> >> >>> >>> >>>> +Should they all fail, a normal and blocking instance creation is >>>>> requested. >>>>> >>>> >>> I don't fully understand this. Does it mean that if opportunistic >>> locking using an IAllocator fails, it'd fall back to just trying to pick up >>> any node (or any two nodes) available? >>> >> >> No, it'd fall back to a non-opportunistic use of an IAllocator, blocking >> the execution of the move until all the node locks on the target cluster >> can be acquired. Will rewrite. >> >>> >>> >>>> + >>>>> +While it may seem excessive to grab so many node locks, the early >>>>> release >>>>> +mechanism is used to make the situation less dire, releasing all >>>>> nodes that were >>>>> +not chosen as candidates for allocation. This is taken to the extreme >>>>> as all the >>>>> +locks acquired are released prior to the start of the transfer, >>>>> barring the >>>>> +newly-acquired lock over the new instance. This works because all >>>>> operations >>>>> +that alter the node in a way which could affect the transfer: >>>>> + >>>>> +* are prevented by the instance lock or instance presence, e.g. >>>>> gnt-node remove, >>>>> + gnt-node evacuate, >>>>> + >>>>> +* do not interrupt the transfer, e.g. a PV on the node can be set as >>>>> + unallocatable, and the transfer still proceeds as expected, >>>>> + >>>>> +* do not care, e.g. a gnt-node powercycle explicitly ignores all >>>>> locks. 
>>>>> + >>>>> +This is an invariant to be kept in mind for future development, but >>>>> at the >>>>> +current time, no additional locks are needed. >>>>> >>>> >>> I'm a bit confused about what is the conclusion of this section. Does it >>> propose any lock changes (reduction)? Or just proposes adding retries for >>> instance creation if opportunistic locking fails? >>> >>> There is no general reduction in lock types acquired, nor can locks be >> released earlier. Opportunistic locking may result in earlier execution of >> operations, but it is just a matter of using it as the feature is already >> present. >> I will rewrite this to improve clarity. >> >> >>> Perhaps we should rather aim for improving opportunistic locking in >>> general, allowing these parameters for all LUs that use opportunistic >>> locking. There are other LUs that use opportunistic locking as well. >>> >> >> That is a good point, but the scope of this change would be much greater >> than the one proposed in this design document. When retrying, the >> move-instance tool can simply issue another creation job, identical to the >> previous one. Adding the option to the LU itself would mean introducing a >> mechanism for the automatic retrying of LUs. While this can and probably >> should be done, it is a much greater refactoring of the jobs in Ganeti and >> should be undertaken separately. >> >> >>> >>> + >>>>> +Introduction of changes >>>>> +======================= >>>>> + >>>>> +Both the instance zeroing and the lock reduction will be implemented >>>>> as a part >>>>> +of Ganeti 2.12, in the way described in the previous chapters. They >>>>> will be >>>>> +implemented as separate changes, first the lock reduction, and then >>>>> the instance >>>>> +zeroing due to the implementation overlapping and benefitting from >>>>> the changes >>>>> +needed for the OS installation improvements. >>>>> -- >>>>> 1.7.10.4 >>>>> >>>>> >>>> Would it make sense to share this design doc as well with the SRE's? I >>>> know that climent@ filed the bug about instance moves, but he's not >>>> working on it any more. So ganeti-sre@ or ganeti-team@ might be >>>> appropriate. >>>> >>>> Cheers, >>>> Thomas >>>> >>>> >>>> -- >>>> Thomas Thrainer | Software Engineer | [email protected] | >>>> >>>> Google Germany GmbH >>>> Dienerstr. 12 >>>> 80331 München >>>> >>>> Registergericht und -nummer: Hamburg, HRB 86891 >>>> Sitz der Gesellschaft: Hamburg >>>> Geschäftsführer: Graham Law, Christine Elizabeth Flores >>>> >>> >>> >> > > > -- > Thomas Thrainer | Software Engineer | [email protected] | > > Google Germany GmbH > Dienerstr. 12 > 80331 München > > Registergericht und -nummer: Hamburg, HRB 86891 > Sitz der Gesellschaft: Hamburg > Geschäftsführer: Graham Law, Christine Elizabeth Flores >
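To make the opportunistic-locking part discussed above a bit more concrete, this is roughly what the proposal boils down to from the user's point of view. Only a sketch: the two new options are the ones named in the document, but the rest of the command line is abbreviated and authentication options are left out.

# Attempt three opportunistic (non-blocking) creations on the destination
# cluster, sleeping roughly 30 seconds between attempts (randomized, as
# suggested above), and only then fall back to a normal, blocking creation.
move-instance \
  --opportunistic-tries=3 \
  --opportunistic-try-delay=30 \
  source-cluster destination-cluster instance1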
