On Tue, Feb 4, 2014 at 6:05 PM, Hrvoje Ribicic <[email protected]> wrote:
> This patch adds a design document exploring zeroing and lock reduction
> as options for improving the performance and parallelism of
> cross-cluster instance moves.
>
> Signed-off-by: Hrvoje Ribicic <[email protected]>
> ---
> doc/design-move-instance-improvements.rst | 182 +++++++++++++++++++++++++++++
> 1 file changed, 182 insertions(+)
> create mode 100644 doc/design-move-instance-improvements.rst
>
> diff --git a/doc/design-move-instance-improvements.rst b/doc/design-move-instance-improvements.rst
> new file mode 100644
> index 0000000..22b4bf5
> --- /dev/null
> +++ b/doc/design-move-instance-improvements.rst
> @@ -0,0 +1,182 @@
> +========================================
> +Cross-cluster instance move improvements
> +========================================
> +
> +.. contents:: :depth: 3
> +
> +To move instances across clusters, Ganeti provides the move-instance
> +tool. It uses the RAPI to create new instances in the destination
> +cluster, ready to import data from instances in the source cluster.
> +
> +The tool works correctly and reliably but, depending on bandwidth and
> +priority, an instance disk of considerable size requires a long time
> +to transfer. This is inconvenient at best, and can be remedied either
> +by reducing the duration of the transfers, or by allowing more
> +operations to run in parallel with instance moves.
> +
> +The former can be achieved by zeroing the empty space on instance
> +disks and compressing them prior to transfer, the latter by reducing
> +the amount of locking performed during an instance move. As the two
> +approaches tackle different aspects of the problem, they do not
> +exclude each other and will be presented independently.
> +
> +Zeroing instance disks
> +======================
> +
> +Support for disk compression during instance moves was partially
> +present before, but it was cleaned up and explicitly exposed as the
> +--compress option only as of Ganeti 2.10. While compression lowers the
> +amount of data sent, further gains can be achieved by taking advantage
> +of the structure of the disk - namely, by sending only used disk
> +sectors.
> +
> +There is no direct way to achieve this, as it would require the
> +move-instance tool to be aware of the structure of the file system.
> +Mounting the filesystem is not an option, primarily due to security
> +issues. A disk primed to take advantage of a disk driver exploit could
> +allow an attacker to breach instance isolation and gain control of a
> +Ganeti node.
> +
> +An indirect way to achieve this performance gain is to zero out the
> +empty hard disk space. Sequences of zeroes can be compressed and thus
> +transferred very efficiently, all without the host knowing that they
> +represent empty space. This approach is dangerous if applied to a
> +sparse disk, which would balloon to its full size as it is zeroed. As
> +Ganeti does not seem to make special concessions for moving sparse
> +disks, the only difference should be the disk space utilization on the
> +current node.
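
To make the rationale above concrete - this is not part of the patch,
just a quick sanity check in Python with an arbitrary sample size -
zeroed regions compress to a tiny fraction of their size, while
leftover high-entropy data barely shrinks at all:

    import os
    import zlib

    # 4 MiB sample standing in for a stretch of instance disk content.
    SIZE = 4 * 1024 * 1024

    samples = [
        ("zeroed", b"\0" * SIZE),      # freshly zeroed free space
        ("random", os.urandom(SIZE)),  # worst case: high-entropy leftovers
    ]

    for name, data in samples:
        compressed = zlib.compress(data)
        print("%-6s %8d -> %8d bytes (%6.2f%%)"
              % (name, len(data), len(compressed),
                 100.0 * len(compressed) / len(data)))

The zeroed sample typically ends up well below 1% of its original size,
which is the entire gain this proposal is after.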
> +
> +Zeroing approaches
> +++++++++++++++++++
> +
> +Zeroing is a feasible approach, but the node cannot perform it, as it
> +cannot mount the disk. Only virtualization-based options remain, and
> +of those, using Ganeti's own virtualization capabilities makes the
> +most sense. There are two ways of doing this - creating a new helper
> +instance, temporary or persistent, or reusing the target instance.
> +
> +Both approaches have their disadvantages. Creating a new helper
> +instance requires managing its lifecycle, taking special care that no
> +helper instance is left over due to a failed operation. Even if this
> +were taken care of, disks are not yet separate entities in Ganeti,
> +making the temporary transfer of disks between instances hard to
> +implement and even harder to make robust. Reusing the target instance
> +can be done by modifying the OS running on it to perform the zeroing
> +itself when notified via the new instance communication mechanism, but
> +this approach is neither generic nor particularly safe. There is no
> +guarantee that the zeroing operation will not interfere with the
> +normal operation of the instance, nor that it will complete if a
> +user-initiated shutdown occurs.
> +
> +A better solution can be found by combining the two approaches -
> +reusing the virtualized environment, but with a specifically crafted
> +OS image. With the instance shut down, as it should be in preparation
> +for the move, it can be extended with an additional disk carrying the
> +OS image. By prepending the disk and changing some instance
> +parameters, the instance can boot from it. The OS can be configured to
> +perform the zeroing on startup: it attempts to mount any partitions
> +with a filesystem present, and creates and then deletes a zero-filled
> +file on each of them. After the zeroing is complete, the OS should
> +shut down, and the master should note the shutdown and restore the
> +instance to its previous state.
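
To make the expected image behaviour concrete, here is a rough sketch
of the kind of startup script such an image could run. Everything in it
is illustrative rather than prescriptive: the virtio device names, the
assumption that the prepended zeroing image shows up as /dev/vda, and
the bare mount/umount calls would all need hardening in a real image.

    #!/usr/bin/env python
    # Illustrative sketch of the zeroing pass run on boot; filesystem
    # detection and error handling are deliberately simplified.
    import glob
    import os
    import subprocess

    MOUNTPOINT = "/mnt/zeroing"  # scratch mount point in the helper OS

    def zero_free_space(mountpoint):
        """Fill the mounted filesystem with zeroes, then delete the file."""
        path = os.path.join(mountpoint, "zero-fill")
        chunk = b"\0" * (4 * 1024 * 1024)
        try:
            with open(path, "wb") as zero_file:
                while True:  # the write fails with ENOSPC once full
                    zero_file.write(chunk)
        except IOError:
            pass
        os.unlink(path)

    def main():
        if not os.path.isdir(MOUNTPOINT):
            os.makedirs(MOUNTPOINT)
        # /dev/vda is skipped: by construction it is the prepended
        # zeroing image itself, not one of the instance's own disks.
        for dev in sorted(glob.glob("/dev/vd[b-z]*")):
            if subprocess.call(["mount", dev, MOUNTPOINT]) != 0:
                continue  # no mountable filesystem on this device
            zero_free_space(MOUNTPOINT)
            subprocess.call(["umount", MOUNTPOINT])
        subprocess.call(["poweroff"])  # the shutdown the host waits for

    if __name__ == "__main__":
        main()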
> +
> +Note that the requirements above are very similar to the notion of a
> +helper VM suggested in the OS install document. Some potentially
> +unsafe actions are performed within a virtualized environment, acting
> +on disks that belong or will belong to the instance. The mechanisms
> +used will thus be developed with both approaches in mind.
> +
> +Implementation
> +++++++++++++++
> +
> +There are two components to this solution - the Ganeti changes needed
> +to boot the OS, and the OS image used for the zeroing. Due to the
> +variety of filesystems and architectures that instances can use, no
> +single ready-to-run disk image can satisfy the needs of all Ganeti
> +users. Instead, the instance-debootstrap scripts can be used to
> +generate a zeroing-capable OS image. This might not be ideal, as there
> +are lightweight distributions that take up less space and boot more
> +quickly, but generating those with the right set of drivers for the
> +virtualization platform of choice is not easy. Thus we do not provide
> +a script for this purpose; instead, the user is free to provide any OS
> +image which performs the necessary steps: zero out all
> +virtualization-provided devices on startup, then shut down
> +immediately. The cluster-wide parameter controlling the image to be
> +used would be called zeroing-image.
> +
> +The modifications needed to the Ganeti code are minor. The zeroing
> +functionality should be implemented as an extension of the instance
> +export, and exposed as the --zero-free-space option. Prior to
> +beginning the export, the instance configuration is temporarily
> +extended with a new read-only disk of sufficient size to host the
> +zeroing image, and with the changes necessary for the image to be used
> +as the boot drive. The temporary nature of the configuration changes
> +requires that they are not propagated to other nodes. While this would
> +normally not be feasible with an instance using a disk template
> +offering multi-node redundancy, experiments with the code have shown
> +that the restriction on mixing disk templates can be bypassed to
> +temporarily allow a disk using the plain template to host the zeroing
> +image. The image is dumped to the disk, and the instance is started
> +up.
> +
> +Once the instance is started up, the zeroing will proceed until
> +completion, at which point a self-initiated shutdown will occur. The
> +instance-shutdown detection capabilities of 2.11 should prevent the
> +watcher from restarting the instance once this happens, allowing the
> +host to take it as a sign that the zeroing was completed. In any case,
> +the host waits until the instance shuts down or until a user-defined
> +timeout is reached, at which point the instance is forcibly shut down.

This timeout should depend on the size of the disks of the instance.
Zeroing 300GB can take some time, and such instances could happily
exist next to 10GB ones...

> +
> +Better progress monitoring can be implemented with the instance-host
> +communication channel proposed by the OS install design document. The
> +first version will most likely use only the shutdown detection, and
> +will be improved to account for the available communication channel at
> +a later time.
> +
> +After the shutdown, the temporary disk is destroyed and the instance
> +configuration is reverted to its original state. The same is done if
> +any error is encountered during the zeroing process. If the zeroing is
> +interrupted while the zero-filled file is being written, there is
> +little that can be done to recover. One precautionary measure is to
> +place the file in the /tmp directory on Unix systems, if one exists
> +and can be identified as such. Even if TmpFS is mounted there, it is
> +the most likely location to be cleaned up in case of failure.

If TmpFS is mounted there, it would hide the zero-file from the user,
making it harder to recover manually from such a problem. Also, if the
filesystem is not the root filesystem of the guest but is usually
mounted under e.g. /home, there wouldn't be a /tmp directory... Anyway,
both approaches have advantages and disadvantages, so I would
personally go for the easier one.

Another note: the OS image could/should also zero all swap partitions
completely in order to save some more space.

Something I'm missing in this part of the design is a discussion of
compression methods (maybe with a lot of zeros something really fast
can be used) and/or a (semi-)automated way of figuring out whether
zeroing+compression is faster than just sending the whole data. I agree
that this is a bit out of scope for now, but the user should at least
have the option to enable or disable zeroing. For future work,
move-instance could get a rough measurement of the throughput between
the clusters and could then decide, based on the size of the instance
disks and some heuristics, whether zeroing makes sense.

Another thing missing is a discussion of encryption algorithms. The
method used to encrypt the data sent from one cluster to the other can
be configured and plays quite a big role throughput-wise. We could give
users the choice of another (possibly weaker) encryption method if they
want more speed, and/or review the choice we've made.

> +
> +Lock reduction
> +==============
> +
> +An instance move as executed by the move-instance tool consists of
> +several preparatory RAPI calls, leading up to two long-lasting
> +opcodes: OpCreateInstance and OpBackupExport. While OpBackupExport
> +locks only the instance, the locks of OpCreateInstance require more
> +attention.
> +
> +When executed, this opcode attempts to lock all nodes on which the
> +instance may be created, and to obtain shared locks on the groups they
> +belong to. In the case that an IAllocator is used, this means all
> +nodes must be locked. Any operation that requires a node lock can
> +delay the move operation, and there is no shortage of these.
> +
> +The concept of opportunistic locking has been introduced to remedy
> +exactly this situation, allowing the IAllocator to grab as many node
> +locks as possible. Depending on how many nodes were available, the
> +operation either proceeds as expected, or fails noting that it is
> +temporarily infeasible. The failure case is unacceptable for the
> +move-instance tool, which is expected to fail only if the move is
> +impossible. To reap the benefits of opportunistic locking yet satisfy
> +this constraint, the move-instance tool can be extended with the
> +--opportunistic-tries and --opportunistic-try-delay options. A number
> +of opportunistic instance creations are attempted, with a delay
> +between attempts. Should they all fail, a normal, blocking instance
> +creation is requested.
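
The retry behaviour described above is simple enough to sketch. In the
following, TemporarilyInfeasibleError and the create_fn callable are
illustrative stand-ins for the actual RAPI call and its failure mode,
not existing Ganeti names:

    import time

    class TemporarilyInfeasibleError(Exception):
        """Stand-in for the error an opportunistic creation returns
        when too few node locks could be acquired."""

    def create_with_retries(create_fn, tries, delay):
        """Try opportunistic creations first, then fall back.

        create_fn is assumed to submit OpCreateInstance with or without
        opportunistic locking, depending on the flag it is given.
        """
        for _ in range(tries):
            try:
                return create_fn(opportunistic=True)
            except TemporarilyInfeasibleError:
                time.sleep(delay)
        # All opportunistic attempts failed; block on the full lock set
        # so the move only fails if it is truly impossible.
        return create_fn(opportunistic=False)

The blocking creation as a last resort preserves move-instance's
contract of failing only when the move is impossible.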
> +
> +While it may seem excessive to grab so many node locks, the early
> +release mechanism is used to make the situation less dire, releasing
> +all nodes that were not chosen as candidates for allocation. This is
> +taken to the extreme, as all the locks acquired are released prior to
> +the start of the transfer, barring the newly-acquired lock over the
> +new instance. This works because all operations that alter the node
> +in a way which could affect the transfer:
> +
> +* are prevented by the instance lock or instance presence, e.g.
> +  gnt-node remove, gnt-node evacuate,
> +
> +* do not interrupt the transfer, e.g. a PV on the node can be set as
> +  unallocatable, and the transfer still proceeds as expected,
> +
> +* do not care, e.g. a gnt-node powercycle explicitly ignores all
> +  locks.
> +
> +This is an invariant to be kept in mind for future development, but
> +at the current time, no additional locks are needed.
> +
> +Introduction of changes
> +=======================
> +
> +Both the instance zeroing and the lock reduction will be implemented
> +as part of Ganeti 2.12, in the way described in the previous chapters.
> +They will be implemented as separate changes: first the lock
> +reduction, and then the instance zeroing, as its implementation
> +overlaps with and benefits from the changes needed for the OS
> +installation improvements.
> --
> 1.7.10.4
>

Would it make sense to share this design doc with the SREs as well? I
know that climent@ filed the bug about instance moves, but he's not
working on it any more, so ganeti-sre@ or ganeti-team@ might be
appropriate.

Cheers,
Thomas

--
Thomas Thrainer | Software Engineer | [email protected] | Google Germany GmbH
Dienerstr. 12
80331 München

Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Graham Law, Christine Elizabeth Flores
