On Fri, Aug 10, 2018 at 8:29 AM Sage Weil <[email protected]> wrote:
> On Fri, 10 Aug 2018, Paweł Sadowski wrote:
> > On 08/09/2018 04:39 PM, Alex Elder wrote:
> > > On 08/09/2018 08:15 AM, Sage Weil wrote:
> > >> On Thu, 9 Aug 2018, Piotr Dałek wrote:
> > >>> Hello,
> > >>>
> > >>> At OVH we're heavily utilizing snapshots for our backup system. We think
> > >>> there's an interesting optimization opportunity regarding snapshots I'd like
> > >>> to discuss here.
> > >>>
> > >>> The idea is to introduce a concept of "lightweight" snapshots - such a
> > >>> snapshot would not contain data, only the information about what has
> > >>> changed on the image since it was created (so basically only the object map
> > >>> part of a snapshot).
> > >>>
> > >>> Our backup solution (which seems to be a pretty common practice) is as
> > >>> follows:
> > >>>
> > >>> 1. Create a snapshot of the image we want to back up.
> > >>> 2. If there's a previous backup snapshot, export the diff and apply it on
> > >>>    the backup image.
> > >>> 3. If there's no older snapshot, just do a full backup of the image.
> > >>>
> > >>> This introduces one big issue: it enforces a COW snapshot on the image,
> > >>> meaning that the original image's access latencies and consumed space
> > >>> increase. "Lightweight" snapshots would remove these inefficiencies - no
> > >>> COW performance or storage overhead.
> > >>
> > >> The snapshot in 1 would be lightweight, you mean? And you'd do the backup
> > >> some (short) time later based on a diff with changed extents?
> > >>
> > >> I'm pretty sure this will export a garbage image. I mean, it will usually
> > >> be non-garbage, but the result won't be crash consistent, and in some
> > >> (many?) cases won't be usable.
> > >>
> > >> Consider:
> > >>
> > >> - take reference snapshot
> > >> - back up this image (assume for now it is perfect)
> > >> - write A to location 1
> > >> - take lightweight snapshot
> > >> - write B to location 1
> > >> - backup process copies location 1 (B) to target
> >
> > The way I (we) see it working is a bit different:
> > - take snapshot (1)
> > - a data write might occur; that's ok - CoW kicks in here to preserve data
> > - export data
> > - convert snapshot (1) to a lightweight one (don't create a new one):
> >   * from now on, just remember which blocks have been modified instead
> >     of doing CoW
> >   * you can get rid of the previously CoW'd data blocks (they've been
> >     exported already)
> > - more writes
> > - take snapshot (2)
> > - export diff - only blocks modified since snap (1)
> > - convert snapshot (2) to a lightweight one
> > - ...
> >
> > That way I don't see a place for data corruption. Of course this has
> > some drawbacks - you can't rollback/export data from such a lightweight
> > snapshot anymore. But on the other hand we are reducing the need for CoW -
> > and that's the main goal of this idea. Instead of doing CoW ~all the
> > time, it's needed only for the time of exporting the image/modified blocks.
>
> Ok, so this is a bit different. I'm still a bit fuzzy on how the
> 'lightweight' (1) snapshot will be implemented, but basically I think
> you just mean saving on its storage overhead, while keeping enough metadata
> to make a fully consistent (2) for the purposes of the backup.
>
> Maybe Jason has a better idea for how this would work in practice? I
> haven't thought about RBD snapshots in a while (not above the rados
> layer at least).
>
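[For reference, the snapshot/export-diff backup cycle described earlier in the thread can be sketched as follows. This is a minimal sketch, not OVH's actual tooling; the pool/image/snapshot names are hypothetical, and the `backup()` helper assumes a reachable cluster with the `rbd` CLI on PATH.]

```python
import subprocess

def snap_create_cmd(image, snap):
    """Build the 'rbd snap create' command for step 1 of the backup loop."""
    return ["rbd", "snap", "create", f"{image}@{snap}"]

def export_cmd(image, snap, from_snap=None):
    """Build the export command: an incremental 'rbd export-diff' when a
    previous backup snapshot exists (step 2), or a full 'rbd export'
    otherwise (step 3). '-' sends the stream to stdout for piping."""
    if from_snap:
        return ["rbd", "export-diff", "--from-snap", from_snap,
                f"{image}@{snap}", "-"]
    return ["rbd", "export", f"{image}@{snap}", "-"]

def backup(image, snap, from_snap=None, out_path="backup.bin"):
    """Run one backup cycle against a live cluster (requires ceph.conf
    and a keyring; untested here without a cluster)."""
    subprocess.run(snap_create_cmd(image, snap), check=True)
    with open(out_path, "wb") as out:
        subprocess.run(export_cmd(image, snap, from_snap),
                       check=True, stdout=out)
```

The diff stream produced by step 2 is what `rbd import-diff` consumes on the backup side.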
The 'fast-diff' object map already tracks updated objects since a snapshot
was taken, so I think such an approach would just require deleting the
RADOS self-managed snapshot when converting to "lightweight" mode, and then
using the existing "--whole-object" option of "rbd export-diff" to compute
deltas from the 'fast-diff' object map instead of relying on RADOS snap
diffs. If you don't mind getting your hands dirty writing a little Python
code to invoke "remove_self_managed_snap" with the snap id reported by "rbd
snap ls", you should be able to test it out now. If it were incorporated
into RBD core, it would need some sanity checks to ensure it relies on
'fast-diff' when handling a lightweight snapshot.

However, I would also be interested to know whether BlueStore alleviates a
lot of your latency concerns, given that it attempts to redirect-on-write
by updating metadata instead of copying data.

> > >> That's the wrong data. Maybe that change is harmless, but maybe location
> > >> 1 belongs to the filesystem journal, and you have some records that now
> > >> reference location 10 that has an A-era value, or hasn't been written at
> > >> all yet, and now your filesystem journal won't replay and you can't
> > >> mount...
> > >
> > > Forgive me if I'm misunderstanding; this just caught my attention.
> > >
> > > The goal here seems to be to reduce the storage needed to do backups of an
> > > RBD image, and I think there's something to that.
> >
> > Storage reduction is only a side effect here. We want to get rid of CoW as
> > much as possible. As an example: we take a snapshot every 24h, which means
> > that every 24h we start doing CoW from scratch on every image. This has a
> > big impact on cluster latency.
> >
> > As for the storage need, with a 24h backup period we see space usage
> > increase by about 5% on our clusters. But this clearly depends on client
> > traffic.
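[Jason's suggestion above could be prototyped roughly as below. This is a sketch under assumptions: it presumes the python-rados `Ioctx.remove_self_managed_snap()` binding he mentions is available, that the "id" field of `rbd snap ls --format json` is the RADOS snap id, and it uses hypothetical pool/image/snapshot names. Only the JSON-parsing helper is exercised without a cluster.]

```python
import json
import subprocess

def snap_id_from_listing(listing_json, snap_name):
    """Pick the snap id for a named RBD snapshot out of the output of
    'rbd snap ls <image> --format json'."""
    for entry in json.loads(listing_json):
        if entry["name"] == snap_name:
            return entry["id"]
    raise KeyError(snap_name)

def drop_self_managed_snap(pool, image, snap_name,
                           conf="/etc/ceph/ceph.conf"):
    """Delete the RADOS self-managed snapshot backing an RBD snapshot,
    i.e. the 'convert to lightweight' step: the RBD-level snapshot record
    and 'fast-diff' object map remain, but CoW data preservation stops."""
    import rados  # python-rados; only needed when talking to a live cluster
    listing = subprocess.check_output(
        ["rbd", "snap", "ls", f"{pool}/{image}", "--format", "json"])
    snap_id = snap_id_from_listing(listing, snap_name)
    cluster = rados.Rados(conffile=conf)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)
        try:
            ioctx.remove_self_managed_snap(snap_id)
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

After this, `rbd export-diff --whole-object --from-snap <snap> ...` would compute the delta from the 'fast-diff' object map rather than from RADOS snap diffs.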
>
> One thing to keep in mind here is that the CoW/clone overhead goes *way*
> down with BlueStore. On FileStore we are literally blocking to make
> a copy of each 4MB object. With BlueStore there is a bit of metadata
> overhead for the tracking, but it does CoW at the lowest layer.
>
> Lightweight snapshots might be a big win for FileStore, but that advantage
> will mostly evaporate once you repave the OSDs.
>
> sage
>
> > > This seems to be no different from any other incremental backup scheme.
> > > It's layered, and it's ultimately based on an "epoch" complete backup
> > > image (what you call the reference snapshot).
> > >
> > > If you're using that model, it would be useful to be able to back up only
> > > the data present in a second snapshot that's the child of the reference
> > > snapshot. (And so on, with snapshot 2 building on snapshot 1, etc.)
> > > RBD internally *knows* this information, but I'm not sure how (or whether)
> > > it's formally exposed.
> > >
> > > Restoring an image in this scheme requires restoring the epoch, then the
> > > incrementals, in order. The cost to restore is higher, but the cost
> > > of incremental backups is significantly smaller than doing full ones.
> >
> > It depends on how we store the exported data. We might just want to merge
> > all diffs into the base image right after export to keep only a single
> > copy. But that is out of scope of the main topic here, IMHO.
> >
> > > I'm not sure how the "lightweight" snapshot would work, though. Without
> > > references to objects there's no guarantee the data taken at the time of
> > > the snapshot still exists when you want to back it up.
> > >
> > > -Alex
> > >
> > >>
> > >> sage
> > >>
> > >>> At first glance, it seems like it could be implemented as an extension
> > >>> to the current RBD snapshot system, leaving out the machinery required
> > >>> for copy-on-write. In theory it could even co-exist with regular
> > >>> snapshots.
> > >>> Removal of these "lightweight" snapshots would be instant (or near
> > >>> instant).
> > >>>
> > >>> So what do others think about this?
> > >>>
> > >>> --
> > >>> Piotr Dałek
> > >>> [email protected]
> > >>> https://www.ovhcloud.com
> > >>> --
> > >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > >>> the body of a message to [email protected]
> > >>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> > >>>
> > >
> >
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
--
Jason
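[As a footnote to Alex's restore point above: restoring in this scheme means importing the epoch (full) export first and then applying each incremental diff in order. A minimal sketch, with hypothetical image and file names:]

```python
import subprocess

def restore_cmds(dest_image, epoch_path, diff_paths):
    """Build the command sequence for a restore: 'rbd import' the full
    ('epoch') export first, then 'rbd import-diff' each incremental in
    the order it was taken. diff_paths must already be in backup order."""
    cmds = [["rbd", "import", epoch_path, dest_image]]
    for diff in diff_paths:
        cmds.append(["rbd", "import-diff", diff, dest_image])
    return cmds

def restore(dest_image, epoch_path, diff_paths):
    """Execute the restore against a live cluster (requires ceph.conf
    and a keyring; untested here without a cluster)."""
    for cmd in restore_cmds(dest_image, epoch_path, diff_paths):
        subprocess.run(cmd, check=True)
```

Paweł's alternative of keeping a single copy maps to `rbd merge-diff <first> <second> <merged>`, which collapses two adjacent diffs so only the latest merged state needs to be retained.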
