Hey folks,

I've been testing and finding bugs on a branch of online snapshots for the past few days. The good news is that taking an online snapshot seems to be fairly robust -- I've been taking online snapshots as quickly as possible on a 5-node cluster that is being battered by a performance eval (PE) random write run.

As expected, we ran into some hiccups. In my last PE/online-snapshotting run, it looks like 88/100 snapshots succeeded. This is ok; some failures are actually expected (the first cut only claims better consistency than 'copytable' and 'only-on-a-sunny-day' semantics).

From a quick look at what caused the failures: if splits or balancing occur while a snapshot is being taken, the region moves cause the final snapshot verification step to abort, because we look for the new regions and don't know if we have all regions. We've also found some problems with splits of hfilelinks (HBASE-7339), and we've encountered occasional failed or hung clone attempts (HBASE-7352) and an occasional ZK-related slow abort. As they are found and characterized, I've been filing them under HBASE-6055 (offline-snapshots) or HBASE-7290 (online-snapshots).

I'm going to switch from bug fixing mode back to patch polishing mode today to get some of this committed to the snapshot dev branch. Here's how I hope to deal with these issues moving forward. I'll be polishing the pieces I've been testing (there are about 5-7 patches in flight currently) and putting updated pieces up for review. There is non-trivial overhead in maintaining this many patches "in the future".

Since this is a dev branch, I'm going to ask that reviews of these initial big dev-branch patches focus on understandability, and that your +1's let us punt to follow-on jiras and TODOs more frequently than if you were reviewing for trunk. The sooner we get the skeleton in, the easier it will be to collaborate with other folks working on and testing the same branch. Ideally, getting the large pieces in would make follow-ons easier to review and tackle. The promise here, of course, is that many of these follow-on jiras, bugs (deadlocks, hangs), and testing evidence will be blockers before merging offline snapshots to trunk and merging online snapshots to trunk. Sound good?

We initially had one snapshot branch (offline snapshots), but I'm proposing having two: the offline-snapshot branch and the online-snapshot branch. Jesse's been the master of the offline branch and has been pushing dev-branch patches to that branch (https://github.com/jyates/hbase/tree/snapshots). I'd like to soon begin pushing dev-branch *reviewed commits* for online snapshots to another branch.

For those following along, here's an explanation of how I'm working.

* The latest patches for review will always be on review board.

* Branch-committed portions (patches reviewed and +1'ed for the branch) for online snapshots will live at https://github.com/jmhsieh/hbase/tree/snapshots. My branch will periodically be force pushed to deal with rebases onto the constantly updating trunk, and to include offline-branch committed patches.

* The latest working and consolidated online-snapshot branch (commits correspond to HBASE jiras) will live at https://github.com/jmhsieh/hbase/tree/snapshots-work. This branch is subject to frequent forced pushes. It is a cleanup step done to prep patches for review and to match what the eventual commit structure would look like. It also contains some patches that may be abandoned or reordered.

* Rough incremental in-progress branches live at https://github.com/jmhsieh/hbase/tree/snapshot-work-1213 (replace 1213 with the latest date to see where I am). These rough branches have many small commits that focus on functionality and need to be rebased to "sprinkle" edits into the appropriate JIRA-corresponding patches. These branches will rarely, if ever, be force pushed. They are what I test from, and are probably suitable for others to use for testing. I periodically merge this with snapshots-work, mostly as proof that what I have up for review is the same as what I've been testing.

Jon.

--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [email protected]
