Thanks for the update, Jon.

bq. if splits or balancing occurs while snapshotting, the region moves cause the final snapshot verification step to abort

The split or balancing happened during the snapshot verification step, right?

On Fri, Dec 14, 2012 at 9:17 AM, Jonathan Hsieh wrote:
> Hey folks,
>
> I've been testing and finding bugs on a branch of online snapshots for the past few days. The good news is that taking an online snapshot seems to be fairly robust: I've been taking online snapshots as quickly as possible on a 5-node cluster being battered by a PerformanceEvaluation (PE) random-write run.
>
> As expected, we ran into some hiccups. In my last run of PE with online snapshotting, it looks like 88/100 snapshots succeeded. This is OK; some failures are actually expected (the first cut only claims better consistency than 'copytable' and 'only-on-a-sunny-day' semantics). From a quick look at what caused the failed cases, if splits or balancing occur while snapshotting, the region moves cause the final snapshot verification step to abort, because we look for the new regions and don't know whether we have all regions. We've also found some problems with splits of HFileLinks (HBASE-7339), occasional failed or hung clone attempts (HBASE-7352), and an occasional ZK-related slow abort. As they are found and characterized, I've been filing them under HBASE-6055 (offline snapshots) or HBASE-7290 (online snapshots).
>
> I'm going to switch from bug-fixing mode back to patch-polishing mode today to get some of this committed to the snapshot dev branch. Here's how I hope to deal with them moving forward.
>
> I'll be polishing the pieces I've been testing (there are about 5-7 patches in flight currently) and putting updated pieces up for review. There is non-trivial overhead in maintaining this many patches "in the future".
> Since this is a dev branch, I'm going to ask that these initial big dev-branch reviews focus on understandability, and that your +1s let us punt to follow-on JIRAs and TODOs more often than if you were reviewing for trunk. The sooner we get the skeleton in, the easier collaboration will be with other folks working on and testing the same branch. Ideally, getting the large pieces in will make follow-ons easier to review and tackle. The promise here, of course, is that many of these follow-on JIRAs, bugs (deadlocks, hangs), and testing evidence will be blockers before merging offline snapshots to trunk and merging online snapshots to trunk.
>
> Sound good?
>
> We initially had one snapshot branch (offline snapshots), but I'm proposing having two: the offline-snapshot branch and the online-snapshot branch. Jesse's been the master of the offline branch and has been pushing dev-branch patches to that branch (https://github.com/jyates/hbase/tree/snapshots). I'd like to soon begin pushing dev-branch *reviewed commits* for online snapshots to another branch. For those following along, here's an explanation of how I'm working.
>
> * The latest patches for review will always be in Review Board.
> * Branch-committed portions (patches reviewed and +1'ed for the branch) for online snapshots will live at https://github.com/jmhsieh/hbase/tree/snapshots. My branch will periodically be force-pushed to deal with rebases onto the constantly updating trunk and to include patches committed to the offline branch.
> * The latest working and consolidated online-snapshot branch (commits correspond to HBASE JIRAs) will live at https://github.com/jmhsieh/hbase/tree/snapshots-work. This branch is subject to frequent force pushes. It is a cleanup step done to prep patches for review and to match what the eventual commit structure would look like. It also contains some patches that may be abandoned or reordered.
> * Rough incremental in-progress branches live at https://github.com/jmhsieh/hbase/tree/snapshot-work-1213 (replace 1213 with the latest date to see where I am). These rough branches have many small commits that focus on functionality and need to be rebased to "sprinkle" edits into the appropriate JIRA-corresponding patches. These branches will rarely, if ever, be force-pushed. These are what I do testing from, and they are probably suitable for others to use for testing. I periodically merge this with snapshots-work, mostly as proof that what I have up for review is the same as what I've been testing.
>
> Jon.
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
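
A toy model can make the quoted failure mode concrete. It is a hypothetical sketch, not HBase's snapshot code. The `Region`, `split`, and `verify_snapshot` names are invented. The model also assumes that verification simply checks whether every region the table currently has was captured by the snapshot. Under that assumption, a split after the snapshot starts replaces a parent region with daughters the snapshot has never seen. The naive check then cannot confirm completeness and aborts, even though the daughters cover the same key range. Balancer moves change which server hosts a region rather than the region set, so they are not modeled.

```python
# Hypothetical toy model (not HBase code): why a region split during an
# online snapshot can make a naive verification step abort.
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A region serving row keys in [start, end); an empty end means unbounded."""
    name: str
    start: str
    end: str


def split(parent: Region, split_key: str) -> tuple[Region, Region]:
    """Replace a parent region with two daughters covering the same key range."""
    left = Region(parent.name + "-a", parent.start, split_key)
    right = Region(parent.name + "-b", split_key, parent.end)
    return left, right


def verify_snapshot(snapshot_regions: set[Region],
                    current_regions: set[Region]) -> None:
    """Naive check (assumed): every region the table has now must be in the snapshot.

    Daughters created after the snapshot started are unknown to it, so the
    verifier cannot tell whether all data was captured and aborts.
    """
    unknown = current_regions - snapshot_regions
    if unknown:
        names = sorted(r.name for r in unknown)
        raise RuntimeError(f"snapshot verification aborted; unknown regions: {names}")


if __name__ == "__main__":
    r1 = Region("r1", "", "m")
    r2 = Region("r2", "m", "")
    snapshot = {r1, r2}                    # regions recorded when the snapshot started

    verify_snapshot(snapshot, {r1, r2})    # no split: passes silently

    d1, d2 = split(r2, "t")                # r2 splits while the snapshot is in flight
    try:
        verify_snapshot(snapshot, {r1, d1, d2})
    except RuntimeError as err:
        print(err)
```

Run as a script, the second call prints `snapshot verification aborted; unknown regions: ['r2-a', 'r2-b']`. The check is deliberately conservative. Refusing to pass when the region set has changed avoids silently accepting an incomplete snapshot, at the cost of spurious aborts like the ones described in the email.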
