Loving the extensive testing Jon - good stuff.

> Basically, there are two meta reads -- once to get the list of servers
> involved, and once after the snapshot is taken to verify that all regions
> in the snapshot match up with the snapshots in meta at that point in time.
>
> I believe moves/balances while a snapshot is going will cause some rs's to
> potentially be missed, and that, plus splits, may make new regions
> appear in meta that do not exist in a just-taken snapshot and thus cause
> the snapshot verifier to reject the snapshot.
>

Yeah, that's the right intuition, as long as nothing has really changed in
the code, from what I remember :)

-------------------
Jesse Yates
@jesse_yates
jyates.github.com


On Fri, Dec 14, 2012 at 10:08 AM, Jonathan Hsieh <[email protected]> wrote:

>
> Jon.
>
> On Fri, Dec 14, 2012 at 9:37 AM, Ted Yu <[email protected]> wrote:
>
> > Thanks for the update, Jon.
> >
> > bq. if splits or balancing occur while snapshotting, the region moves
> > cause the final snapshot verification step to abort
> >
> > The split or balancing happened during the snapshot verification step,
> > right?
> >
> > On Fri, Dec 14, 2012 at 9:17 AM, Jonathan Hsieh <[email protected]> wrote:
> >
> > > Hey folks,
> > >
> > > I've been testing and finding bugs on a branch of online snapshots for
> > > the past few days. The good news is that taking an online snapshot
> > > seems to be fairly robust -- I've been taking online-snapshots as
> > > quickly as possible on a 5 node cluster being battered by a
> > > performance eval random write run.
> > >
> > > As expected we ran into some hiccups. In my last run of the
> > > PE/online-snapshotting, it looks like 88/100 snapshots succeeded. This
> > > is ok, some failures are actually expected (the first cut only claims
> > > better consistency than 'copytable' and 'only-on-a-sunny-day'
> > > semantics).
> > > From a quick viewing of what caused the failed cases, if splits or
> > > balancing occur while snapshotting, the region moves cause the final
> > > snapshot verification step to abort because we look for the new
> > > regions and don't know if we have all regions. We've also found some
> > > problems with splits of hfilelinks (HBASE-7339), and we've encountered
> > > occasional failed-hang clone attempts (HBASE-7352), and an occasional
> > > ZK related slow abort. As they are found and characterized, I've been
> > > filing them under HBASE-6055 (offline-snapshots) or HBASE-7290
> > > (online-snapshots).
> > >
> > > I'm going to switch from bug fixing mode back to patch polishing mode
> > > today to get some of this committed to the snapshot dev branch. Here's
> > > how I hope to deal with them moving forward.
> > >
> > > I'll be polishing the pieces I've been testing (there are about 5-7
> > > patches in-flight currently) and putting updated pieces up for review.
> > > There is non-trivial overhead maintaining this many patches "in the
> > > future". Since this is a dev-branch, I'm going to ask that reviews of
> > > these initial big dev-branch patches focus on understandability, and
> > > that your +1's let us punt to follow-on jiras and TODOs more
> > > frequently than if you were reviewing for trunk. The sooner we get the
> > > skeleton in, the easier collaboration becomes with other folks working
> > > on and testing the same branch. Ideally, getting the large pieces in
> > > would allow follow-ons to be easier to review and tackle. The promise
> > > here, of course, is that many of these follow-on jiras, bugs
> > > (deadlocks, hangs), and testing evidence will be blockers before
> > > merging offline snapshots to trunk and merging online snapshots to
> > > trunk.
> > >
> > > Sound good?
> > >
> > > We've initially had one snapshot branch (offline snapshots) but I'm
> > > proposing having two: the offline-snapshot branch and the
> > > online-snapshot branch. Jesse's been the master of the offline branch
> > > and pushing dev-branch patches to that branch (
> > > https://github.com/jyates/hbase/tree/snapshots). I'd like to soon
> > > begin pushing dev-branch *reviewed commits* for online-snapshots to
> > > another branch. For those following, here's an explanation of how I'm
> > > working.
> > >
> > > * The latest for-review patches will always be in review boards.
> > > * Branch committed portions (reviewed and +1'ed for the branch
> > > patches) for online snapshots will live here
> > > https://github.com/jmhsieh/hbase/tree/snapshots. My branch will
> > > periodically be force pushed to deal with rebases onto constantly
> > > updating trunk, and to include offline-branch committed patches.
> > > * The latest working and consolidated online-snapshot branch (commits
> > > correspond to HBASE jiras) will live at
> > > https://github.com/jmhsieh/hbase/tree/snapshots-work . This branch is
> > > subject to frequent forced pushes. It is a cleanup step done to prep
> > > patches for reviews, and to match what the eventual commit structure
> > > would look like. It also contains some patches that may be abandoned
> > > or reordered.
> > > * Rough incremental in-progress branches live here,
> > > https://github.com/jmhsieh/hbase/tree/snapshot-work-1213 (change 1213
> > > with the latest date to see where I am). These rough branches have
> > > many small commits that focus on functionality and need to be rebased
> > > to "sprinkle" edits into the appropriate JIRA-corresponding patches.
> > > These branches will rarely if ever be force pushed. These are what I
> > > do testing from, and probably are suitable for others to use for
> > > testing.
> > > I periodically merge this with the snapshots-work mostly as a proof
> > > that what I have for review is the same as what I've been testing.
> > >
> > > Jon.
> > >
> > > --
> > > // Jonathan Hsieh (shay)
> > > // Software Engineer, Cloudera
> > > // [email protected]
> > >
> >
>
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // [email protected]
>
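
For anyone following along, the two-meta-read check discussed in this thread can be sketched as a toy model. This is not HBase code and makes no claim about how the branch implements it: `make_meta_reader`, `take_and_verify`, and the region/server names are all hypothetical, and meta is modeled as a plain dict of region to server. It only illustrates why a split or a move between the two reads would make a verifier reject the snapshot.

```python
# Toy model of the two-meta-read snapshot check; all names are hypothetical
# and none of this is HBase API.

def make_meta_reader(*views):
    """Return a read_meta() that yields one {region: server} view per call."""
    it = iter(views)
    return lambda: next(it)


def take_and_verify(read_meta, snapshot_on_server):
    """Snapshot regions from a first meta read, then compare with a second.

    Returns (captured, mismatched); a non-empty mismatched list is the case
    where the verifier would reject the snapshot.
    """
    # First meta read: which regions exist and which servers host them.
    assignments = read_meta()
    captured = {
        region
        for region, server in assignments.items()
        if snapshot_on_server(server, region)
    }
    # Second meta read: regions in meta now must equal the regions captured.
    current = set(read_meta())
    mismatched = sorted(current ^ captured)
    return captured, mismatched


before = {"r1": "rs1", "r2": "rs2"}
after_split = {"r1": "rs1", "r2a": "rs2", "r2b": "rs2"}  # r2 split mid-snapshot
after_move = {"r1": "rs1", "r2": "rs3"}                  # r2 moved mid-snapshot

always = lambda server, region: True
# Models rs2 no longer hosting r2 by the time it is asked to snapshot it.
missed = lambda server, region: (server, region) != ("rs2", "r2")

quiet = take_and_verify(make_meta_reader(before, before), always)
split = take_and_verify(make_meta_reader(before, after_split), always)
moved = take_and_verify(make_meta_reader(before, after_move), missed)
print(quiet[1], split[1], moved[1])
```

In this model only the quiet run verifies cleanly. The split run reports the parent and both daughters as mismatched, and the move run reports the region that its old server never captured.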
