Slight correction: this was with delayed_commits=false. My framework does a PUT to ensure that on every test run.

B.
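(For illustration, here is one way such a PUT could be done from an Erlang shell via CouchDB's _config API; the host, port, and use of OTP's inets/httpc client are assumptions, not details taken from this thread:)

    %% Sketch only: force delayed_commits=false before a test run through the
    %% _config API. URL/port and the httpc client are assumptions.
    inets:start(),
    Url = "http://127.0.0.1:5984/_config/couchdb/delayed_commits",
    {ok, {{_, 200, _}, _Hdrs, _OldValue}} =
        httpc:request(put, {Url, [], "application/json", "\"false\""}, [], []).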
On Tue, Aug 10, 2010 at 9:55 AM, Robert Newson <[email protected]> wrote:
> I ran the db_repair code on a healthy database produced with delayed_commits=true.
>
> The source db had 3218 docs. db_repair recovered 3120 and then returned with ok.
>
> I'm redoing that test, but this indicates we're not finding all roots.
>
> I note that the output file was 36 times the size of the input file, which is a consequence of folding all possible roots. I think that needs to be in the release notes for the repair tool if that behavior remains when it ships.
>
> B.
>
> On Tue, Aug 10, 2010 at 9:09 AM, Mikeal Rogers <[email protected]> wrote:
>> I think I found a bug in the current lost+found repair.
>>
>> I've been running it against the testwritesdb and it's in a state that is never finishing.
>>
>> It's still spitting out these lines:
>>
>> [info] [<0.32.0>] writing 1001 updates to lost+found/testwritesdb
>>
>> Most are 1001 but there are also other random values: 452, 866, etc.
>>
>> But the file size and dbinfo haven't budged in over 30 minutes. The size is stuck at 34300002, with the original db file being 54857478.
>>
>> This database only has one document in it that isn't "lost", so if it's finding *any* new docs it should be writing them.
>>
>> I also started another job to recover a production db that is quite large, 500 MB, with the missing data from a week or so back. This has been running for 2 hours and has still not output anything or created the lost+found db, so I can only assume that it is in the same state.
>>
>> Both machines are still churning at 100% CPU.
>>
>> -Mikeal
>>
>>
>> On Mon, Aug 9, 2010 at 11:26 PM, Adam Kocoloski <[email protected]> wrote:
>>
>>> With Randall's help we hooked the new node scanner up to the lost+found DB generator. It seems to work well enough for small DBs; for large DBs with lots of missing nodes the O(N^2) complexity of the problem catches up to the code and generating the lost+found DB takes quite some time. Mikeal is running tests tonight. The algorithm appears pretty CPU-limited, so a little parallelization may be warranted.
>>>
>>> http://github.com/kocolosk/couchdb/tree/db_repair
>>>
>>> Adam
>>>
>>> (I sent this previous update to myself instead of the list, so I'll forward it here ...)
>>>
>>> On Aug 10, 2010, at 12:01 AM, Adam Kocoloski wrote:
>>>
>>> > On Aug 9, 2010, at 10:10 PM, Adam Kocoloski wrote:
>>> >
>>> >> Right, make_lost_and_found still relies on code which reads through couch_file one byte at a time; that's the cause of the slowness. The newer scanner will improve that pretty dramatically, and we can tune it further by increasing the length of the pattern that we match when looking for kp/kv_node terms in the files, at the expense of some extra complexity dealing with the block prefixes (currently it does a 1-byte match, which as I understand it cannot be split across blocks).
>>> >
>>> > The scanner now looks for a 7-byte match, unless it is within 6 bytes of a block boundary, in which case it looks for the longest possible match at that position. The more specific match condition greatly reduces the number of calls to couch_file, and thus boosts the throughput. On my laptop it can scan the testwritesdb.couch from Mikeal's couchtest repo (52 MB) in 18 seconds.
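(To illustrate the chunked-scan idea described above: read the .couch file in large chunks and collect candidate offsets of a marker pattern, rather than issuing a couch_file call per byte. This is a rough sketch, not the actual couch_db_repair scanner; the module name is made up, and for simplicity it ignores matches that straddle a chunk or block boundary:)

    %% Rough sketch of a chunked forward scan; NOT the actual scanner in
    %% couch_db_repair. Matches straddling a chunk boundary are missed here.
    -module(scan_sketch).
    -export([scan/2]).

    -define(CHUNK_SIZE, 1024 * 1024).   % read 1 MB at a time, not 1 byte

    %% Return absolute file offsets at which Pattern occurs in the file.
    scan(Path, Pattern) when is_binary(Pattern) ->
        {ok, Fd} = file:open(Path, [read, raw, binary]),
        try scan_chunks(Fd, Pattern, 0, []) after file:close(Fd) end.

    scan_chunks(Fd, Pattern, Offset, Acc) ->
        case file:pread(Fd, Offset, ?CHUNK_SIZE) of
            {ok, Chunk} ->
                Hits = [Offset + Pos || {Pos, _Len} <- binary:matches(Chunk, Pattern)],
                scan_chunks(Fd, Pattern, Offset + byte_size(Chunk), Acc ++ Hits);
            eof ->
                Acc
        end.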
>>> >
>>> >> Regarding the file_corruption error on the larger file, I think this is something we will just naturally trigger when we take a guess that random positions in a file are actually the beginning of a term. I think our best recourse here is to return {error, file_corruption} from couch_file but leave the gen_server up and running instead of terminating it. That way the repair code can ignore the error and keep moving without having to reopen the file.
>>> >
>>> > I committed this change (to my db_repair branch) after consulting with Chris. The longer match condition makes these spurious file_corruption triggers much less likely, but I think it's still a good thing not to crash the server when they happen.
>>> >
>>> >> Next steps as I understand them - Randall is working on integrating the in-memory scanner into Volker's code that finds all the dangling by_id nodes. I'm working on making sure that the scanner identifies bt node candidates which span block prefixes, and on improving its pattern matching.
>>> >
>>> > Latest from my end:
>>> > http://github.com/kocolosk/couchdb/tree/db_repair
>>> >
>>> >>
>>> >> Adam
>>> >>
>>> >> On Aug 9, 2010, at 9:50 PM, Mikeal Rogers wrote:
>>> >>
>>> >>> I pulled down the latest code from Adam's branch @ 7080ff72baa329cf6c4be2a79e71a41f744ed93b.
>>> >>>
>>> >>> Running timer:tc(couch_db_repair, make_lost_and_found, ["multi_conflict"]). on a database with 200 lost updates spanning 200 restarts ( http://github.com/mikeal/couchtest/blob/master/multi_conflict.couch ) took about 101 seconds.
>>> >>>
>>> >>> I tried running against a larger database ( http://github.com/mikeal/couchtest/blob/master/testwritesdb.couch ) and I got this exception:
>>> >>>
>>> >>> http://gist.github.com/516491
>>> >>>
>>> >>> -Mikeal
>>> >>>
>>> >>>
>>> >>> On Mon, Aug 9, 2010 at 6:09 PM, Randall Leeds <[email protected]> wrote:
>>> >>>
>>> >>>> Summing up what went on in IRC for those who were absent.
>>> >>>>
>>> >>>> The latest progress is on Adam's branch at
>>> >>>> http://github.com/kocolosk/couchdb/tree/db_repair
>>> >>>>
>>> >>>> couch_db_repair:make_lost_and_found/1 attempts to create a new lost+found/DbName database into which it merges all nodes not accessible from anywhere (any other node found in a full file scan or any header pointers).
>>> >>>>
>>> >>>> Currently, make_lost_and_found uses Volker's repair (from the couch_db_repair_b module, also in Adam's branch). Adam found that the bottleneck was couch_file calls and that the repair process was taking a very long time, so he added couch_db_repair:find_nodes_quickly/1, which reads 1 MB chunks as binary and tries to process them to find nodes instead of scanning back one byte at a time. It is currently not hooked up to the repair mechanism.
>>> >>>>
>>> >>>> Making progress. Go team.
>>> >>>>
>>> >>>> On Mon, Aug 9, 2010 at 13:52, Mikeal Rogers <[email protected]> wrote:
>>> >>>>> jchris suggested on IRC that I try a normal doc update and see if that fixes it.
>>> >>>>>
>>> >>>>> It does. After a new doc was created the dbinfo doc count was back to normal.
>>> >>>>>
>>> >>>>> -Mikeal
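(A hedged sketch of the couch_file behavior change Adam describes further up: when a guessed offset turns out not to be a valid term, reply {error, file_corruption} to the caller and keep the gen_server alive, rather than stopping it and forcing repair to reopen the file. The module name, state, and on-disk term layout below are illustrative assumptions, not the real couch_file code:)

    %% Illustrative only; not the actual couch_file module. The point is the
    %% reply in handle_call: errors are returned, the server is never stopped.
    -module(file_reader_sketch).
    -behaviour(gen_server).
    -export([start_link/1, pread_term/2]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
             terminate/2, code_change/3]).

    start_link(Path) -> gen_server:start_link(?MODULE, Path, []).

    %% Try to read an Erlang term at Pos; returns {ok, Term} or
    %% {error, file_corruption} without crashing the server.
    pread_term(Pid, Pos) -> gen_server:call(Pid, {pread_term, Pos}).

    init(Path) ->
        {ok, Fd} = file:open(Path, [read, raw, binary]),
        {ok, Fd}.

    handle_call({pread_term, Pos}, _From, Fd) ->
        %% Assumed layout for this sketch: a 32-bit length prefix followed by
        %% a term_to_binary payload. Any mismatch is treated as corruption.
        Reply =
            case file:pread(Fd, Pos, 4) of
                {ok, <<Len:32>>} ->
                    case file:pread(Fd, Pos + 4, Len) of
                        {ok, Bin} when byte_size(Bin) =:= Len ->
                            try {ok, binary_to_term(Bin)}
                            catch error:badarg -> {error, file_corruption}
                            end;
                        _ -> {error, file_corruption}
                    end;
                _ -> {error, file_corruption}
            end,
        %% Crucially, never {stop, file_corruption, ...} here: repair code can
        %% ignore the error and keep scanning without reopening the file.
        {reply, Reply, Fd}.

    handle_cast(_Msg, Fd) -> {noreply, Fd}.
    handle_info(_Msg, Fd) -> {noreply, Fd}.
    terminate(_Reason, Fd) -> file:close(Fd), ok.
    code_change(_OldVsn, Fd, _Extra) -> {ok, Fd}.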
>>> >>>>>
>>> >>>>> On Mon, Aug 9, 2010 at 1:39 PM, Mikeal Rogers <[email protected]> wrote:
>>> >>>>>
>>> >>>>>> Ok, I pulled down this code and tested against a database with a ton of missing writes right before a single restart.
>>> >>>>>>
>>> >>>>>> Before the restart this was the database:
>>> >>>>>>
>>> >>>>>> {
>>> >>>>>>   db_name: "testwritesdb"
>>> >>>>>>   doc_count: 124969
>>> >>>>>>   doc_del_count: 0
>>> >>>>>>   update_seq: 124969
>>> >>>>>>   purge_seq: 0
>>> >>>>>>   compact_running: false
>>> >>>>>>   disk_size: 54857478
>>> >>>>>>   instance_start_time: "1281384140058211"
>>> >>>>>>   disk_format_version: 5
>>> >>>>>> }
>>> >>>>>>
>>> >>>>>> After the restart it was this:
>>> >>>>>>
>>> >>>>>> {
>>> >>>>>>   db_name: "testwritesdb"
>>> >>>>>>   doc_count: 1
>>> >>>>>>   doc_del_count: 0
>>> >>>>>>   update_seq: 1
>>> >>>>>>   purge_seq: 0
>>> >>>>>>   compact_running: false
>>> >>>>>>   disk_size: 54857478
>>> >>>>>>   instance_start_time: "1281384593876026"
>>> >>>>>>   disk_format_version: 5
>>> >>>>>> }
>>> >>>>>>
>>> >>>>>> After repair, it's this:
>>> >>>>>>
>>> >>>>>> {
>>> >>>>>>   db_name: "testwritesdb"
>>> >>>>>>   doc_count: 1
>>> >>>>>>   doc_del_count: 0
>>> >>>>>>   update_seq: 124969
>>> >>>>>>   purge_seq: 0
>>> >>>>>>   compact_running: false
>>> >>>>>>   disk_size: 54857820
>>> >>>>>>   instance_start_time: "1281385990193289"
>>> >>>>>>   disk_format_version: 5
>>> >>>>>>   committed_update_seq: 124969
>>> >>>>>> }
>>> >>>>>>
>>> >>>>>> All the sequences are there and hitting _all_docs shows all the documents, so why is the doc_count only 1 in the dbinfo?
>>> >>>>>>
>>> >>>>>> -Mikeal
>>> >>>>>>
>>> >>>>>> On Mon, Aug 9, 2010 at 11:53 AM, Filipe David Manana <[email protected]> wrote:
>>> >>>>>>
>>> >>>>>>> For the record (and people not on IRC), the code at:
>>> >>>>>>>
>>> >>>>>>> http://github.com/fdmanana/couchdb/commits/db_repair
>>> >>>>>>>
>>> >>>>>>> is working for at least simple cases. Use couch_db_repair:repair(DbNameAsString).
>>> >>>>>>> There's one TODO: update the reduce values for the by_seq and by_id BTrees.
>>> >>>>>>>
>>> >>>>>>> If anyone wants to give some help on this, you're welcome.
>>> >>>>>>>
>>> >>>>>>> On Mon, Aug 9, 2010 at 6:12 PM, Mikeal Rogers <[email protected]> wrote:
>>> >>>>>>>
>>> >>>>>>>> I'm starting to create a bunch of test db files that expose this bug under different conditions like multiple restarts, across compaction, variances in updates that might cause conflicts, etc.
>>> >>>>>>>>
>>> >>>>>>>> http://github.com/mikeal/couchtest
>>> >>>>>>>>
>>> >>>>>>>> The README outlines what was done to the dbs and what needs to be recovered.
>>> >>>>>>>>
>>> >>>>>>>> -Mikeal
>>> >>>>>>>>
>>> >>>>>>>> On Mon, Aug 9, 2010 at 9:33 AM, Filipe David Manana <[email protected]> wrote:
>>> >>>>>>>>
>>> >>>>>>>>> On Mon, Aug 9, 2010 at 5:22 PM, Robert Newson <[email protected]> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>>> Doesn't this bit:
>>> >>>>>>>>>>
>>> >>>>>>>>>> - Db#db{waiting_delayed_commit=nil};
>>> >>>>>>>>>> + Db;
>>> >>>>>>>>>> + % Db#db{waiting_delayed_commit=nil};
>>> >>>>>>>>>>
>>> >>>>>>>>>> revert the bug fix?
>>> >>>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> That's intentional, for my local testing.
>>> >>>>>>>>> That patch obviously isn't anything close to final; it's still too experimental.
>>> >>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>> B.
>>> >>>>>>>>>>
>>> >>>>>>>>>> On Mon, Aug 9, 2010 at 5:09 PM, Jan Lehnardt <[email protected]> wrote:
>>> >>>>>>>>>>> Hi All,
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Filipe jumped in to start working on the recovery tool, but he isn't done yet.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Here's the current patch:
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> http://www.friendpaste.com/4uMngrym4r7Zz4R0ThSHbz
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> It is not done and very early, but any help on this is greatly appreciated.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> The current state is (in Filipe's words):
>>> >>>>>>>>>>> - I can detect that a file needs repair
>>> >>>>>>>>>>> - and get the last btree roots from it
>>> >>>>>>>>>>> - "only" missing: get last db seq num
>>> >>>>>>>>>>> - write new header
>>> >>>>>>>>>>> - and deal with the local docs btree (if it exists)
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Thanks!
>>> >>>>>>>>>>> Jan
>>> >>>>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> --
>>> >>>>>>>>> Filipe David Manana,
>>> >>>>>>>>> [email protected]
>>> >>>>>>>>>
>>> >>>>>>>>> "Reasonable men adapt themselves to the world.
>>> >>>>>>>>> Unreasonable men adapt the world to themselves.
>>> >>>>>>>>> That's why all progress depends on unreasonable men."
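(For reference, the repair entry points mentioned in this thread are invoked from a CouchDB node's Erlang shell roughly as below; the database name is a placeholder, and timer:tc is just an optional timing wrapper that reports the elapsed time in microseconds:)

    %% Entry points referenced in this thread (database name is a placeholder).
    couch_db_repair:repair("testwritesdb").                            % Filipe's branch
    couch_db_repair:make_lost_and_found("testwritesdb").               % Adam's branch
    timer:tc(couch_db_repair, make_lost_and_found, ["testwritesdb"]).  % timed run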
