On Wed, Aug 11, 2010 at 3:52 AM, Adam Kocoloski <[email protected]> wrote:
> Excellent, thanks for testing. I caught Jason Smith saying on IRC that he
> had packaged the whole thing up as an escript + some .beams. If we can get
> it down to a single file a la rebar that would be a pretty sweet way to
> deliver the repair tool in my opinion.
>

+1

>
> Adam
>
> On Aug 10, 2010, at 10:40 PM, Mikeal Rogers wrote:
>
> > Ok, latest code has been tested against every db that I have and it works
> > great.
> >
> > What are our next steps here?
> >
> > I'd like to get this out to all the people who didn't feel comfortable
> > sending me their db to test against before we release it more widely.
> >
> > -Mikeal
> >
> > On Tue, Aug 10, 2010 at 6:11 PM, Mikeal Rogers <[email protected]> wrote:
> >
> >> Found one issue, we weren't picking up design docs because it didn't
> >> have admin privileges.
> >>
> >> Adam fixed it and pushed and I've verified that it works now.
> >>
> >> I wrote a little node script to show all recovered documents and expose
> >> any documents that didn't make it in to lost+found.
> >>
> >> http://github.com/mikeal/couchtest/blob/master/validate.js
> >>
> >> Requires request, `npm install request`.
> >>
> >> I'm now running recover on all the test db's I have and running the
> >> validation script against them.
> >>
> >> -Mikeal
> >>
> >>
> >> On Tue, Aug 10, 2010 at 1:34 PM, Mikeal Rogers <[email protected]> wrote:
> >>
> >>> I have some timing numbers for the new code.
> >>>
> >>> multi_conflict has 200 lost documents and 201 documents total after
> >>> recovery.
> >>> 1> timer:tc(couch_db_repair, make_lost_and_found, ["multi_conflict"]).
> >>> {25217069,ok}
> >>> 25 seconds
> >>>
> >>> Something funky is going on here. Investigating.
> >>> 1> timer:tc(couch_db_repair, make_lost_and_found,
> >>> ["multi_conflict_with_attach"]).
> >>> {654782,ok}
> >>> .6 seconds
> >>>
> >>> This db has 124969 documents in it.
> >>> 1> timer:tc(couch_db_repair, make_lost_and_found, ["testwritesdb"]).
> >>> {1381969304,ok}
> >>> 23 minutes
> >>>
> >>> This database is about 500 MB and had 46660 documents before recovery
> >>> and 46801 after.
> >>> 1> timer:tc(couch_db_repair, make_lost_and_found, ["prod"]).
> >>> {2329669113,ok}
> >>> 38.8 minutes
> >>>
> >>> -Mikeal
> >>>
> >>> On Tue, Aug 10, 2010 at 12:06 PM, Adam Kocoloski <[email protected]> wrote:
> >>>
> >>>> Good idea. Now we've got
> >>>>
> >>>>> [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 1048576 bytes at 1380102
> >>>>> [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 1048576 bytes at 331526
> >>>>> [info] [<0.33.0>] couch_db_repair for testwritesdb - scanning 331526 bytes at 0
> >>>>> [info] [<0.33.0>] couch_db_repair writing 12 updates to lost+found/testwritesdb
> >>>>> [info] [<0.33.0>] couch_db_repair writing 9 updates to lost+found/testwritesdb
> >>>>> [info] [<0.33.0>] couch_db_repair writing 8 updates to lost+found/testwritesdb
> >>>>
> >>>> Adam
> >>>>
> >>>> On Aug 10, 2010, at 2:29 PM, Robert Newson wrote:
> >>>>
> >>>>> It took 20 minutes before the first 'update' line came out, but now
> >>>>> seems to be recovering smoothly. Machine load is back down to sane
> >>>>> levels.
> >>>>>
> >>>>> Suggest feedback during the hunting phase.
> >>>>>
> >>>>> B.
> >>>>>
> >>>>> On Tue, Aug 10, 2010 at 7:11 PM, Adam Kocoloski <[email protected]> wrote:
> >>>>>> Thanks for the crosscheck. I'm not aware of anything in the node
> >>>>>> finder that would cause it to struggle mightily with healthy DBs.
> >>>>>> It pretty much ignores the health of the DB, in fact. Would be
> >>>>>> interested to hear more.
> >>>>>>
> >>>>>> On Aug 10, 2010, at 1:59 PM, Robert Newson wrote:
> >>>>>>
> >>>>>>> I verified the new code's ability to repair the testwritesdb. System
> >>>>>>> load was smooth from start to finish.
> >>>>>>>
> >>>>>>> I started a further test on a different (healthy) database and system
> >>>>>>> load was severe again, just collecting the roots (the lost+found db
> >>>>>>> was not yet created when I aborted the attempt). I suspect the fact
> >>>>>>> that it's healthy is the issue, so if I'm right, perhaps a warning is
> >>>>>>> useful.
> >>>>>>>
> >>>>>>> B.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Aug 10, 2010 at 6:53 PM, Adam Kocoloski <[email protected]> wrote:
> >>>>>>>> Another update. This morning I took a different tack and, rather
> >>>>>>>> than try to find root nodes, I just looked for all kv_nodes in the
> >>>>>>>> file and treated each of those as a separate virtual DB to be
> >>>>>>>> replicated. This reduces the algorithmic complexity of the repair,
> >>>>>>>> and it looks like testwritesdb repairs in ~30 minutes or so. Also,
> >>>>>>>> this method results in the lost+found DB containing every document,
> >>>>>>>> not just the missing ones.
> >>>>>>>>
> >>>>>>>> My branch does not currently include Randall's parallelization of
> >>>>>>>> the replications. It's still CPU-limited, so that may be a
> >>>>>>>> worthwhile optimization. On the other hand, I think we may be
> >>>>>>>> reaching a stage at which performance for this repair tool is 'good
> >>>>>>>> enough', and pmaps can make error handling a bit dicey.
> >>>>>>>>
> >>>>>>>> In short, I think this tool is now in good shape.
> >>>>>>>>
> >>>>>>>> http://github.com/kocolosk/couchdb/tree/db_repair
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>
> >>
> >

--
Filipe David Manana,
[email protected], [email protected]

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
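
For anyone reading the "scanning N bytes at OFFSET" log lines quoted in this thread: the three offsets shown (1380102, 331526, 0) look like a walk backward through the file in 1048576-byte windows, ending with a short window at offset 0. Here is a minimal sketch of that arithmetic, assuming that reading of the log is right. It is not taken from the couch_db_repair source; `backwardChunks` and its parameters are hypothetical names, and the file size used below is made up so that the last three windows match the log.

```javascript
// Hypothetical helper, inferred from the quoted log output only.
// Lists the (offset, length) windows visited when walking a file from its
// end toward offset 0 in fixed-size steps. The final window is shorter when
// the file size is not a multiple of the chunk size.
function backwardChunks(fileSize, chunkSize = 1048576) {
  const chunks = [];
  let end = fileSize;
  while (end > 0) {
    const start = Math.max(0, end - chunkSize);
    chunks.push({ offset: start, length: end - start });
    end = start;
  }
  return chunks;
}

// Illustrative size only (1380102 + 1048576); the real testwritesdb file is
// presumably larger, with earlier windows not shown in the quoted log.
console.log(backwardChunks(2428678));
```

If the tool does work this way, every window except the one at offset 0 has the full chunk length, which is the pattern the log shows.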
