Also, an abrupt shutdown with delayed_commits=true might orphan some data anyway, so the warning on startup might encourage more people to disable it.
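
For illustration only, here is a rough sketch of the seek-back warning suggested in the quoted message below. It is written in Python rather than Erlang and is not CouchDB's code. It assumes headers sit on 4096-byte block boundaries and that a header block starts with the byte 0x01; the function names and the 1 MiB threshold are hypothetical, and a real check would also need to validate the header contents rather than trust a single byte.

```python
import io

BLOCK_SIZE = 4096         # assumption: headers start on 4 KiB block boundaries
HEADER_MARKER = b"\x01"   # assumption: a header block begins with byte 0x01
WARN_BYTES = 1024 * 1024  # hypothetical threshold; tune to taste


def find_last_header(f, block_size=BLOCK_SIZE):
    """Scan backwards one block at a time.

    Returns (offset_of_last_header_block, file_size); the offset is None
    when no block starts with the header marker.
    """
    f.seek(0, io.SEEK_END)
    size = f.tell()
    if size == 0:
        return None, 0
    block = (size - 1) // block_size  # index of the last (possibly partial) block
    while block >= 0:
        f.seek(block * block_size)
        if f.read(1) == HEADER_MARKER:
            return block * block_size, size
        block -= 1
    return None, size


def seek_back_warning(f, warn_bytes=WARN_BYTES):
    """Return a warning string if the last header is far from EOF, else None."""
    offset, size = find_last_header(f)
    if offset is None:
        return "no header found"
    distance = size - offset  # bytes between the header block start and EOF
    if distance > warn_bytes:
        return (f"last header is {distance} bytes before end of file; "
                "data after it may be orphaned")
    return None
```

Since this is only a heuristic, the odd false positive should be harmless: running the recovery tool on a clean file is expected to amount to a no-op.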

On Fri, Aug 13, 2010 at 3:49 PM, Robert Newson <[email protected]> wrote:
> A user (#herman) on IRC today reported slow startups for couchdb. I
> speculated that he'd hit the data loss bug and that couchdb was
> scanning backwards for a header. This turned out to be the case.
> Interestingly this was verified with a strace call, watching the read
> calls use earlier and earlier offsets.
>
> Should we consider a tweak to the tool, or couchdb itself, to report a
> warning if we have to seek back very far to find a header? Obviously
> it would be a heuristic but there would be no real downside to the odd
> false positive since the recovery tool and subsequent replication will
> amount to a no-op.
>
> fyi: Reading the couchdb database (90G) with 'dd' took 22 minutes, but
> couchdb's backward scanning took 3 hours.
>
> B.
>
> On Fri, Aug 13, 2010 at 3:05 PM, J Chris Anderson <[email protected]> wrote:
>>
>> On Aug 12, 2010, at 11:38 PM, Mikeal Rogers wrote:
>>
>>> I tested the latest code in recover-couchdb and it looks great.
>>
>> We need to package this so that it is useable by end-users, and put a link
>> to it on http://couchdb.apache.org/notice/1.0.1.html
>>
>> I'm the last guy who knows what that would mean... anyone? I think we should
>> do this today.
>>
>> Do we need to do anything formal and time consuming before linking to the
>> recovery tool / process from that page?
>>
>> Also, someone needs to write up the how-to instructions, along with a
>> description of what to expect.
>>
>> Chris
>>
>>>
>>> -Mikeal
>>>
>>> On Thu, Aug 12, 2010 at 2:33 PM, J Chris Anderson <[email protected]> wrote:
>>>
>>>>
>>>> On Aug 12, 2010, at 2:15 PM, J Chris Anderson wrote:
>>>>
>>>>>
>>>>> On Aug 12, 2010, at 12:36 PM, Adam Kocoloski wrote:
>>>>>
>>>>>> Right, and jchris' db_repair branch includes my patches for DB reader
>>>>>> _admin access and a more useful progress report in the replication
>>>>>> phase of the repair.
>>>>>>
>>>>>
>>>>> I've updated the repair branch with everyone's code. I think it is
>>>>> faster, due to Adam's idea that if we run the merges in reverse order,
>>>>> those near the front of the file are more likely to be no-ops, so less
>>>>> work is done overall.
>>>>>
>>>>> Mikeal will be testing for correctness. Could others please use it and
>>>>> test for usability as well? Latest code (with instructions) is here:
>>>>>
>>>>> http://github.com/jhs/recover-couchdb/
>>>>>
>>>>> Which points at http://github.com/jchris/couchdb/tree/db_repair for the
>>>>> repair code.
>>>>>
>>>>> One thing I am not clear about (need better docs) is, do we need to
>>>>> replicate the original db to the lost+found db (or vice-versa), after
>>>>> recovery is complete?
>>>>>
>>>>
>>>> Also, we should be clear about what the semantics for this are. It can
>>>> potentially introduce conflicts if some writes were repeated after
>>>> restarts.
>>>> Should it always be a noop on dbs that are clean w/r/t the bug?
>>>>
>>>> Chris
>>>>
>>>>> Chris
>>>>>
>>>>>> Adam
>>>>>>
>>>>>> On Aug 12, 2010, at 3:14 PM, Jason Smith wrote:
>>>>>>
>>>>>>> The code is updated with the following changes:
>>>>>>> 1. Adhere to the lost+found/databasename custom...
>>>>>>> 2. ...except databases starting with _, which go into
>>>>>>> _system/databasename
>>>>>>> 3. Sync up with jchris's db_repair branch
>>>>>>>
>>>>>>> (About #2, I started with _/database but I think it's too easy to miss
>>>>>>> at the command line.)
>>>>>>>
>>>>>>> On Fri, Aug 13, 2010 at 00:52, J Chris Anderson <[email protected]> wrote:
>>>>>>>
>>>>>>>> A few bug reports from my testing:
>>>>>>>>
>>>>>>>> I launched with this command, as specified in the README:
>>>>>>>>
>>>>>>>> find ~/code/couchdb/tmp/lib -type f -name '*.couch' -exec ./recover_couchdb {} \;
>>>>>>>>
>>>>>>>>
>>>>>>>> First of all, it chokes on my _users and _replicator db:
>>>>>>>>
>>>>>>>> [info] [<0.2.0>] couch_db_repair for _users - scanning 335961 bytes at 0
>>>>>>>> [error] [<0.2.0>] couch_db_repair merge node at 332061 {case_clause,
>>>>>>>> {error,illegal_database_name}}
>>>>>>>>
>>>>>>>> That second [error] line is repeated many many times (once per merge I
>>>>>>>> think). I think the issue is that _users is hard-coded to be OK, but
>>>>>>>> _users_lost+found is not. So we should do something about that, maybe
>>>>>>>> if a db-name starts with _ we should call the lost and found
>>>>>>>> a_users_lost+found (_ sorts at the top, so "a" will be near it and
>>>>>>>> legal).
>>>>>>>>
>>>>>>>>
>>>>>>>> When a database has readers defined in the security object, the tool
>>>>>>>> is unable to open it (the reading part of the repair tool needs to
>>>>>>>> have the _admin userCtx, not just the writer).
>>>>>>>>
>>>>>>>> [debug] [<0.2.0>] Not a reader: UserCtx {user_ctx,null,[],undefined} vs
>>>>>>>> Names [<<"joe">>] Roles [<<"_admin">>]
>>>>>>>> escript: exception throw: {unauthorized,<<"You are not authorized to access
>>>>>>>> this db.">>}
>>>>>>>> in function couch_db:open/2
>>>>>>>> in call from couch_db_repair:make_lost_and_found/3
>>>>>>>> in call from recover_couchdb:main/1
>>>>>>>> in call from escript:run/2
>>>>>>>> in call from escript:start/1
>>>>>>>> in call from init:start_it/1
>>>>>>>> in call from init:start_em/1
>>>>>>>>
>>>>>>>>
>>>>>>>> It would also be helpful if the status lines could say something more
>>>>>>>> than
>>>>>>>>
>>>>>>>> [info] [<0.2.0>] couch_db_repair writing 15 updates to bench_lost+found
>>>>>>>>
>>>>>>>> Like maybe add a note like "about 23% complete" if at all possible.
>>>>>>>>
>>>>>>>>
>>>>>>>> I will patch the first few, I'd love help from someone on the last
>>>>>>>> one. I'll be on IRC.
>>>>>>>>
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Chris
>>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 12, 2010, at 10:18 AM, J Chris Anderson wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Aug 11, 2010, at 2:14 PM, Jason Smith wrote:
>>>>>>>>>
>>>>>>>>>> Hi, Jason.
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 12, 2010 at 04:14, Jason Smith <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 11, 2010 at 09:52, Adam Kocoloski <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Excellent, thanks for testing. I caught Jason Smith saying on IRC
>>>>>>>>>>>> that he had packaged the whole thing up as an escript + some
>>>>>>>>>>>> .beams. If we can get it down to a single file a la rebar that
>>>>>>>>>>>> would be a pretty sweet way to deliver the repair tool in my
>>>>>>>>>>>> opinion.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Please check out http://github.com/jhs/repair-couchdb
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I think you mean http://github.com/jhs/recover-couchdb
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think it is important that we package and release this, if it is
>>>>>>>>> ready. We should link to it from the bug description page, the
>>>>>>>>> project home page, as well as blog about it, etc. What is the point
>>>>>>>>> of working feverishly on a recovery tool if we don't go the last
>>>>>>>>> mile?
>>>>>>>>>
>>>>>>>>> I am testing it now on my database directory to make sure it doesn't
>>>>>>>>> harm anything (I was never subject to the bug, which is probably
>>>>>>>>> where most people are, but they might run it anyway.)
>>>>>>>>>
>>>>>>>>> As it stands the submodules thing can't be part of the release, we
>>>>>>>>> need to package it up as a single zip file or something.
>>>>>>>>>
>>>>>>>>> Is there anything else that needs to be done before we can release
>>>>>>>>> this?
>>>>>>>>>
>>>>>>>>> Chris
>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Jason Smith
>>>>>>>>>> Couchio Hosting
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jason Smith
>>>>>>> Couchio Hosting
>>>>>>
>>>>>
>>>>
>>
>
