Time to abort the vote then? I'd like to get this fix into 1.0.1 if possible.
On 8 Aug 2010, at 02:28, Damien Katz wrote: > Thanks. > > Anyone up to create a repair tool for when this happens? It should be > possible to find the previous header, then find the most recent btree roots, > find the high seq and apply them to the header and commit. I'm thinking this > would be a one time server upgrade script. > > -Damien > > > On Aug 7, 2010, at 5:47 PM, Adam Kocoloski wrote: > >> Committed to trunk and 1.0.x. >> >> On Aug 7, 2010, at 8:33 PM, Randall Leeds wrote: >> >>> http://github.com/tilgovi/couchdb/tree/fixlostcommits >>> >>> Test and fix in separate commits at the end of that branch, based off >>> current trunk. >>> Would appreciate verification that the test is initially broken but >>> fixed by the patch. >>> >>> On Sat, Aug 7, 2010 at 17:16, Damien Katz <[email protected]> wrote: >>>> I reproduced this manually: >>>> >>>> Create document with id "x", ensure full commit (simply wait longer than 1 >>>> sec, say 2 secs). >>>> >>>> Attempt to create document "x" again, get conflict error. >>>> >>>> Wait at least 2 secs to ensure the delayed commit attempt happens. >>>> >>>> Now create document "y". >>>> >>>> Wait at least 2 secs because the delayed commit should happen >>>> >>>> Restart server. >>>> >>>> Document "y" is now missing. >>>> >>>> The last delayed commit isn't happening. From then on out, no docs updated >>>> with delayed commit with be available after a restart. >>>> >>>> -Damien >>>> >>>> On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote: >>>> >>>>> I believe it's a single delayed conflict write attempt and no successes >>>>> in that same interval. >>>>> >>>>> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote: >>>>> >>>>>> Looks like all that's necessary is a single delayed conflict write >>>>>> attempt, and all subsequent delayed commits won't be commit, the header >>>>>> never gets written. >>>>>> >>>>>> 1.0 loses data. This is ridiculously bad. >>>>>> >>>>>> We need a test to reproduce this and fix. >>>>>> >>>>>> -Damien >>>>>> >>>>>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote: >>>>>> >>>>>>> Good sleuthing guys, and my apologies for letting this through. >>>>>>> Randall, your patch in COUCHDB-794 was actually fine, it was my >>>>>>> reworking of it that caused this serious bug. >>>>>>> >>>>>>> With respect to that gist 513282, I think it would be better to return >>>>>>> Db#db{waiting_delayed_commit=nil} when the headers match instead of >>>>>>> moving the cancel_timer() command as you did. After all, we did >>>>>>> perform the check here -- it was just that nothing needed to be >>>>>>> committed. >>>>>>> >>>>>>> Adam >>>>>>> >>>>>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote: >>>>>>> >>>>>>>> Yes, I think it requires 2 conflicting writes in row, because it needs >>>>>>>> to trigger the delayed_commit timer without actually having anything >>>>>>>> to commit, so the header never changes. >>>>>>>> >>>>>>>> Try to reproduce this and add a test case. >>>>>>>> >>>>>>>> -Damien >>>>>>>> >>>>>>>> >>>>>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote: >>>>>>>> >>>>>>>>> I think you may be right, Damien. >>>>>>>>> If ever a write happens that only contains conflicts while waiting for >>>>>>>>> a delayed commit message we might still be cancelling the timer. Is >>>>>>>>> this what you're thinking? This would be the fix: >>>>>>>>> http://gist.github.com/513282 >>>>>>>>> >>>>>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <[email protected]> wrote: >>>>>>>>>> I think the problem might be that 2 conflicting write attempts in >>>>>>>>>> row can leave the #db.waiting_delayed_commit set but the timer has >>>>>>>>>> been cancelled. One that happens, the header may never be written, >>>>>>>>>> as it always thinks a delayed commit will fire soon. >>>>>>>>>> >>>>>>>>>> -Damien >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote: >>>>>>>>>> >>>>>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds >>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>> I agree completely! I immediately thought of this because I wrote >>>>>>>>>>>> that >>>>>>>>>>>> change. I spent a while staring at it last night but still can't >>>>>>>>>>>> imagine how it's a problem. >>>>>>>>>>>> >>>>>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <[email protected]> >>>>>>>>>>>> wrote: >>>>>>>>>>>>> SVN commit r954043 looks suspicious. Digging further. >>>>>>>>>>>>> >>>>>>>>>>>>> -Damien >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I still want to stare at r954043, but it looks to me like there's at >>>>>>>>>>> least one situation where we do not commit data correctly during >>>>>>>>>>> compaction. This has to do with the way we now use the path to sync >>>>>>>>>>> outside the couch_file:process. Check this diff: >>>>>>>>>>> http://gist.github.com/513081 >>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>>> >> >
