Time to abort the vote then?

I'd like to get this fix into 1.0.1 if possible.

On 8 Aug 2010, at 02:28, Damien Katz wrote:

> Thanks.
> 
> Anyone up to create a repair tool for when this happens? It should be 
> possible to find the previous header, then find the most recent btree roots, 
> find the high seq and apply them to the header and commit. I'm thinking this 
> would be a one time server upgrade script.
> 
> -Damien
> 
> 
> On Aug 7, 2010, at 5:47 PM, Adam Kocoloski wrote:
> 
>> Committed to trunk and 1.0.x.
>> 
>> On Aug 7, 2010, at 8:33 PM, Randall Leeds wrote:
>> 
>>> http://github.com/tilgovi/couchdb/tree/fixlostcommits
>>> 
>>> Test and fix in separate commits at the end of that branch, based off
>>> current trunk.
>>> Would appreciate verification that the test is initially broken but
>>> fixed by the patch.
>>> 
>>> On Sat, Aug 7, 2010 at 17:16, Damien Katz <[email protected]> wrote:
>>>> I reproduced this manually:
>>>> 
>>>> Create document with id "x", ensure full commit (simply wait longer than 1 
>>>> sec, say 2 secs).
>>>> 
>>>> Attempt to create document "x" again, get conflict error.
>>>> 
>>>> Wait at least 2 secs to ensure the delayed commit attempt happens.
>>>> 
>>>> Now create document "y".
>>>> 
>>>> Wait at least 2 secs because the delayed commit should happen
>>>> 
>>>> Restart server.
>>>> 
>>>> Document "y" is now missing.
>>>> 
>>>> The last delayed commit isn't happening. From then on out, no docs updated 
>>>> with delayed commit with be available after a restart.
>>>> 
>>>> -Damien
>>>> 
>>>> On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:
>>>> 
>>>>> I believe it's a single delayed conflict write attempt and no successes 
>>>>> in that same interval.
>>>>> 
>>>>> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
>>>>> 
>>>>>> Looks like all that's necessary is a single delayed conflict write 
>>>>>> attempt, and all subsequent delayed commits won't be commit, the header 
>>>>>> never gets written.
>>>>>> 
>>>>>> 1.0 loses data. This is ridiculously bad.
>>>>>> 
>>>>>> We need a test to reproduce this and fix.
>>>>>> 
>>>>>> -Damien
>>>>>> 
>>>>>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>>>>>> 
>>>>>>> Good sleuthing guys, and my apologies for letting this through.  
>>>>>>> Randall, your patch in COUCHDB-794 was actually fine, it was my 
>>>>>>> reworking of it that caused this serious bug.
>>>>>>> 
>>>>>>> With respect to that gist 513282, I think it would be better to return 
>>>>>>> Db#db{waiting_delayed_commit=nil} when the headers match instead of 
>>>>>>> moving the cancel_timer() command as you did.  After all, we did 
>>>>>>> perform the check here -- it was just that nothing needed to be 
>>>>>>> committed.
>>>>>>> 
>>>>>>> Adam
>>>>>>> 
>>>>>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>>>>>> 
>>>>>>>> Yes, I think it requires 2 conflicting writes in row, because it needs 
>>>>>>>> to trigger the delayed_commit timer without actually having anything 
>>>>>>>> to commit, so the header never changes.
>>>>>>>> 
>>>>>>>> Try to reproduce this and add a test case.
>>>>>>>> 
>>>>>>>> -Damien
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>>>>>> 
>>>>>>>>> I think you may be right, Damien.
>>>>>>>>> If ever a write happens that only contains conflicts while waiting for
>>>>>>>>> a delayed commit message we might still be cancelling the timer. Is
>>>>>>>>> this what you're thinking? This would be the fix:
>>>>>>>>> http://gist.github.com/513282
>>>>>>>>> 
>>>>>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <[email protected]> wrote:
>>>>>>>>>> I think the problem might be that 2 conflicting write attempts in 
>>>>>>>>>> row can leave the #db.waiting_delayed_commit set but the timer has 
>>>>>>>>>> been cancelled. One that happens, the header may never be written, 
>>>>>>>>>> as it always thinks a delayed commit will fire soon.
>>>>>>>>>> 
>>>>>>>>>> -Damien
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>>>>>> 
>>>>>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds 
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>> I agree completely! I immediately thought of this because I wrote 
>>>>>>>>>>>> that
>>>>>>>>>>>> change. I spent a while staring at it last night but still can't
>>>>>>>>>>>> imagine how it's a problem.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <[email protected]> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -Damien
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>>>>>>>> least one situation where we do not commit data correctly during
>>>>>>>>>>> compaction. This has to do with the way we now use the path to sync
>>>>>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>>>>>> http://gist.github.com/513081
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>> 
> 

Reply via email to