On Tue, Oct 6, 2009 at 5:09 PM, Jeremy Orlow <jor...@chromium.org> wrote:
> On Tue, Oct 6, 2009 at 4:59 PM, John Abd-El-Malek <j...@chromium.org> wrote:
>> On Tue, Oct 6, 2009 at 4:30 PM, Carlos Pizano <c...@google.com> wrote:
>>> On Tue, Oct 6, 2009 at 4:14 PM, John Abd-El-Malek <j...@chromium.org> wrote:
>>>> I'm not sure how Carlos is doing it? Will we know if something is
>>>> corrupt just on load/save?
>>>
>>> Many sqlite calls can return SQLITE_CORRUPT, for example a query or an
>>> insert. We just check for error codes 1 to 26, with 5 or 6 of them being
>>> serious errors such as SQLITE_CORRUPT.
>>>
>>> I am sure that random bit flips in memory and on disk are the cause of
>>> some crashes; this is probably the limiting factor on how low the crash
>>> rate of a perfect program deployed on millions of computers can go.
>>
>> The point I was trying to make is that this limiting factor is
>> proportional to memory usage. Given our large memory consumption in the
>> browser process, the numbers from the paper imply dozens of corruptions
>> just in sqlite memory per user. Even if only a small fraction of these
>> are harmful, spread over millions of users that's a lot of corruption.
>
> For what it's worth: this makes sense to me. It seems like pulling SQLite
> into its own process would be helpful for the reasons you laid out. I
> wonder if the only reason no one else has chimed in on this thread is that
> no one wants to have to implement it. :-)

Chase is going to start investigating it (i.e. figure out what the cost of
doing it is, how much change it requires, and ways of measuring the benefit).

>>> But I am unsure how to calculate whether, for example, a random bit flip
>>> in the backing stores, which add up to at least 10MB on most machines,
>>> does not hurt, or one in the middle of a cache entry, or in the data
>>> part of some structure.
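For anyone unfamiliar with the error-code check Carlos describes: each SQLite C API call returns a numeric primary result code (SQLITE_CORRUPT is 11). The actual Chromium code is C++, but a rough sketch of the same idea using Python's standard sqlite3 module (an illustration, not the code Carlos added) might look like:

```python
import sqlite3

def is_corrupt(path):
    """Best-effort corruption probe for a SQLite database file.

    In Python's sqlite3 wrapper, non-OK result codes such as
    SQLITE_CORRUPT surface as DatabaseError exceptions rather
    than numeric codes; PRAGMA integrity_check does a deeper scan.
    """
    try:
        con = sqlite3.connect(path)
        # Returns a single row ("ok",) on a healthy database.
        row = con.execute("PRAGMA integrity_check").fetchone()
        con.close()
        return row[0] != "ok"
    except sqlite3.DatabaseError:
        # Covers "database disk image is malformed",
        # "file is not a database", etc.
        return True
```

The in-process code presumably checks codes on every call rather than running a full integrity scan, since `PRAGMA integrity_check` reads the whole file.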
>>>> I imagine there's no way we can know when corruption happens in
>>>> steady-state and the next query leads to some other browser memory
>>>> (or another database) getting corrupted?
>>>>
>>>> On Tue, Oct 6, 2009 at 3:58 PM, Huan Ren <hu...@google.com> wrote:
>>>>> It will be helpful to get our own measurement of database failures.
>>>>> Carlos just added something like that.
>>>>>
>>>>> Huan
>>>>>
>>>>> On Tue, Oct 6, 2009 at 3:49 PM, John Abd-El-Malek <j...@chromium.org> wrote:
>>>>>> Saw this on slashdot:
>>>>>> http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
>>>>>> The conclusion is "an average of 25,000–75,000 FIT (failures in time
>>>>>> per billion hours of operation) per Mbit".
>>>>>> On my machine the browser process is usually > 100MB, so that
>>>>>> averages out to 176 to 493 errors per year, with those numbers having
>>>>>> big variance depending on the machine. Most users don't have ECC,
>>>>>> which means this will lead to corruption. Sqlite is a heavy user of
>>>>>> memory, so even if it's 1/4 of the 100MB, that means we'll see an
>>>>>> average of 40-120 errors per year naturally because of faulty DIMMs.
>>>>>> Given that sqlite corruption means (repeated) crashing of the browser
>>>>>> process, this data heavily suggests we should separate the sqlite
>>>>>> code into a separate process. The IPC overhead is negligible compared
>>>>>> to disk access. My hunch is that the complexity is also not that
>>>>>> high, since the code that deals with it is already asynchronous,
>>>>>> because we don't use sqlite on the UI/IO threads.
>>>>>> What do others think?
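As a sanity check on the arithmetic above, here is a minimal back-of-envelope sketch. It assumes 100 MB ≈ 800 Mbit and 8,760 hours per year; the thread's 176-493 figure presumably used slightly different constants, but the result lands in the same ballpark:

```python
def errors_per_year(megabytes, fit_per_mbit, hours_per_year=24 * 365):
    """Expected memory errors per year for a process of the given size.

    FIT = failures per billion (1e9) device-hours, quoted per Mbit.
    Uses the decimal approximation 1 MB ~= 8 Mbit, which is close
    enough for an order-of-magnitude estimate.
    """
    mbits = megabytes * 8
    return mbits * fit_per_mbit / 1e9 * hours_per_year

low = errors_per_year(100, 25_000)    # lower bound from the paper
high = errors_per_year(100, 75_000)   # upper bound from the paper
print(f"roughly {low:.0f}-{high:.0f} errors/year for a 100MB process")
```

Scaling by 1/4 for sqlite's share of the 100MB gives the 40-120 range quoted in the mail.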
--
Chromium Developers mailing list: chromium-dev@googlegroups.com
View archives, change email options, or unsubscribe: http://groups.google.com/group/chromium-dev