Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Bowen Song via dev
From an operator's view, I think the most reliable indicator is not the 
total count of corruption events, but the frequency of the events. Let 
me try to explain that with some examples:


1. many corruption events in a short period of time, then nothing after that
   The disk is probably still healthy.
   The spike in corruption events could be the result of reading some
   bad blocks that haven't been accessed for a long time.
   A warning in the log is preferred.
2. sparse corruption events over many years, the total number is high
   The disk is probably still healthy.
   As long as the frequency does not have an obvious increasing trend,
   it should be fine.
   A warning in the log is preferred.
3. clusters of corruption events that started recently and continue to
   happen for days or weeks
   The disk is probably faulty.
   Unless the access pattern from the application side has changed,
   this is a fairly reliable indicator that the disk has failed or is
   about to.
   Initially, a warning in the log is preferred. If this persists for
   too long (a configurable number of days?), raise the severity level
   to error and, depending on the disk_failure_policy, possibly stop or
   kill the node.
4. many corruption events happening continuously
   The disk is probably faulty.
   Other than a faulty disk or damaged data (e.g. data getting
   overwritten by a rogue application, like a virus), nothing else
   could explain this situation.
   An error in the log is preferred and, depending on the
   disk_failure_policy, the node may be stopped or killed.

Internally, inside Cassandra, this could be implemented as a fixed 
number of scaling-sized time buckets, arranged in such a way that the 
event frequency over differently sized time windows can be calculated 
and compared to other recent time windows of the same size.
For example: 24 hourly buckets, 30 daily buckets and 24 monthly 
buckets would only need to store 78 integers, but would show the 
difference between the above 4 examples.
Externally, exposing those time buckets via MBeans should be 
sufficient; maybe an additional cumulative counter could be added too.
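
To make that concrete, here is a rough, untested sketch (the class and 
method names are made up, not existing Cassandra APIs) of how such 
lazily rotated buckets could look in Java:

    // Rough sketch only: class and method names are made up, not existing
    // Cassandra APIs. 24 hourly + 30 daily + 24 monthly buckets = 78 counters,
    // rotated lazily whenever a corruption event is recorded.
    public class CorruptionEventBuckets
    {
        private final long[] hourly  = new long[24];  // last 24 hours
        private final long[] daily   = new long[30];  // last 30 days
        private final long[] monthly = new long[24];  // last 24 (30-day) months
        private long lastHour = -1, lastDay = -1, lastMonth = -1;
        private long total = 0;                       // optional cumulative counter

        public synchronized void record(long nowMillis)
        {
            long hour  = nowMillis / 3_600_000L;
            long day   = hour / 24;
            long month = day / 30;
            rotate(hourly, lastHour, hour);
            rotate(daily, lastDay, day);
            rotate(monthly, lastMonth, month);
            lastHour = hour; lastDay = day; lastMonth = month;
            hourly[(int) (hour % 24)]++;
            daily[(int) (day % 30)]++;
            monthly[(int) (month % 24)]++;
            total++;
        }

        // Zero out the buckets skipped since the last event, so counts left over
        // from a previous trip around the ring are not mistaken for recent events.
        private static void rotate(long[] buckets, long last, long current)
        {
            if (last < 0 || current - last >= buckets.length)
            {
                java.util.Arrays.fill(buckets, 0);
                return;
            }
            for (long t = last + 1; t <= current; t++)
                buckets[(int) (t % buckets.length)] = 0;
        }

        // Snapshots, e.g. for exposing the buckets through an MBean.
        public synchronized long[] hourlySnapshot()  { return hourly.clone(); }
        public synchronized long[] dailySnapshot()   { return daily.clone(); }
        public synchronized long[] monthlySnapshot() { return monthly.clone(); }
        public synchronized long totalEvents()       { return total; }
    }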


Failing that, a cumulative counter exposed via MBeans is fine. As an 
operator, I can always deal with that in other tools, such as Prometheus.
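
For what it's worth, exposing such a cumulative counter over JMX only 
takes a few lines with the standard platform MBean server (a sketch; 
the ObjectName domain and type names are arbitrary, not Cassandra's 
real metric names, and the interface and class would normally live in 
their own files):

    import java.lang.management.ManagementFactory;
    import java.util.concurrent.atomic.AtomicLong;
    import javax.management.ObjectName;

    public interface CorruptionEventsMBean
    {
        long getTotalCorruptionEvents();
    }

    public class CorruptionEvents implements CorruptionEventsMBean
    {
        private final AtomicLong total = new AtomicLong();

        public void increment() { total.incrementAndGet(); }

        @Override
        public long getTotalCorruptionEvents() { return total.get(); }

        public static void main(String[] args) throws Exception
        {
            CorruptionEvents counter = new CorruptionEvents();
            ManagementFactory.getPlatformMBeanServer().registerMBean(
                counter, new ObjectName("com.example.cassandra:type=CorruptionEvents"));
            counter.increment(); // now visible via JMX, scrapeable by e.g. the Prometheus JMX exporter
        }
    }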


On 09/03/2023 20:57, Abe Ratnofsky wrote:
> there's a point at which a host limping along is better put down and
> replaced


I did a basic literature review and it looks like load (total 
program-erase cycles), disk age, and operating temperature all lead to 
BER (bit error rate) increases. We don't need to build a whole model of 
disk failure; we could probably get a lot of mileage out of a warn / 
failure threshold for the number of automatic corruption repairs.


Under this model, Cassandra could automatically repair X (3?) 
corruption events before warning a user ("time to replace this host"), 
and Y (10?) corruption events before forcing itself down.


But it would be good to get a better sense of user expectations here. 
Bowen - how would you want Cassandra to handle frequent disk 
corruption events?


--
Abe


On Mar 9, 2023, at 12:44 PM, Josh McKenzie  wrote:

> I'm not seeing any reasons why CEP-21 would make this more difficult
> to implement
I think I communicated poorly - I was just trying to point out that 
there's a point at which a host limping along is better put down and 
replaced than piecemeal flagging range after range dead and working 
around it, and there's no immediately obvious "Correct" answer to 
where that point is regardless of what mechanism we're using to hold 
a cluster-wide view of topology.



> ...CEP-21 makes this sequencing safe...
For sure - I wouldn't advocate for any kind of "automated corrupt 
data repair" in a pre-CEP-21 world.


On Thu, Mar 9, 2023, at 2:56 PM, Abe Ratnofsky wrote:
I'm not seeing any reasons why CEP-21 would make this more difficult 
to implement, besides the fact that it hasn't landed yet.


There are two major potential pitfalls that CEP-21 would help us avoid:
1. Bit-errors beget further bit-errors, so we ought to be resistant 
to a high frequency of corruption events
2. Avoid token ownership changes when attempting to stream a 
corrupted token


I found some data supporting (1) - 
https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2014/20140806_T1_Hetzler.pdf


If we detect bit-errors and store them in system_distributed, then 
we need a capacity to throttle that load and ensure that consistency 
is maintained.


When we attempt to rectify any bit-error by streaming data from 
peers, we implicitly take a lock on token ownership. A user needs to 
know that it is unsafe to change token ownership in a cluster that 
is currently in the process of repairing a corruption error on one 
of its instances' disks. CEP-21 makes this sequencing safe, and 
provides abstractions to better expose this information to operators.


--
Abe

On Mar 9, 2023, at 10:55 AM, Josh McKenzie  
wrote:


Personally, I'd like to see the fix for 

Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Jeremy Hanna
It was mainly to integrate with Hadoop - I used it from 0.6 to 1.2 in 
production prior to starting at DataStax, and at that time I was stitching 
together Cloudera's distribution of Hadoop with Cassandra.  Back then there 
were others that used it as well.  As far as I know, usage dropped off when the 
Spark Cassandra Connector got pretty mature.  It enabled people to take an 
off-the-shelf Hadoop distribution, run the Hadoop processes on the same nodes 
as or external to the Cassandra cluster, and get topology information to do 
things like Hadoop splits through the Hadoop interfaces.  I think the version 
lag is an indication that it hasn't been used recently.  Also, like others have 
said, the Spark Cassandra Connector is really what people should be 
using at this point imo.  That or, depending on the use case, Apple's bulk 
reader: https://github.com/jberragan/spark-cassandra-bulkreader which is 
mentioned on https://issues.apache.org/jira/browse/CASSANDRA-16222.

> On Mar 9, 2023, at 12:00 PM, Rahul Xavier Singh 
>  wrote:
> 
> What is the hadoop code for? For interacting from Hadoop via CQL, or Thrift 
> if it's that old, or directly looking at SSTables? Been using C* since 2 and 
> have never used it. 
> 
> Agree to deprecate in next possible 4.1.x version and remove in 5.0 
> 
> Rahul Singh
> Chief Executive Officer | Business Platform Architect
> m: 202.905.2818 e: rahul.si...@anant.us  li: 
> http://linkedin.com/in/xingh ca: http://calendly.com/xingh
> 
> We create, support, and manage real-time global data & analytics platforms 
> for the modern enterprise.
> 
> Anant | https://anant.us 
> 3 Washington Circle, Suite 301
> Washington, D.C. 20037
> 
> http://Cassandra.Link  : The best resources for 
> Apache Cassandra
> 
> 
> On Thu, Mar 9, 2023 at 12:53 PM Brandon Williams  > wrote:
>> I think if we reach consensus here that decides it. I too vote to
>> deprecate in 4.1.x.  This means we would remove it in 5.0.
>> 
>> Kind Regards,
>> Brandon
>> 
>> On Thu, Mar 9, 2023 at 11:32 AM Ekaterina Dimitrova
>> mailto:e.dimitr...@gmail.com>> wrote:
>> >
>> > Deprecation sounds good to me, but I am not completely sure in which 
>> > version we can do it. If it is possible to add a deprecation warning in 
>> > the 4.x series or at least 4.1.x - I vote for that.
>> >
>> > On Thu, 9 Mar 2023 at 12:14, Jacek Lewandowski 
>> > mailto:lewandowski.ja...@gmail.com>> wrote:
>> >>
>> >> Is it possible to deprecate it in the 4.1.x patch release? :)
>> >>
>> >>
>> >> - - -- --- -  -
>> >> Jacek Lewandowski
>> >>
>> >>
>> >> czw., 9 mar 2023 o 18:11 Brandon Williams > >> > napisał(a):
>> >>>
>> >>> This is my feeling too, but I think we should accomplish this by
>> >>> deprecating it first.  I don't expect anything will change after the
>> >>> deprecation period.
>> >>>
>> >>> Kind Regards,
>> >>> Brandon
>> >>>
>> >>> On Thu, Mar 9, 2023 at 11:09 AM Jacek Lewandowski
>> >>> mailto:lewandowski.ja...@gmail.com>> wrote:
>> >>> >
>> >>> > I vote for removing it entirely.
>> >>> >
>> >>> > thanks
>> >>> > - - -- --- -  -
>> >>> > Jacek Lewandowski
>> >>> >
>> >>> >
>> >>> > czw., 9 mar 2023 o 18:07 Miklosovic, Stefan 
>> >>> > mailto:stefan.mikloso...@netapp.com>> 
>> >>> > napisał(a):
>> >>> >>
>> >>> >> Derek,
>> >>> >>
>> >>> >> I have couple more points ... I do not think that extracting it to a 
>> >>> >> separate repository is "win". That code is on Hadoop 1.0.3. We would 
>> >>> >> be spending a lot of work on extracting it just to extract 10 years 
>> >>> >> old code with occasional updates (in my humble opinion just to make 
>> >>> >> it compilable again if the code around changes). What good is in 
>> >>> >> that? We would have one more place to take care of ... Now we at 
>> >>> >> least have it all in one place.
>> >>> >>
>> >>> >> I believe we have four options:
>> >>> >>
>> >>> >> 1) leave it there so it will be like this is for next years with 
>> >>> >> questionable and diminishing usage
>> >>> >> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
>> >>> >> 3) 2) and extract it to a separate repository but if we do 2) we can 
>> >>> >> just leave it there
>> >>> >> 4) remove it
>> >>> >>
>> >>> >> 
>> >>> >> From: Derek Chen-Becker > >>> >> >
>> >>> >> Sent: Thursday, March 9, 2023 15:55
>> >>> >> To: dev@cassandra.apache.org 
>> >>> >> Subject: Re: Role of Hadoop code in Cassandra 5.0
>> >>> >>
>> >>> >> NetApp Security WARNING: This is an external email. Do not click 
>> >>> >> links or open attachments unless you recognize the sender and know 
>> >>> >> the content is safe.
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> I think the question isn't "Who ... is still using that?" but more 
>> >>> >> "are we 

Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Rahul Xavier Singh
What is the Hadoop code for? For interacting from Hadoop via CQL, or Thrift
if it's that old, or directly looking at SSTables? Been using C* since 2.x
and have never used it.

Agree to deprecate in next possible 4.1.x version and remove in 5.0

Rahul Singh

Chief Executive Officer | Business Platform Architect m: 202.905.2818 e:
rahul.si...@anant.us li: http://linkedin.com/in/xingh ca:
http://calendly.com/xingh

*We create, support, and manage real-time global data & analytics platforms
for the modern enterprise.*

*Anant | https://anant.us *

3 Washington Circle, Suite 301

Washington, D.C. 20037

*http://Cassandra.Link * : The best resources for
Apache Cassandra


On Thu, Mar 9, 2023 at 12:53 PM Brandon Williams  wrote:

> I think if we reach consensus here that decides it. I too vote to
> deprecate in 4.1.x.  This means we would remove it in 5.0.
>
> Kind Regards,
> Brandon
>
> On Thu, Mar 9, 2023 at 11:32 AM Ekaterina Dimitrova
>  wrote:
> >
> > Deprecation sounds good to me, but I am not completely sure in which
> version we can do it. If it is possible to add a deprecation warning in the
> 4.x series or at least 4.1.x - I vote for that.
> >
> > On Thu, 9 Mar 2023 at 12:14, Jacek Lewandowski <
> lewandowski.ja...@gmail.com> wrote:
> >>
> >> Is it possible to deprecate it in the 4.1.x patch release? :)
> >>
> >>
> >> - - -- --- -  -
> >> Jacek Lewandowski
> >>
> >>
> >> czw., 9 mar 2023 o 18:11 Brandon Williams 
> napisał(a):
> >>>
> >>> This is my feeling too, but I think we should accomplish this by
> >>> deprecating it first.  I don't expect anything will change after the
> >>> deprecation period.
> >>>
> >>> Kind Regards,
> >>> Brandon
> >>>
> >>> On Thu, Mar 9, 2023 at 11:09 AM Jacek Lewandowski
> >>>  wrote:
> >>> >
> >>> > I vote for removing it entirely.
> >>> >
> >>> > thanks
> >>> > - - -- --- -  -
> >>> > Jacek Lewandowski
> >>> >
> >>> >
> >>> > czw., 9 mar 2023 o 18:07 Miklosovic, Stefan <
> stefan.mikloso...@netapp.com> napisał(a):
> >>> >>
> >>> >> Derek,
> >>> >>
> >>> >> I have couple more points ... I do not think that extracting it to
> a separate repository is "win". That code is on Hadoop 1.0.3. We would be
> spending a lot of work on extracting it just to extract 10 years old code
> with occasional updates (in my humble opinion just to make it compilable
> again if the code around changes). What good is in that? We would have one
> more place to take care of ... Now we at least have it all in one place.
> >>> >>
> >>> >> I believe we have four options:
> >>> >>
> >>> >> 1) leave it there so it will be like this is for next years with
> questionable and diminishing usage
> >>> >> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
> >>> >> 3) 2) and extract it to a separate repository but if we do 2) we
> can just leave it there
> >>> >> 4) remove it
> >>> >>
> >>> >> 
> >>> >> From: Derek Chen-Becker 
> >>> >> Sent: Thursday, March 9, 2023 15:55
> >>> >> To: dev@cassandra.apache.org
> >>> >> Subject: Re: Role of Hadoop code in Cassandra 5.0
> >>> >>
> >>> >> NetApp Security WARNING: This is an external email. Do not click
> links or open attachments unless you recognize the sender and know the
> content is safe.
> >>> >>
> >>> >>
> >>> >>
> >>> >> I think the question isn't "Who ... is still using that?" but more
> "are we actually going to support it?" If we're on a version that old it
> would appear that we've basically abandoned it, although there do appear to
> have been refactoring (for other things) commits in the last couple of
> years. I would be in favor of removal from 5.0, but at the very least,
> could it be moved into a separate repo/package so that it's not pulling a
> relatively large dependency subtree from Hadoop into our main codebase?
> >>> >>
> >>> >> Cheers,
> >>> >>
> >>> >> Derek
> >>> >>
> >>> >> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan <
> stefan.mikloso...@netapp.com> wrote:
> >>> >> Hi list,
> >>> >>
> >>> >> I stumbled upon Hadoop package again. I think there was some
> discussion about the relevancy of Hadoop code some time ago but I would
> like to ask this again.
> >>> >>
> >>> >> Do you think Hadoop code (1) is still relevant in 5.0? Who in the
> industry is still using that?
> >>> >>
> >>> >> We might drop a lot of code and some Hadoop dependencies too (3)
> (even their scope is "provided"). The version of Hadoop we build upon is
> 1.0.3 which was released 10 years ago. This code does not have any tests
> nor documentation on the website.
> >>> >>
> >>> >> There seems to be issues like this (2) and it seems like the
> solution is to, basically, use Spark Cassandra connector instead which I
> would say is quite reasonable.
> >>> >>
> >>> >> Regards
> >>> >>
> >>> >> (1)
> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
> >>> >> (2)

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Bowen Song via dev

   /When we attempt to rectify any bit-error by streaming data from
   peers, we implicitly take a lock on token ownership. A user needs to
   know that it is unsafe to change token ownership in a cluster that
   is currently in the process of repairing a corruption error on one
   of its instances' disks./

I'm not sure about this.

Based on my knowledge, streaming does not require a lock on token 
ownership. If the node subsequently loses ownership of the token range 
being streamed, it will just end up with some extra SSTable files 
containing useless data, and those files will get deleted when nodetool 
cleanup is run.


BTW, just pointing out the obvious: streaming is neither repairing nor 
bootstrapping. The latter two may require a lock on token ownership.


On 09/03/2023 19:56, Abe Ratnofsky wrote:
I'm not seeing any reasons why CEP-21 would make this more difficult 
to implement, besides the fact that it hasn't landed yet.


There are two major potential pitfalls that CEP-21 would help us avoid:
1. Bit-errors beget further bit-errors, so we ought to be resistant to 
a high frequency of corruption events
2. Avoid token ownership changes when attempting to stream a corrupted 
token


I found some data supporting (1) - 
https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2014/20140806_T1_Hetzler.pdf


If we detect bit-errors and store them in system_distributed, then we 
need a capacity to throttle that load and ensure that consistency is 
maintained.


When we attempt to rectify any bit-error by streaming data from peers, 
we implicitly take a lock on token ownership. A user needs to know 
that it is unsafe to change token ownership in a cluster that is 
currently in the process of repairing a corruption error on one of its 
instances' disks. CEP-21 makes this sequencing safe, and provides 
abstractions to better expose this information to operators.


--
Abe


On Mar 9, 2023, at 10:55 AM, Josh McKenzie  wrote:

> Personally, I'd like to see the fix for this issue come after
> CEP-21. It could be feasible to implement a fix before then, that
> detects bit-errors on the read path and refuses to respond to the
> coordinator, implicitly having speculative execution handle the
> retry against another replica while repair of that range happens.
> But that feels suboptimal to me when a better framework is on the
> horizon.
I originally typed something in agreement with you but the more I 
think about this, the more a node-local "reject queries for specific 
token ranges" degradation profile seems like it _could_ work. I don't 
see an obvious way to remove the need for a human-in-the-loop on 
fixing things in a pre-CEP-21 world without opening pandora's box 
(Gossip + TMD + non-deterministic agreement on ownership state 
cluster-wide /cry).


And even in a post CEP-21 world you're definitely in the "at what 
point is it better to declare a host dead and replace it" fuzzy 
territory where there's no immediately correct answers.


A system_distributed table of corrupt token ranges that are currently 
being rejected by replicas with a mechanism to kick off a repair of 
those ranges could be interesting.


On Thu, Mar 9, 2023, at 1:45 PM, Abe Ratnofsky wrote:
Thanks for proposing this discussion Bowen. I see a few different 
issues here:


1. How do we safely handle corruption of a handful of tokens without 
taking an entire instance offline for re-bootstrap? This includes 
refusal to serve read requests for the corrupted token(s), and 
correct repair of the data.
2. How do we expose the corruption rate to operators, in a way that 
lets them decide whether a full disk replacement is worthwhile?
3. When CEP-21 lands it should become feasible to support ownership 
draining, which would let us migrate read traffic for a given token 
range away from an instance where that range is corrupted. Is it 
worth planning a fix for this issue before CEP-21 lands?


I'm also curious whether there's any existing literature on how 
different filesystems and storage media accommodate bit-errors 
(correctable and uncorrectable), so we can be consistent with those 
behaviors.


Personally, I'd like to see the fix for this issue come after 
CEP-21. It could be feasible to implement a fix before then, that 
detects bit-errors on the read path and refuses to respond to the 
coordinator, implicitly having speculative execution handle the 
retry against another replica while repair of that range happens. 
But that feels suboptimal to me when a better framework is on the 
horizon.


--
Abe

On Mar 9, 2023, at 8:23 AM, Bowen Song via dev 
 wrote:


Hi Jeremiah,

I'm fully aware of that, which is why I said that deleting the 
affected SSTable files is "less safe".


If the "bad blocks" logic is implemented and the node abort the 
current read query when hitting a bad block, it should remain safe, 
as the data in other SSTable files will not be used. The streamed 
data should contain the unexpired tombstones, and that's 

New episode of The Apache Cassandra (R) Corner podcast!

2023-03-09 Thread Aaron Ploetz
Link to the next episode:
https://drive.google.com/file/d/1IePasf681bU-7xRNl4tBzWvVG28y4tQK/view?usp=share_link

s2Ep3 - Loren Sands-Ramshaw
(You may have to download it to play)

FYI - Experimenting with a video podcast on this one.

It will remain in staging for 72 hours, going live (assuming no objections)
by Sunday, March 12th (17:00 UTC).

If anyone should have any questions or comments, or if you want to be a
guest, please reach out to me.

For my guest pipeline, I have recording sessions scheduled with:
- Valeri Karpov (MeanIT Software)

Looking for additional guests, so if you know someone who has a great use
case, let me know!

Thanks, everyone!

Aaron Ploetz


Re: [EXTERNAL] Re: [DISCUSS] Next release date

2023-03-09 Thread Mick Semb Wever
> > > One place we've been weak historically is in distinguishing between 
> > > tickets we consider "nice to have" and things that are "blockers". We 
> > > don't have any metadata that currently distinguishes those two, so 
> > > determining what our burndown leading up to 5.0 looks like is a lot more 
> > > data massaging and hand-waving than I'd prefer right now.
> >
> > We distinguish "blockers" with `Priority=Urgent` or `Severity=Critical`, or 
> > by linking the ticket as blocking to a specific ticket that spells it out. 
> > We do have the metadata, but yes it requires some work…
>
> For everything not urgent or a blocker, does it matter whether something has 
> a fixver of where we think it's going to land or where we'd like to see it 
> land? At the end of the day, neither of those scenarios will actually shift a 
> release date if we're proactively putting "blocker / urgent" status on new 
> features, improvements, and bugs we think are significant enough to delay a 
> release right?


Ooops, actually we were using the -beta and -rc fixVersion
placeholders to denote the blockers once "the bridge was crossed"
(while Urgent and Critical are used more broadly, e.g. for patch
releases). If we use this approach, then we could add a 5.0-alpha
placeholder that indicates a consensus on tickets blocking the
branching (if we agree alpha1 should be cut at the same time we
branch…). IMHO such tickets should also still be marked as Urgent, but
I suggest we use Urgent/Critical as an initial state, and the
fixVersion placeholders where we have consensus or where our release
criteria dictate it :shrug:


Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Abe Ratnofsky
> there's a point at which a host limping along is better put down and replaced

I did a basic literature review and it looks like load (total program-erase 
cycles), disk age, and operating temperature all lead to BER (bit error rate) 
increases. We don't need to build a whole model of disk failure; we could 
probably get a lot of mileage out of a warn / failure threshold for the number 
of automatic corruption repairs.

Under this model, Cassandra could automatically repair X (3?) corruption events 
before warning a user ("time to replace this host"), and Y (10?) corruption 
events before forcing itself down.
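
To illustrate (purely hypothetical names; the X / Y values and the
disk_failure_policy hook are placeholders, not real Cassandra code), the
counting side of that could be as simple as:

    import java.util.concurrent.atomic.AtomicInteger;

    // Sketch only: warns after X automatic corruption repairs, asks the
    // configured failure policy to take the node down after Y.
    public class CorruptionRepairTracker
    {
        private final int warnThreshold;  // "X", e.g. 3
        private final int stopThreshold;  // "Y", e.g. 10
        private final AtomicInteger repairs = new AtomicInteger();

        public CorruptionRepairTracker(int warnThreshold, int stopThreshold)
        {
            this.warnThreshold = warnThreshold;
            this.stopThreshold = stopThreshold;
        }

        // Called after each automatic corruption repair completes.
        public void onCorruptionRepaired()
        {
            int count = repairs.incrementAndGet();
            if (count >= stopThreshold)
                System.err.println(count + " corruption repairs; forcing the node down per disk_failure_policy (placeholder)");
            else if (count >= warnThreshold)
                System.out.println(count + " corruption repairs; time to replace this host");
        }
    }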

But it would be good to get a better sense of user expectations here. Bowen - 
how would you want Cassandra to handle frequent disk corruption events?

--
Abe

> On Mar 9, 2023, at 12:44 PM, Josh McKenzie  wrote:
> 
>> I'm not seeing any reasons why CEP-21 would make this more difficult to 
>> implement
> I think I communicated poorly - I was just trying to point out that there's a 
> point at which a host limping along is better put down and replaced than 
> piecemeal flagging range after range dead and working around it, and there's 
> no immediately obvious "Correct" answer to where that point is regardless of 
> what mechanism we're using to hold a cluster-wide view of topology.
> 
>> ...CEP-21 makes this sequencing safe...
> For sure - I wouldn't advocate for any kind of "automated corrupt data 
> repair" in a pre-CEP-21 world.
> 
> On Thu, Mar 9, 2023, at 2:56 PM, Abe Ratnofsky wrote:
>> I'm not seeing any reasons why CEP-21 would make this more difficult to 
>> implement, besides the fact that it hasn't landed yet.
>> 
>> There are two major potential pitfalls that CEP-21 would help us avoid:
>> 1. Bit-errors beget further bit-errors, so we ought to be resistant to a 
>> high frequency of corruption events
>> 2. Avoid token ownership changes when attempting to stream a corrupted token
>> 
>> I found some data supporting (1) - 
>> https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2014/20140806_T1_Hetzler.pdf
>> 
>> If we detect bit-errors and store them in system_distributed, then we need a 
>> capacity to throttle that load and ensure that consistency is maintained.
>> 
>> When we attempt to rectify any bit-error by streaming data from peers, we 
>> implicitly take a lock on token ownership. A user needs to know that it is 
>> unsafe to change token ownership in a cluster that is currently in the 
>> process of repairing a corruption error on one of its instances' disks. 
>> CEP-21 makes this sequencing safe, and provides abstractions to better 
>> expose this information to operators.
>> 
>> --
>> Abe
>> 
>>> On Mar 9, 2023, at 10:55 AM, Josh McKenzie  wrote:
>>> 
 Personally, I'd like to see the fix for this issue come after CEP-21. It 
 could be feasible to implement a fix before then, that detects bit-errors 
 on the read path and refuses to respond to the coordinator, implicitly 
 having speculative execution handle the retry against another replica 
 while repair of that range happens. But that feels suboptimal to me when a 
 better framework is on the horizon.
>>> I originally typed something in agreement with you but the more I think 
>>> about this, the more a node-local "reject queries for specific token 
>>> ranges" degradation profile seems like it _could_ work. I don't see an 
>>> obvious way to remove the need for a human-in-the-loop on fixing things in 
>>> a pre-CEP-21 world without opening pandora's box (Gossip + TMD + 
>>> non-deterministic agreement on ownership state cluster-wide /cry).
>>> 
>>> And even in a post CEP-21 world you're definitely in the "at what point is 
>>> it better to declare a host dead and replace it" fuzzy territory where 
>>> there's no immediately correct answers.
>>> 
>>> A system_distributed table of corrupt token ranges that are currently being 
>>> rejected by replicas with a mechanism to kick off a repair of those ranges 
>>> could be interesting.
>>> 
>>> On Thu, Mar 9, 2023, at 1:45 PM, Abe Ratnofsky wrote:
 Thanks for proposing this discussion Bowen. I see a few different issues 
 here:
 
 1. How do we safely handle corruption of a handful of tokens without 
 taking an entire instance offline for re-bootstrap? This includes refusal 
 to serve read requests for the corrupted token(s), and correct repair of 
 the data.
 2. How do we expose the corruption rate to operators, in a way that lets 
 them decide whether a full disk replacement is worthwhile?
 3. When CEP-21 lands it should become feasible to support ownership 
 draining, which would let us migrate read traffic for a given token range 
 away from an instance where that range is corrupted. Is it worth planning 
 a fix for this issue before CEP-21 lands?
 
 I'm also curious whether there's any existing literature on how different 
 filesystems and storage media accommodate bit-errors (correctable and 

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Josh McKenzie
> I'm not seeing any reasons why CEP-21 would make this more difficult to 
> implement
I think I communicated poorly - I was just trying to point out that there's a 
point at which a host limping along is better put down and replaced than 
piecemeal flagging range after range dead and working around it, and there's no 
immediately obvious "Correct" answer to where that point is regardless of what 
mechanism we're using to hold a cluster-wide view of topology.

> ...CEP-21 makes this sequencing safe...
For sure - I wouldn't advocate for any kind of "automated corrupt data repair" 
in a pre-CEP-21 world.

On Thu, Mar 9, 2023, at 2:56 PM, Abe Ratnofsky wrote:
> I'm not seeing any reasons why CEP-21 would make this more difficult to 
> implement, besides the fact that it hasn't landed yet.
> 
> There are two major potential pitfalls that CEP-21 would help us avoid:
> 1. Bit-errors beget further bit-errors, so we ought to be resistant to a high 
> frequency of corruption events
> 2. Avoid token ownership changes when attempting to stream a corrupted token
> 
> I found some data supporting (1) - 
> https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2014/20140806_T1_Hetzler.pdf
> 
> If we detect bit-errors and store them in system_distributed, then we need a 
> capacity to throttle that load and ensure that consistency is maintained.
> 
> When we attempt to rectify any bit-error by streaming data from peers, we 
> implicitly take a lock on token ownership. A user needs to know that it is 
> unsafe to change token ownership in a cluster that is currently in the 
> process of repairing a corruption error on one of its instances' disks. 
> CEP-21 makes this sequencing safe, and provides abstractions to better expose 
> this information to operators.
> 
> --
> Abe
> 
>> On Mar 9, 2023, at 10:55 AM, Josh McKenzie  wrote:
>> 
>>> Personally, I'd like to see the fix for this issue come after CEP-21. It 
>>> could be feasible to implement a fix before then, that detects bit-errors 
>>> on the read path and refuses to respond to the coordinator, implicitly 
>>> having speculative execution handle the retry against another replica while 
>>> repair of that range happens. But that feels suboptimal to me when a better 
>>> framework is on the horizon.
>> I originally typed something in agreement with you but the more I think 
>> about this, the more a node-local "reject queries for specific token ranges" 
>> degradation profile seems like it _could_ work. I don't see an obvious way 
>> to remove the need for a human-in-the-loop on fixing things in a pre-CEP-21 
>> world without opening pandora's box (Gossip + TMD + non-deterministic 
>> agreement on ownership state cluster-wide /cry).
>> 
>> And even in a post CEP-21 world you're definitely in the "at what point is 
>> it better to declare a host dead and replace it" fuzzy territory where 
>> there's no immediately correct answers.
>> 
>> A system_distributed table of corrupt token ranges that are currently being 
>> rejected by replicas with a mechanism to kick off a repair of those ranges 
>> could be interesting.
>> 
>> On Thu, Mar 9, 2023, at 1:45 PM, Abe Ratnofsky wrote:
>>> Thanks for proposing this discussion Bowen. I see a few different issues 
>>> here:
>>> 
>>> 1. How do we safely handle corruption of a handful of tokens without taking 
>>> an entire instance offline for re-bootstrap? This includes refusal to serve 
>>> read requests for the corrupted token(s), and correct repair of the data.
>>> 2. How do we expose the corruption rate to operators, in a way that lets 
>>> them decide whether a full disk replacement is worthwhile?
>>> 3. When CEP-21 lands it should become feasible to support ownership 
>>> draining, which would let us migrate read traffic for a given token range 
>>> away from an instance where that range is corrupted. Is it worth planning a 
>>> fix for this issue before CEP-21 lands?
>>> 
>>> I'm also curious whether there's any existing literature on how different 
>>> filesystems and storage media accommodate bit-errors (correctable and 
>>> uncorrectable), so we can be consistent with those behaviors.
>>> 
>>> Personally, I'd like to see the fix for this issue come after CEP-21. It 
>>> could be feasible to implement a fix before then, that detects bit-errors 
>>> on the read path and refuses to respond to the coordinator, implicitly 
>>> having speculative execution handle the retry against another replica while 
>>> repair of that range happens. But that feels suboptimal to me when a better 
>>> framework is on the horizon.
>>> 
>>> --
>>> Abe
>>> 
 On Mar 9, 2023, at 8:23 AM, Bowen Song via dev  
 wrote:
 
 Hi Jeremiah,
 
 I'm fully aware of that, which is why I said that deleting the affected 
 SSTable files is "less safe".
 
 If the "bad blocks" logic is implemented and the node abort the current 
 read query when hitting a bad block, it should remain safe, as the data in 
 other 

Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Derek Chen-Becker
Honestly, I don't think moving it out in its current state is a win,
either. I'm +1 to deprecation in 4.1.x and removal in 5.0. If someone in
the community wants or needs the Hadoop code, it should be in a separate
repo/package, just like the Spark Connector.

Derek

On Thu, Mar 9, 2023 at 10:07 AM Miklosovic, Stefan <
stefan.mikloso...@netapp.com> wrote:

> Derek,
>
> I have couple more points ... I do not think that extracting it to a
> separate repository is "win". That code is on Hadoop 1.0.3. We would be
> spending a lot of work on extracting it just to extract 10 years old code
> with occasional updates (in my humble opinion just to make it compilable
> again if the code around changes). What good is in that? We would have one
> more place to take care of ... Now we at least have it all in one place.
>
> I believe we have four options:
>
> 1) leave it there so it will be like this is for next years with
> questionable and diminishing usage
> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
> 3) 2) and extract it to a separate repository but if we do 2) we can just
> leave it there
> 4) remove it
>
> 
> From: Derek Chen-Becker 
> Sent: Thursday, March 9, 2023 15:55
> To: dev@cassandra.apache.org
> Subject: Re: Role of Hadoop code in Cassandra 5.0
>
> NetApp Security WARNING: This is an external email. Do not click links or
> open attachments unless you recognize the sender and know the content is
> safe.
>
>
>
> I think the question isn't "Who ... is still using that?" but more "are we
> actually going to support it?" If we're on a version that old it would
> appear that we've basically abandoned it, although there do appear to have
> been refactoring (for other things) commits in the last couple of years. I
> would be in favor of removal from 5.0, but at the very least, could it be
> moved into a separate repo/package so that it's not pulling a relatively
> large dependency subtree from Hadoop into our main codebase?
>
> Cheers,
>
> Derek
>
> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan <
> stefan.mikloso...@netapp.com> wrote:
> Hi list,
>
> I stumbled upon Hadoop package again. I think there was some discussion
> about the relevancy of Hadoop code some time ago but I would like to ask
> this again.
>
> Do you think Hadoop code (1) is still relevant in 5.0? Who in the industry
> is still using that?
>
> We might drop a lot of code and some Hadoop dependencies too (3) (even
> their scope is "provided"). The version of Hadoop we build upon is 1.0.3
> which was released 10 years ago. This code does not have any tests nor
> documentation on the website.
>
> There seems to be issues like this (2) and it seems like the solution is
> to, basically, use Spark Cassandra connector instead which I would say is
> quite reasonable.
>
> Regards
>
> (1)
> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
> (3)
> https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589
>
>
> --
> +---+
> | Derek Chen-Becker |
> | GPG Key available at https://keybase.io/dchenbecker and   |
> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> +---+
>
>

-- 
+---+
| Derek Chen-Becker |
| GPG Key available at https://keybase.io/dchenbecker and   |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---+


Re: [EXTERNAL] Re: [DISCUSS] Next release date

2023-03-09 Thread Josh McKenzie
> We do have the metadata, but yes it requires some work…
My wording was poor; we have the *potential* to have this metadata, but to my 
knowledge we haven't built the muscle of consistently setting it, or any kind of 
heuristic to determine when something should block a release or not. At least 
on 4.0 and 4.1, it seemed this was a bridge we crossed informally in the run-up 
to a date, trying to figure out what to include or discard.

> The project previously made an agreement to one release a year,
I don't recall the details (and searching our... rather active threads is an 
undertaking) - our site has a blog post here: 
https://cassandra.apache.org/_/blog/Apache-Cassandra-Changelog-7-May-2021.html, 
that states: "The community has agreed to one release every year, plus periodic 
trunk snapshots". While it reads like "one a calendar year" to me, at the end 
of the day what's important to me is we do right by our users. So whether we 
interpret that as every 12 months, once per calendar year, or once every July with 
a freeze in May (train style), it's all fine by me actually. I more or less stand by 
"just not 'release monthly' and not 'release once every three years'". :) Got 
any clarity there?

> I (and others) wish to do the exercise of running through our 5.x list and 
> pushing out everything we can see with no commitment or activity (and also 
> closing out old and now irrelevant/inapplicable tickets) (and this will be 
> done via a proposed filter). But a question here is the fixVersion can infer 
> where a ticket can be applied (appropriateness) or where we foresee it 
> landing (roadmap). 
I'm +1 to this. If people want something to be different they can just toggle 
it back and bring it to the ML or slack.

For everything not urgent or a blocker, does it matter whether something has a 
fixver of where we think it's going to land or where we'd like to see it land? 
At the end of the day, neither of those scenarios will actually shift a release 
date if we're proactively putting "blocker / urgent" status on new features, 
improvements, and bugs we think are significant enough to delay a release, right?

On Thu, Mar 9, 2023, at 3:17 PM, Mick Semb Wever wrote:
>> One place we've been weak historically is in distinguishing between tickets 
>> we consider "nice to have" and things that are "blockers". We don't have any 
>> metadata that currently distinguishes those two, so determining what our 
>> burndown leading up to 5.0 looks like is a lot more data massaging and 
>> hand-waving than I'd prefer right now.
> 
> 
> We distinguish "blockers" with `Priority=Urgent` or `Severity=Critical`, or 
> by linking the ticket as blocking to a specific ticket that spells it out. We 
> do have the metadata, but yes it requires some work…
> 
> The project previously made an agreement to one release a year, akin to a 
> release train model, which helps justify why fixVersion 5.x has just fallen 
> to be "next". (And then there is no "burn-down" in such a model.) 
> 
> Our release criteria, especially post-branch, demonstrates that we do 
> introduce and rely on "blockers". If we agree that certain exceptional CEPs 
> are "blockers", a la warrant delaying the release date, using this approach 
> seems to fit in appropriately.
> 
> When I (just) folded fixVersion 4.2 into 5.0 (and 4.x into 5.x), I also 
> created 5.1.x and 6.x.  I (and others) wish to do the exercise of running 
> through our 5.x list and pushing out everything we can see with no commitment 
> or activity (and also closing out old and now irrelevant/inapplicable 
> tickets) (and this will be done via a proposed filter). But a question here 
> is the fixVersion can infer where a ticket can be applied (appropriateness) 
> or where we foresee it landing (roadmap). For example we mark bugs with the 
> fixVersions ideally they should be applied to, regardless of whether anyone 
> comes to address them or not. 
> 
> 
> 


Re: [EXTERNAL] Re: [DISCUSS] Next release date

2023-03-09 Thread Mick Semb Wever
>
> One place we've been weak historically is in distinguishing between
> tickets we consider "nice to have" and things that are "blockers". We don't
> have any metadata that currently distinguishes those two, so determining
> what our burndown leading up to 5.0 looks like is a lot more data massaging
> and hand-waving than I'd prefer right now.
>


We distinguish "blockers" with `Priority=Urgent` or `Severity=Critical`, or
by linking the ticket as blocking to a specific ticket that spells it out.
We do have the metadata, but yes it requires some work…

The project previously made an agreement to one release a year, akin to a
release train model, which helps justify why fixVersion 5.x has just fallen
to be "next". (And then there is no "burn-down" in such a model.)

Our release criteria, especially post-branch, demonstrates that we do
introduce and rely on "blockers". If we agree that certain exceptional CEPs
are "blockers", a la warrant delaying the release date, using this approach
seems to fit in appropriately.

When I (just) folded fixVersion 4.2 into 5.0 (and 4.x into 5.x), I also
created 5.1.x and 6.x.  I (and others) wish to do the exercise of running
through our 5.x list and pushing out everything we can see with no
commitment or activity (and also closing out old and now
irrelevant/inapplicable tickets) (and this will be done via a proposed
filter). But a question here is the fixVersion can infer where a ticket can
be applied (appropriateness) or where we foresee it landing (roadmap). For
example we mark bugs with the fixVersions ideally they should be applied
to, regardless of whether anyone comes to address them or not.


Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Jacek Lewandowski
Is there a ticket for that?

- - -- --- -  -
Jacek Lewandowski


czw., 9 mar 2023 o 20:27 Mick Semb Wever  napisał(a):

>
>
> On Thu, 9 Mar 2023 at 18:54, Brandon Williams  wrote:
>
>> I think if we reach consensus here that decides it. I too vote to
>> deprecate in 4.1.x.  This means we would remove it in 5.0.
>>
>
>
> +1
>
>
>


Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Abe Ratnofsky
I'm not seeing any reasons why CEP-21 would make this more difficult to 
implement, besides the fact that it hasn't landed yet.

There are two major potential pitfalls that CEP-21 would help us avoid:
1. Bit-errors beget further bit-errors, so we ought to be resistant to a high 
frequency of corruption events
2. Avoid token ownership changes when attempting to stream a corrupted token

I found some data supporting (1) - 
https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2014/20140806_T1_Hetzler.pdf

If we detect bit-errors and store them in system_distributed, then we need a 
capacity to throttle that load and ensure that consistency is maintained.

When we attempt to rectify any bit-error by streaming data from peers, we 
implicitly take a lock on token ownership. A user needs to know that it is 
unsafe to change token ownership in a cluster that is currently in the process 
of repairing a corruption error on one of its instances' disks. CEP-21 makes 
this sequencing safe, and provides abstractions to better expose this 
information to operators.

--
Abe

> On Mar 9, 2023, at 10:55 AM, Josh McKenzie  wrote:
> 
>> Personally, I'd like to see the fix for this issue come after CEP-21. It 
>> could be feasible to implement a fix before then, that detects bit-errors on 
>> the read path and refuses to respond to the coordinator, implicitly having 
>> speculative execution handle the retry against another replica while repair 
>> of that range happens. But that feels suboptimal to me when a better 
>> framework is on the horizon.
> I originally typed something in agreement with you but the more I think about 
> this, the more a node-local "reject queries for specific token ranges" 
> degradation profile seems like it _could_ work. I don't see an obvious way to 
> remove the need for a human-in-the-loop on fixing things in a pre-CEP-21 
> world without opening pandora's box (Gossip + TMD + non-deterministic 
> agreement on ownership state cluster-wide /cry).
> 
> And even in a post CEP-21 world you're definitely in the "at what point is it 
> better to declare a host dead and replace it" fuzzy territory where there's 
> no immediately correct answers.
> 
> A system_distributed table of corrupt token ranges that are currently being 
> rejected by replicas with a mechanism to kick off a repair of those ranges 
> could be interesting.
> 
> On Thu, Mar 9, 2023, at 1:45 PM, Abe Ratnofsky wrote:
>> Thanks for proposing this discussion Bowen. I see a few different issues 
>> here:
>> 
>> 1. How do we safely handle corruption of a handful of tokens without taking 
>> an entire instance offline for re-bootstrap? This includes refusal to serve 
>> read requests for the corrupted token(s), and correct repair of the data.
>> 2. How do we expose the corruption rate to operators, in a way that lets 
>> them decide whether a full disk replacement is worthwhile?
>> 3. When CEP-21 lands it should become feasible to support ownership 
>> draining, which would let us migrate read traffic for a given token range 
>> away from an instance where that range is corrupted. Is it worth planning a 
>> fix for this issue before CEP-21 lands?
>> 
>> I'm also curious whether there's any existing literature on how different 
>> filesystems and storage media accommodate bit-errors (correctable and 
>> uncorrectable), so we can be consistent with those behaviors.
>> 
>> Personally, I'd like to see the fix for this issue come after CEP-21. It 
>> could be feasible to implement a fix before then, that detects bit-errors on 
>> the read path and refuses to respond to the coordinator, implicitly having 
>> speculative execution handle the retry against another replica while repair 
>> of that range happens. But that feels suboptimal to me when a better 
>> framework is on the horizon.
>> 
>> --
>> Abe
>> 
>>> On Mar 9, 2023, at 8:23 AM, Bowen Song via dev  
>>> wrote:
>>> 
>>> Hi Jeremiah,
>>> 
>>> I'm fully aware of that, which is why I said that deleting the affected 
>>> SSTable files is "less safe".
>>> 
>>> If the "bad blocks" logic is implemented and the node abort the current 
>>> read query when hitting a bad block, it should remain safe, as the data in 
>>> other SSTable files will not be used. The streamed data should contain the 
>>> unexpired tombstones, and that's enough to keep the data consistent on the 
>>> node.
>>> 
>>> 
>>> Cheers,
>>> Bowen
>>> 
>>> 
>>> 
>>> On 09/03/2023 15:58, Jeremiah D Jordan wrote:
 It is actually more complicated than just removing the sstable and running 
 repair.
 
 In the face of expired tombstones that might be covering data in other 
 sstables the only safe way to deal with a bad sstable is wipe the token 
 range in the bad sstable and rebuild/bootstrap that range (or wipe/rebuild 
 the whole node which is usually the easier way).  If there are expired 
 tombstones in play, it means they could have already been compacted away 
 on the other 

Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Mick Semb Wever
On Thu, 9 Mar 2023 at 18:54, Brandon Williams  wrote:

> I think if we reach consensus here that decides it. I too vote to
> deprecate in 4.1.x.  This means we would remove it in 5.0.
>


+1


Re: [EXTERNAL] Re: [DISCUSS] Next release date

2023-03-09 Thread Ekaterina Dimitrova
There is also this roadmap page, but we haven’t updated it lately. It
still contains 4.1 updates from last year.

https://cwiki.apache.org/confluence/display/CASSANDRA/Roadmap

On Thu, 9 Mar 2023 at 13:51, Josh McKenzie  wrote:

> Added an "Epics" quick filter; could help visualize what our high priority
> features are for given releases:
>
>
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484=2649
>
> Our cumulative flow diagram of 5.0 related tickets is pretty large.
> Probably not a great indicator for the body of what we expect to land in
> the release:
>
>
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484=CASSANDRA=reporting=cumulativeFlowDiagram=1212=1412=1413=2116=2117=2118=2130=2133=2124=2127=2021-12-20=2023-03-09
>
> One place we've been weak historically is in distinguishing between
> tickets we consider "nice to have" and things that are "blockers". We don't
> have any metadata that currently distinguishes those two, so determining
> what our burndown leading up to 5.0 looks like is a lot more data massaging
> and hand-waving than I'd prefer right now.
>
> I've been deep on some other issues for awhile but hope to get more
> involved in this space + ci within the next month or so.
>
> On Thu, Mar 9, 2023, at 9:15 AM, Mick Semb Wever wrote:
>
> I've also found some useful Cassandra's JIRA dashboards for previous
> releases to track progress and scope, but we don't have anything
> similar for the next release. Should we create it?
> Cassandra 4.0GAScope
> Cassandra 4.1 GA scope
>
>
>
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484
>
>
>


Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Josh McKenzie
> Personally, I'd like to see the fix for this issue come after CEP-21. It 
> could be feasible to implement a fix before then, that detects bit-errors on 
> the read path and refuses to respond to the coordinator, implicitly having 
> speculative execution handle the retry against another replica while repair 
> of that range happens. But that feels suboptimal to me when a better 
> framework is on the horizon.
I originally typed something in agreement with you but the more I think about 
this, the more a node-local "reject queries for specific token ranges" 
degradation profile seems like it _could_ work. I don't see an obvious way to 
remove the need for a human-in-the-loop on fixing things in a pre-CEP-21 world 
without opening pandora's box (Gossip + TMD + non-deterministic agreement on 
ownership state cluster-wide /cry).

And even in a post CEP-21 world you're definitely in the "at what point is it 
better to declare a host dead and replace it" fuzzy territory where there are 
no immediately correct answers.

A system_distributed table of corrupt token ranges that are currently being 
rejected by replicas with a mechanism to kick off a repair of those ranges 
could be interesting.
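
As a sketch of what that might look like (a hypothetical schema - no such
table exists today, and in practice Cassandra itself, not a client, would
create and maintain it; the DataStax Java driver calls below are only there
to make the shape of the data concrete):

    import com.datastax.oss.driver.api.core.CqlSession;

    // Hypothetical sketch only: a client may not even be allowed to create
    // tables in system_distributed; this just shows the shape of the data.
    public class CorruptRangeRegistrySketch
    {
        public static void main(String[] args)
        {
            try (CqlSession session = CqlSession.builder().build())
            {
                session.execute(
                    "CREATE TABLE IF NOT EXISTS system_distributed.corrupt_token_ranges (" +
                    "  keyspace_name text, table_name text," +
                    "  range_start bigint, range_end bigint," +   // Murmur3 token bounds
                    "  replica inet, detected_at timestamp," +
                    "  PRIMARY KEY ((keyspace_name, table_name), range_start, range_end, replica))");

                // A replica that hit a bad block registers the affected range; a repair
                // job could then scan this table to decide which ranges to stream.
                session.execute(
                    "INSERT INTO system_distributed.corrupt_token_ranges " +
                    "(keyspace_name, table_name, range_start, range_end, replica, detected_at) " +
                    "VALUES ('ks', 'tbl', -1537228672809129302, -1537228672809129301, '10.0.0.1', toTimestamp(now()))");
            }
        }
    }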

On Thu, Mar 9, 2023, at 1:45 PM, Abe Ratnofsky wrote:
> Thanks for proposing this discussion Bowen. I see a few different issues here:
> 
> 1. How do we safely handle corruption of a handful of tokens without taking 
> an entire instance offline for re-bootstrap? This includes refusal to serve 
> read requests for the corrupted token(s), and correct repair of the data.
> 2. How do we expose the corruption rate to operators, in a way that lets them 
> decide whether a full disk replacement is worthwhile?
> 3. When CEP-21 lands it should become feasible to support ownership draining, 
> which would let us migrate read traffic for a given token range away from an 
> instance where that range is corrupted. Is it worth planning a fix for this 
> issue before CEP-21 lands?
> 
> I'm also curious whether there's any existing literature on how different 
> filesystems and storage media accommodate bit-errors (correctable and 
> uncorrectable), so we can be consistent with those behaviors.
> 
> Personally, I'd like to see the fix for this issue come after CEP-21. It 
> could be feasible to implement a fix before then, that detects bit-errors on 
> the read path and refuses to respond to the coordinator, implicitly having 
> speculative execution handle the retry against another replica while repair 
> of that range happens. But that feels suboptimal to me when a better 
> framework is on the horizon.
> 
> --
> Abe
> 
>> On Mar 9, 2023, at 8:23 AM, Bowen Song via dev  
>> wrote:
>> 
>> Hi Jeremiah,
>> 
>> I'm fully aware of that, which is why I said that deleting the affected 
>> SSTable files is "less safe".
>> 
>> If the "bad blocks" logic is implemented and the node abort the current read 
>> query when hitting a bad block, it should remain safe, as the data in other 
>> SSTable files will not be used. The streamed data should contain the 
>> unexpired tombstones, and that's enough to keep the data consistent on the 
>> node.
>> 
>> 
>> Cheers,
>> Bowen
>> 
>> 
>> 
>> On 09/03/2023 15:58, Jeremiah D Jordan wrote:
>>> It is actually more complicated than just removing the sstable and running 
>>> repair.
>>> 
>>> In the face of expired tombstones that might be covering data in other 
>>> sstables the only safe way to deal with a bad sstable is wipe the token 
>>> range in the bad sstable and rebuild/bootstrap that range (or wipe/rebuild 
>>> the whole node which is usually the easier way).  If there are expired 
>>> tombstones in play, it means they could have already been compacted away on 
>>> the other replicas, but may not have compacted away on the current replica, 
>>> meaning the data they cover could still be present in other sstables on 
>>> this node.  Removing the sstable will mean resurrecting that data.  And 
>>> pulling the range from other nodes does not help because they can have 
>>> already compacted away the tombstone, so you won’t get it back.
>>> 
>>> Tl;DR you can’t just remove the one sstable you have to remove all data in 
>>> the token range covered by the sstable (aka all data that sstable may have 
>>> had a tombstone covering).  Then you can stream from the other nodes to get 
>>> the data back.
>>> 
>>> -Jeremiah
>>> 
 On Mar 8, 2023, at 7:24 AM, Bowen Song via dev  
 wrote:
 
 At the moment, when a read error, such as unrecoverable bit error or data 
 corruption, occurs in the SSTable data files, regardless of the 
 disk_failure_policy configuration, manual (or to be precise, external) 
 intervention is required to recover from the error.
 
 Commonly, there are two approaches to recover from such an error:
 
  1. The safer, but slower recovery strategy: replace the entire node.
  2. The less safe, but faster recovery strategy: shut down the node, delete 
 the 

Re: [EXTERNAL] Re: [DISCUSS] Next release date

2023-03-09 Thread Josh McKenzie
Added an "Epics" quick filter; could help visualize what our high priority 
features are for given releases:

https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484=2649

Our cumulative flow diagram of 5.0 related tickets is pretty large. Probably 
not a great indicator for the body of what we expect to land in the release:

https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484=CASSANDRA=reporting=cumulativeFlowDiagram=1212=1412=1413=2116=2117=2118=2130=2133=2124=2127=2021-12-20=2023-03-09

One place we've been weak historically is in distinguishing between tickets we 
consider "nice to have" and things that are "blockers". We don't have any 
metadata that currently distinguishes those two, so determining what our 
burndown leading up to 5.0 looks like is a lot more data massaging and 
hand-waving than I'd prefer right now.

I've been deep on some other issues for awhile but hope to get more involved in 
this space + ci within the next month or so.

On Thu, Mar 9, 2023, at 9:15 AM, Mick Semb Wever wrote:
>> I've also found some useful Cassandra's JIRA dashboards for previous
>> releases to track progress and scope, but we don't have anything
>> similar for the next release. Should we create it?
>> Cassandra 4.0GAScope
>> Cassandra 4.1 GA scope
> 
> 
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484  
> 


Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Francisco Guerrero
+1 (nb) for deprecation in 4.x and removal in 5.0

On 2023/03/09 18:04:27 Jeremy Hanna wrote:
> +1 from me to deprecate in 4.x and remove in 5.0.
> 
> > On Mar 9, 2023, at 12:01 PM, J. D. Jordan  wrote:
> > 
> > +1 from me to deprecate in 4.x and remove in 5.0.
> > 
> > -Jeremiah
> > 
> >> On Mar 9, 2023, at 11:53 AM, Brandon Williams  wrote:
> >> 
> >> I think if we reach consensus here that decides it. I too vote to
> >> deprecate in 4.1.x.  This means we would remove it in 5.0.
> >> 
> >> Kind Regards,
> >> Brandon
> >> 
> >>> On Thu, Mar 9, 2023 at 11:32 AM Ekaterina Dimitrova
> >>>  wrote:
> >>> 
> >>> Deprecation sounds good to me, but I am not completely sure in which 
> >>> version we can do it. If it is possible to add a deprecation warning in 
> >>> the 4.x series or at least 4.1.x - I vote for that.
> >>> 
>  On Thu, 9 Mar 2023 at 12:14, Jacek Lewandowski 
>   wrote:
>  
>  Is it possible to deprecate it in the 4.1.x patch release? :)
>  
>  
>  - - -- --- -  -
>  Jacek Lewandowski
>  
>  
>  czw., 9 mar 2023 o 18:11 Brandon Williams  napisał(a):
> > 
> > This is my feeling too, but I think we should accomplish this by
> > deprecating it first.  I don't expect anything will change after the
> > deprecation period.
> > 
> > Kind Regards,
> > Brandon
> > 
> > On Thu, Mar 9, 2023 at 11:09 AM Jacek Lewandowski
> >  wrote:
> >> 
> >> I vote for removing it entirely.
> >> 
> >> thanks
> >> - - -- --- -  -
> >> Jacek Lewandowski
> >> 
> >> 
> >> czw., 9 mar 2023 o 18:07 Miklosovic, Stefan 
> >>  napisał(a):
> >>> 
> >>> Derek,
> >>> 
> >>> I have couple more points ... I do not think that extracting it to a 
> >>> separate repository is "win". That code is on Hadoop 1.0.3. We would 
> >>> be spending a lot of work on extracting it just to extract 10 years 
> >>> old code with occasional updates (in my humble opinion just to make 
> >>> it compilable again if the code around changes). What good is in 
> >>> that? We would have one more place to take care of ... Now we at 
> >>> least have it all in one place.
> >>> 
> >>> I believe we have four options:
> >>> 
> >>> 1) leave it there so it will be like this is for next years with 
> >>> questionable and diminishing usage
> >>> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
> >>> 3) 2) and extract it to a separate repository but if we do 2) we can 
> >>> just leave it there
> >>> 4) remove it
> >>> 
> >>> 
> >>> From: Derek Chen-Becker 
> >>> Sent: Thursday, March 9, 2023 15:55
> >>> To: dev@cassandra.apache.org
> >>> Subject: Re: Role of Hadoop code in Cassandra 5.0
> >>> 
> >>> NetApp Security WARNING: This is an external email. Do not click 
> >>> links or open attachments unless you recognize the sender and know 
> >>> the content is safe.
> >>> 
> >>> 
> >>> 
> >>> I think the question isn't "Who ... is still using that?" but more 
> >>> "are we actually going to support it?" If we're on a version that old 
> >>> it would appear that we've basically abandoned it, although there do 
> >>> appear to have been refactoring (for other things) commits in the 
> >>> last couple of years. I would be in favor of removal from 5.0, but at 
> >>> the very least, could it be moved into a separate repo/package so 
> >>> that it's not pulling a relatively large dependency subtree from 
> >>> Hadoop into our main codebase?
> >>> 
> >>> Cheers,
> >>> 
> >>> Derek
> >>> 
> >>> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan 
> >>> mailto:stefan.mikloso...@netapp.com>> 
> >>> wrote:
> >>> Hi list,
> >>> 
> >>> I stumbled upon Hadoop package again. I think there was some 
> >>> discussion about the relevancy of Hadoop code some time ago but I 
> >>> would like to ask this again.
> >>> 
> >>> Do you think Hadoop code (1) is still relevant in 5.0? Who in the 
> >>> industry is still using that?
> >>> 
> >>> We might drop a lot of code and some Hadoop dependencies too (3) 
> >>> (even their scope is "provided"). The version of Hadoop we build upon 
> >>> is 1.0.3 which was released 10 years ago. This code does not have any 
> >>> tests nor documentation on the website.
> >>> 
> >>> There seems to be issues like this (2) and it seems like the solution 
> >>> is to, basically, use Spark Cassandra connector instead which I would 
> >>> say is quite reasonable.
> >>> 
> >>> Regards
> >>> 
> >>> (1) 
> >>> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
> >>> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
> 

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Abe Ratnofsky
Thanks for proposing this discussion Bowen. I see a few different issues here:

1. How do we safely handle corruption of a handful of tokens without taking an 
entire instance offline for re-bootstrap? This includes refusal to serve read 
requests for the corrupted token(s), and correct repair of the data.
2. How do we expose the corruption rate to operators, in a way that lets them 
decide whether a full disk replacement is worthwhile?
3. When CEP-21 lands it should become feasible to support ownership draining, 
which would let us migrate read traffic for a given token range away from an 
instance where that range is corrupted. Is it worth planning a fix for this 
issue before CEP-21 lands?

I'm also curious whether there's any existing literature on how different 
filesystems and storage media accommodate bit-errors (correctable and 
uncorrectable), so we can be consistent with those behaviors.

Personally, I'd like to see the fix for this issue come after CEP-21. It could 
be feasible to implement a fix before then, that detects bit-errors on the read 
path and refuses to respond to the coordinator, implicitly having speculative 
execution handle the retry against another replica while repair of that range 
happens. But that feels suboptimal to me when a better framework is on the 
horizon.
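
As a rough illustration of that pre-CEP-21 shape, a minimal Java sketch is below. 
None of these names exist in Cassandra; it only shows the control flow of "stay 
silent on corruption, let speculative execution retry another replica, and queue a 
repair of the affected range":

    import java.util.function.Consumer;

    final class CorruptionAwareReplicaRead
    {
        /** Signals a checksum/bit error hit during the local read (hypothetical). */
        static final class CorruptBlockException extends RuntimeException {}

        interface LocalReader { byte[] read(long token); }             // hypothetical
        interface RepairScheduler { void scheduleRange(long token); }  // hypothetical

        private final LocalReader reader;
        private final RepairScheduler repair;

        CorruptionAwareReplicaRead(LocalReader reader, RepairScheduler repair)
        {
            this.reader = reader;
            this.repair = repair;
        }

        /** Reply to the coordinator only on success; on corruption, send nothing so
         *  speculative execution retries against another replica, and schedule a
         *  background repair of the affected range. */
        void handle(long token, Consumer<byte[]> replyToCoordinator)
        {
            try
            {
                replyToCoordinator.accept(reader.read(token));
            }
            catch (CorruptBlockException e)
            {
                repair.scheduleRange(token);
            }
        }
    }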

--
Abe

> On Mar 9, 2023, at 8:23 AM, Bowen Song via dev  
> wrote:
> 
> Hi Jeremiah,
> 
> I'm fully aware of that, which is why I said that deleting the affected 
> SSTable files is "less safe".
> 
> If the "bad blocks" logic is implemented and the node abort the current read 
> query when hitting a bad block, it should remain safe, as the data in other 
> SSTable files will not be used. The streamed data should contain the 
> unexpired tombstones, and that's enough to keep the data consistent on the 
> node.
> 
> Cheers,
> Bowen
> 
> 
> 
> On 09/03/2023 15:58, Jeremiah D Jordan wrote:
>> It is actually more complicated than just removing the sstable and running 
>> repair.
>> 
>> In the face of expired tombstones that might be covering data in other 
>> sstables the only safe way to deal with a bad sstable is wipe the token 
>> range in the bad sstable and rebuild/bootstrap that range (or wipe/rebuild 
>> the whole node which is usually the easier way).  If there are expired 
>> tombstones in play, it means they could have already been compacted away on 
>> the other replicas, but may not have compacted away on the current replica, 
>> meaning the data they cover could still be present in other sstables on this 
>> node.  Removing the sstable will mean resurrecting that data.  And pulling 
>> the range from other nodes does not help because they can have already 
>> compacted away the tombstone, so you won’t get it back.
>> 
>> Tl;DR you can’t just remove the one sstable you have to remove all data in 
>> the token range covered by the sstable (aka all data that sstable may have 
>> had a tombstone covering).  Then you can stream from the other nodes to get 
>> the data back.
>> 
>> -Jeremiah
>> 
>>> On Mar 8, 2023, at 7:24 AM, Bowen Song via dev  
>>>  wrote:
>>> 
>>> At the moment, when a read error, such as unrecoverable bit error or data 
>>> corruption, occurs in the SSTable data files, regardless of the 
>>> disk_failure_policy configuration, manual (or to be precise, external) 
>>> intervention is required to recover from the error.
>>> 
>>> Commonly, there's two approach to recover from such error:
>>> 
>>> The safer, but slower recover strategy: replace the entire node.
>>> The less safe, but faster recover strategy: shut down the node, delete the 
>>> affected SSTable file(s), and then bring the node back online and run 
>>> repair.
>>> Based on my understanding of Cassandra, it should be possible to recover 
>>> from such error by marking the affected token range in the existing SSTable 
>>> as "corrupted" and stop reading from them (e.g. creating a "bad block" file 
>>> or in memory), and then streaming the affected token range from the healthy 
>>> replicas. The corrupted SSTable file can then be removed upon the next 
>>> successful compaction involving it, or alternatively an anti-compaction is 
>>> performed on it to remove the corrupted data.
>>> 
>>> The advantage of this strategy is:
>>> 
>>> Reduced node down time - node restart or replacement is not needed
>>> Less data streaming is required - only the affected token range
>>> Faster recovery time - less streaming and delayed compaction or 
>>> anti-compaction
>>> No less safe than replacing the entire node
>>> This process can be automated internally, removing the need for operator 
>>> inputs
>>> The disadvantage is added complexity on the SSTable read path and it may 
>>> mask disk failures from the operator who is not paying attention to it.
>>> 
>>> What do you think about this?
>>> 
>> 



Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Jeremy Hanna
+1 from me to deprecate in 4.x and remove in 5.0.

> On Mar 9, 2023, at 12:01 PM, J. D. Jordan  wrote:
> 
> +1 from me to deprecate in 4.x and remove in 5.0.
> 
> -Jeremiah
> 
>> On Mar 9, 2023, at 11:53 AM, Brandon Williams  wrote:
>> 
>> I think if we reach consensus here that decides it. I too vote to
>> deprecate in 4.1.x.  This means we would remove it in 5.0.
>> 
>> Kind Regards,
>> Brandon
>> 
>>> On Thu, Mar 9, 2023 at 11:32 AM Ekaterina Dimitrova
>>>  wrote:
>>> 
>>> Deprecation sounds good to me, but I am not completely sure in which 
>>> version we can do it. If it is possible to add a deprecation warning in the 
>>> 4.x series or at least 4.1.x - I vote for that.
>>> 
 On Thu, 9 Mar 2023 at 12:14, Jacek Lewandowski 
  wrote:
 
 Is it possible to deprecate it in the 4.1.x patch release? :)
 
 
 - - -- --- -  -
 Jacek Lewandowski
 
 
 czw., 9 mar 2023 o 18:11 Brandon Williams  napisał(a):
> 
> This is my feeling too, but I think we should accomplish this by
> deprecating it first.  I don't expect anything will change after the
> deprecation period.
> 
> Kind Regards,
> Brandon
> 
> On Thu, Mar 9, 2023 at 11:09 AM Jacek Lewandowski
>  wrote:
>> 
>> I vote for removing it entirely.
>> 
>> thanks
>> - - -- --- -  -
>> Jacek Lewandowski
>> 
>> 
>> czw., 9 mar 2023 o 18:07 Miklosovic, Stefan 
>>  napisał(a):
>>> 
>>> Derek,
>>> 
>>> I have couple more points ... I do not think that extracting it to a 
>>> separate repository is "win". That code is on Hadoop 1.0.3. We would be 
>>> spending a lot of work on extracting it just to extract 10 years old 
>>> code with occasional updates (in my humble opinion just to make it 
>>> compilable again if the code around changes). What good is in that? We 
>>> would have one more place to take care of ... Now we at least have it 
>>> all in one place.
>>> 
>>> I believe we have four options:
>>> 
>>> 1) leave it there so it will be like this is for next years with 
>>> questionable and diminishing usage
>>> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
>>> 3) 2) and extract it to a separate repository but if we do 2) we can 
>>> just leave it there
>>> 4) remove it
>>> 
>>> 
>>> From: Derek Chen-Becker 
>>> Sent: Thursday, March 9, 2023 15:55
>>> To: dev@cassandra.apache.org
>>> Subject: Re: Role of Hadoop code in Cassandra 5.0
>>> 
>>> NetApp Security WARNING: This is an external email. Do not click links 
>>> or open attachments unless you recognize the sender and know the 
>>> content is safe.
>>> 
>>> 
>>> 
>>> I think the question isn't "Who ... is still using that?" but more "are 
>>> we actually going to support it?" If we're on a version that old it 
>>> would appear that we've basically abandoned it, although there do 
>>> appear to have been refactoring (for other things) commits in the last 
>>> couple of years. I would be in favor of removal from 5.0, but at the 
>>> very least, could it be moved into a separate repo/package so that it's 
>>> not pulling a relatively large dependency subtree from Hadoop into our 
>>> main codebase?
>>> 
>>> Cheers,
>>> 
>>> Derek
>>> 
>>> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan 
>>> mailto:stefan.mikloso...@netapp.com>> 
>>> wrote:
>>> Hi list,
>>> 
>>> I stumbled upon Hadoop package again. I think there was some discussion 
>>> about the relevancy of Hadoop code some time ago but I would like to 
>>> ask this again.
>>> 
>>> Do you think Hadoop code (1) is still relevant in 5.0? Who in the 
>>> industry is still using that?
>>> 
>>> We might drop a lot of code and some Hadoop dependencies too (3) (even 
>>> their scope is "provided"). The version of Hadoop we build upon is 
>>> 1.0.3 which was released 10 years ago. This code does not have any 
>>> tests nor documentation on the website.
>>> 
>>> There seems to be issues like this (2) and it seems like the solution 
>>> is to, basically, use Spark Cassandra connector instead which I would 
>>> say is quite reasonable.
>>> 
>>> Regards
>>> 
>>> (1) 
>>> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
>>> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
>>> (3) 
>>> https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589
>>> 
>>> 
>>> --
>>> +---+
>>> | Derek Chen-Becker |
>>> | GPG Key available at 
>>> 

Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread J. D. Jordan
+1 from me to deprecate in 4.x and remove in 5.0.

-Jeremiah

> On Mar 9, 2023, at 11:53 AM, Brandon Williams  wrote:
> 
> I think if we reach consensus here that decides it. I too vote to
> deprecate in 4.1.x.  This means we would remove it in 5.0.
> 
> Kind Regards,
> Brandon
> 
>> On Thu, Mar 9, 2023 at 11:32 AM Ekaterina Dimitrova
>>  wrote:
>> 
>> Deprecation sounds good to me, but I am not completely sure in which version 
>> we can do it. If it is possible to add a deprecation warning in the 4.x 
>> series or at least 4.1.x - I vote for that.
>> 
>>> On Thu, 9 Mar 2023 at 12:14, Jacek Lewandowski 
>>>  wrote:
>>> 
>>> Is it possible to deprecate it in the 4.1.x patch release? :)
>>> 
>>> 
>>> - - -- --- -  -
>>> Jacek Lewandowski
>>> 
>>> 
>>> czw., 9 mar 2023 o 18:11 Brandon Williams  napisał(a):
 
 This is my feeling too, but I think we should accomplish this by
 deprecating it first.  I don't expect anything will change after the
 deprecation period.
 
 Kind Regards,
 Brandon
 
 On Thu, Mar 9, 2023 at 11:09 AM Jacek Lewandowski
  wrote:
> 
> I vote for removing it entirely.
> 
> thanks
> - - -- --- -  -
> Jacek Lewandowski
> 
> 
> czw., 9 mar 2023 o 18:07 Miklosovic, Stefan 
>  napisał(a):
>> 
>> Derek,
>> 
>> I have couple more points ... I do not think that extracting it to a 
>> separate repository is "win". That code is on Hadoop 1.0.3. We would be 
>> spending a lot of work on extracting it just to extract 10 years old 
>> code with occasional updates (in my humble opinion just to make it 
>> compilable again if the code around changes). What good is in that? We 
>> would have one more place to take care of ... Now we at least have it 
>> all in one place.
>> 
>> I believe we have four options:
>> 
>> 1) leave it there so it will be like this is for next years with 
>> questionable and diminishing usage
>> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
>> 3) 2) and extract it to a separate repository but if we do 2) we can 
>> just leave it there
>> 4) remove it
>> 
>> 
>> From: Derek Chen-Becker 
>> Sent: Thursday, March 9, 2023 15:55
>> To: dev@cassandra.apache.org
>> Subject: Re: Role of Hadoop code in Cassandra 5.0
>> 
>> NetApp Security WARNING: This is an external email. Do not click links 
>> or open attachments unless you recognize the sender and know the content 
>> is safe.
>> 
>> 
>> 
>> I think the question isn't "Who ... is still using that?" but more "are 
>> we actually going to support it?" If we're on a version that old it 
>> would appear that we've basically abandoned it, although there do appear 
>> to have been refactoring (for other things) commits in the last couple 
>> of years. I would be in favor of removal from 5.0, but at the very 
>> least, could it be moved into a separate repo/package so that it's not 
>> pulling a relatively large dependency subtree from Hadoop into our main 
>> codebase?
>> 
>> Cheers,
>> 
>> Derek
>> 
>> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan 
>> mailto:stefan.mikloso...@netapp.com>> 
>> wrote:
>> Hi list,
>> 
>> I stumbled upon Hadoop package again. I think there was some discussion 
>> about the relevancy of Hadoop code some time ago but I would like to ask 
>> this again.
>> 
>> Do you think Hadoop code (1) is still relevant in 5.0? Who in the 
>> industry is still using that?
>> 
>> We might drop a lot of code and some Hadoop dependencies too (3) (even 
>> their scope is "provided"). The version of Hadoop we build upon is 1.0.3 
>> which was released 10 years ago. This code does not have any tests nor 
>> documentation on the website.
>> 
>> There seems to be issues like this (2) and it seems like the solution is 
>> to, basically, use Spark Cassandra connector instead which I would say 
>> is quite reasonable.
>> 
>> Regards
>> 
>> (1) 
>> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
>> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
>> (3) 
>> https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589
>> 
>> 
>> --
>> +---+
>> | Derek Chen-Becker |
>> | GPG Key available at https://keybase.io/dchenbecker and   |
>> | 
>> 

Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Brandon Williams
I think if we reach consensus here that decides it. I too vote to
deprecate in 4.1.x.  This means we would remove it in 5.0.

Kind Regards,
Brandon

On Thu, Mar 9, 2023 at 11:32 AM Ekaterina Dimitrova
 wrote:
>
> Deprecation sounds good to me, but I am not completely sure in which version 
> we can do it. If it is possible to add a deprecation warning in the 4.x 
> series or at least 4.1.x - I vote for that.
>
> On Thu, 9 Mar 2023 at 12:14, Jacek Lewandowski  
> wrote:
>>
>> Is it possible to deprecate it in the 4.1.x patch release? :)
>>
>>
>> - - -- --- -  -
>> Jacek Lewandowski
>>
>>
>> czw., 9 mar 2023 o 18:11 Brandon Williams  napisał(a):
>>>
>>> This is my feeling too, but I think we should accomplish this by
>>> deprecating it first.  I don't expect anything will change after the
>>> deprecation period.
>>>
>>> Kind Regards,
>>> Brandon
>>>
>>> On Thu, Mar 9, 2023 at 11:09 AM Jacek Lewandowski
>>>  wrote:
>>> >
>>> > I vote for removing it entirely.
>>> >
>>> > thanks
>>> > - - -- --- -  -
>>> > Jacek Lewandowski
>>> >
>>> >
>>> > czw., 9 mar 2023 o 18:07 Miklosovic, Stefan 
>>> >  napisał(a):
>>> >>
>>> >> Derek,
>>> >>
>>> >> I have couple more points ... I do not think that extracting it to a 
>>> >> separate repository is "win". That code is on Hadoop 1.0.3. We would be 
>>> >> spending a lot of work on extracting it just to extract 10 years old 
>>> >> code with occasional updates (in my humble opinion just to make it 
>>> >> compilable again if the code around changes). What good is in that? We 
>>> >> would have one more place to take care of ... Now we at least have it 
>>> >> all in one place.
>>> >>
>>> >> I believe we have four options:
>>> >>
>>> >> 1) leave it there so it will be like this is for next years with 
>>> >> questionable and diminishing usage
>>> >> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
>>> >> 3) 2) and extract it to a separate repository but if we do 2) we can 
>>> >> just leave it there
>>> >> 4) remove it
>>> >>
>>> >> 
>>> >> From: Derek Chen-Becker 
>>> >> Sent: Thursday, March 9, 2023 15:55
>>> >> To: dev@cassandra.apache.org
>>> >> Subject: Re: Role of Hadoop code in Cassandra 5.0
>>> >>
>>> >> NetApp Security WARNING: This is an external email. Do not click links 
>>> >> or open attachments unless you recognize the sender and know the content 
>>> >> is safe.
>>> >>
>>> >>
>>> >>
>>> >> I think the question isn't "Who ... is still using that?" but more "are 
>>> >> we actually going to support it?" If we're on a version that old it 
>>> >> would appear that we've basically abandoned it, although there do appear 
>>> >> to have been refactoring (for other things) commits in the last couple 
>>> >> of years. I would be in favor of removal from 5.0, but at the very 
>>> >> least, could it be moved into a separate repo/package so that it's not 
>>> >> pulling a relatively large dependency subtree from Hadoop into our main 
>>> >> codebase?
>>> >>
>>> >> Cheers,
>>> >>
>>> >> Derek
>>> >>
>>> >> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan 
>>> >> mailto:stefan.mikloso...@netapp.com>> 
>>> >> wrote:
>>> >> Hi list,
>>> >>
>>> >> I stumbled upon Hadoop package again. I think there was some discussion 
>>> >> about the relevancy of Hadoop code some time ago but I would like to ask 
>>> >> this again.
>>> >>
>>> >> Do you think Hadoop code (1) is still relevant in 5.0? Who in the 
>>> >> industry is still using that?
>>> >>
>>> >> We might drop a lot of code and some Hadoop dependencies too (3) (even 
>>> >> their scope is "provided"). The version of Hadoop we build upon is 1.0.3 
>>> >> which was released 10 years ago. This code does not have any tests nor 
>>> >> documentation on the website.
>>> >>
>>> >> There seems to be issues like this (2) and it seems like the solution is 
>>> >> to, basically, use Spark Cassandra connector instead which I would say 
>>> >> is quite reasonable.
>>> >>
>>> >> Regards
>>> >>
>>> >> (1) 
>>> >> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
>>> >> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
>>> >> (3) 
>>> >> https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589
>>> >>
>>> >>
>>> >> --
>>> >> +---+
>>> >> | Derek Chen-Becker |
>>> >> | GPG Key available at https://keybase.io/dchenbecker and   |
>>> >> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>>> >> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>>> >> +---+
>>> >>


Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Miklosovic, Stefan
Deprecation would mean that the code has to stay there for the whole of 5.0 so we 
can remove it for real in 6.0?


From: Ekaterina Dimitrova 
Sent: Thursday, March 9, 2023 18:32
To: dev@cassandra.apache.org
Subject: Re: Role of Hadoop code in Cassandra 5.0

NetApp Security WARNING: This is an external email. Do not click links or open 
attachments unless you recognize the sender and know the content is safe.



Deprecation sounds good to me, but I am not completely sure in which version we 
can do it. If it is possible to add a deprecation warning in the 4.x series or 
at least 4.1.x - I vote for that.

On Thu, 9 Mar 2023 at 12:14, Jacek Lewandowski 
mailto:lewandowski.ja...@gmail.com>> wrote:
Is it possible to deprecate it in the 4.1.x patch release? :)


- - -- --- -  -
Jacek Lewandowski


czw., 9 mar 2023 o 18:11 Brandon Williams 
mailto:dri...@gmail.com>> napisał(a):
This is my feeling too, but I think we should accomplish this by
deprecating it first.  I don't expect anything will change after the
deprecation period.

Kind Regards,
Brandon

On Thu, Mar 9, 2023 at 11:09 AM Jacek Lewandowski
mailto:lewandowski.ja...@gmail.com>> wrote:
>
> I vote for removing it entirely.
>
> thanks
> - - -- --- -  -
> Jacek Lewandowski
>
>
> czw., 9 mar 2023 o 18:07 Miklosovic, Stefan 
> mailto:stefan.mikloso...@netapp.com>> 
> napisał(a):
>>
>> Derek,
>>
>> I have couple more points ... I do not think that extracting it to a 
>> separate repository is "win". That code is on Hadoop 1.0.3. We would be 
>> spending a lot of work on extracting it just to extract 10 years old code 
>> with occasional updates (in my humble opinion just to make it compilable 
>> again if the code around changes). What good is in that? We would have one 
>> more place to take care of ... Now we at least have it all in one place.
>>
>> I believe we have four options:
>>
>> 1) leave it there so it will be like this is for next years with 
>> questionable and diminishing usage
>> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
>> 3) 2) and extract it to a separate repository but if we do 2) we can just 
>> leave it there
>> 4) remove it
>>
>> 
>> From: Derek Chen-Becker mailto:de...@chen-becker.org>>
>> Sent: Thursday, March 9, 2023 15:55
>> To: dev@cassandra.apache.org
>> Subject: Re: Role of Hadoop code in Cassandra 5.0
>>
>> NetApp Security WARNING: This is an external email. Do not click links or 
>> open attachments unless you recognize the sender and know the content is 
>> safe.
>>
>>
>>
>> I think the question isn't "Who ... is still using that?" but more "are we 
>> actually going to support it?" If we're on a version that old it would 
>> appear that we've basically abandoned it, although there do appear to have 
>> been refactoring (for other things) commits in the last couple of years. I 
>> would be in favor of removal from 5.0, but at the very least, could it be 
>> moved into a separate repo/package so that it's not pulling a relatively 
>> large dependency subtree from Hadoop into our main codebase?
>>
>> Cheers,
>>
>> Derek
>>
>> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan 
>> mailto:stefan.mikloso...@netapp.com>>>
>>  wrote:
>> Hi list,
>>
>> I stumbled upon Hadoop package again. I think there was some discussion 
>> about the relevancy of Hadoop code some time ago but I would like to ask 
>> this again.
>>
>> Do you think Hadoop code (1) is still relevant in 5.0? Who in the industry 
>> is still using that?
>>
>> We might drop a lot of code and some Hadoop dependencies too (3) (even their 
>> scope is "provided"). The version of Hadoop we build upon is 1.0.3 which was 
>> released 10 years ago. This code does not have any tests nor documentation 
>> on the website.
>>
>> There seems to be issues like this (2) and it seems like the solution is to, 
>> basically, use Spark Cassandra connector instead which I would say is quite 
>> reasonable.
>>
>> Regards
>>
>> (1) 
>> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
>> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
>> (3) 
>> https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589
>>
>>
>> --
>> +---+
>> | Derek Chen-Becker |
>> | GPG Key available at https://keybase.io/dchenbecker and   |
>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>> +---+
>>


Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Ekaterina Dimitrova
Deprecation sounds good to me, but I am not completely sure in which
version we can do it. If it is possible to add a deprecation warning in the
4.x series or at least 4.1.x - I vote for that.

On Thu, 9 Mar 2023 at 12:14, Jacek Lewandowski 
wrote:

> Is it possible to deprecate it in the 4.1.x patch release? :)
>
>
> - - -- --- -  -
> Jacek Lewandowski
>
>
> czw., 9 mar 2023 o 18:11 Brandon Williams  napisał(a):
>
>> This is my feeling too, but I think we should accomplish this by
>> deprecating it first.  I don't expect anything will change after the
>> deprecation period.
>>
>> Kind Regards,
>> Brandon
>>
>> On Thu, Mar 9, 2023 at 11:09 AM Jacek Lewandowski
>>  wrote:
>> >
>> > I vote for removing it entirely.
>> >
>> > thanks
>> > - - -- --- -  -
>> > Jacek Lewandowski
>> >
>> >
>> > czw., 9 mar 2023 o 18:07 Miklosovic, Stefan <
>> stefan.mikloso...@netapp.com> napisał(a):
>> >>
>> >> Derek,
>> >>
>> >> I have couple more points ... I do not think that extracting it to a
>> separate repository is "win". That code is on Hadoop 1.0.3. We would be
>> spending a lot of work on extracting it just to extract 10 years old code
>> with occasional updates (in my humble opinion just to make it compilable
>> again if the code around changes). What good is in that? We would have one
>> more place to take care of ... Now we at least have it all in one place.
>> >>
>> >> I believe we have four options:
>> >>
>> >> 1) leave it there so it will be like this is for next years with
>> questionable and diminishing usage
>> >> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
>> >> 3) 2) and extract it to a separate repository but if we do 2) we can
>> just leave it there
>> >> 4) remove it
>> >>
>> >> 
>> >> From: Derek Chen-Becker 
>> >> Sent: Thursday, March 9, 2023 15:55
>> >> To: dev@cassandra.apache.org
>> >> Subject: Re: Role of Hadoop code in Cassandra 5.0
>> >>
>> >> NetApp Security WARNING: This is an external email. Do not click links
>> or open attachments unless you recognize the sender and know the content is
>> safe.
>> >>
>> >>
>> >>
>> >> I think the question isn't "Who ... is still using that?" but more
>> "are we actually going to support it?" If we're on a version that old it
>> would appear that we've basically abandoned it, although there do appear to
>> have been refactoring (for other things) commits in the last couple of
>> years. I would be in favor of removal from 5.0, but at the very least,
>> could it be moved into a separate repo/package so that it's not pulling a
>> relatively large dependency subtree from Hadoop into our main codebase?
>> >>
>> >> Cheers,
>> >>
>> >> Derek
>> >>
>> >> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan <
>> stefan.mikloso...@netapp.com> wrote:
>> >> Hi list,
>> >>
>> >> I stumbled upon Hadoop package again. I think there was some
>> discussion about the relevancy of Hadoop code some time ago but I would
>> like to ask this again.
>> >>
>> >> Do you think Hadoop code (1) is still relevant in 5.0? Who in the
>> industry is still using that?
>> >>
>> >> We might drop a lot of code and some Hadoop dependencies too (3) (even
>> their scope is "provided"). The version of Hadoop we build upon is 1.0.3
>> which was released 10 years ago. This code does not have any tests nor
>> documentation on the website.
>> >>
>> >> There seems to be issues like this (2) and it seems like the solution
>> is to, basically, use Spark Cassandra connector instead which I would say
>> is quite reasonable.
>> >>
>> >> Regards
>> >>
>> >> (1)
>> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
>> >> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
>> >> (3)
>> https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589
>> >>
>> >>
>> >> --
>> >> +---+
>> >> | Derek Chen-Becker |
>> >> | GPG Key available at https://keybase.io/dchenbecker and   |
>> >> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>> >> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>> >> +---+
>> >>
>>
>


Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Jacek Lewandowski
Is it possible to deprecate it in the 4.1.x patch release? :)


- - -- --- -  -
Jacek Lewandowski


czw., 9 mar 2023 o 18:11 Brandon Williams  napisał(a):

> This is my feeling too, but I think we should accomplish this by
> deprecating it first.  I don't expect anything will change after the
> deprecation period.
>
> Kind Regards,
> Brandon
>
> On Thu, Mar 9, 2023 at 11:09 AM Jacek Lewandowski
>  wrote:
> >
> > I vote for removing it entirely.
> >
> > thanks
> > - - -- --- -  -
> > Jacek Lewandowski
> >
> >
> > czw., 9 mar 2023 o 18:07 Miklosovic, Stefan <
> stefan.mikloso...@netapp.com> napisał(a):
> >>
> >> Derek,
> >>
> >> I have couple more points ... I do not think that extracting it to a
> separate repository is "win". That code is on Hadoop 1.0.3. We would be
> spending a lot of work on extracting it just to extract 10 years old code
> with occasional updates (in my humble opinion just to make it compilable
> again if the code around changes). What good is in that? We would have one
> more place to take care of ... Now we at least have it all in one place.
> >>
> >> I believe we have four options:
> >>
> >> 1) leave it there so it will be like this is for next years with
> questionable and diminishing usage
> >> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
> >> 3) 2) and extract it to a separate repository but if we do 2) we can
> just leave it there
> >> 4) remove it
> >>
> >> 
> >> From: Derek Chen-Becker 
> >> Sent: Thursday, March 9, 2023 15:55
> >> To: dev@cassandra.apache.org
> >> Subject: Re: Role of Hadoop code in Cassandra 5.0
> >>
> >> NetApp Security WARNING: This is an external email. Do not click links
> or open attachments unless you recognize the sender and know the content is
> safe.
> >>
> >>
> >>
> >> I think the question isn't "Who ... is still using that?" but more "are
> we actually going to support it?" If we're on a version that old it would
> appear that we've basically abandoned it, although there do appear to have
> been refactoring (for other things) commits in the last couple of years. I
> would be in favor of removal from 5.0, but at the very least, could it be
> moved into a separate repo/package so that it's not pulling a relatively
> large dependency subtree from Hadoop into our main codebase?
> >>
> >> Cheers,
> >>
> >> Derek
> >>
> >> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan <
> stefan.mikloso...@netapp.com> wrote:
> >> Hi list,
> >>
> >> I stumbled upon Hadoop package again. I think there was some discussion
> about the relevancy of Hadoop code some time ago but I would like to ask
> this again.
> >>
> >> Do you think Hadoop code (1) is still relevant in 5.0? Who in the
> industry is still using that?
> >>
> >> We might drop a lot of code and some Hadoop dependencies too (3) (even
> their scope is "provided"). The version of Hadoop we build upon is 1.0.3
> which was released 10 years ago. This code does not have any tests nor
> documentation on the website.
> >>
> >> There seems to be issues like this (2) and it seems like the solution
> is to, basically, use Spark Cassandra connector instead which I would say
> is quite reasonable.
> >>
> >> Regards
> >>
> >> (1)
> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
> >> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
> >> (3)
> https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589
> >>
> >>
> >> --
> >> +---+
> >> | Derek Chen-Becker |
> >> | GPG Key available at https://keybase.io/dchenbecker and   |
> >> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> >> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> >> +---+
> >>
>


Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Jacek Lewandowski
... because - why Hadoop? This is something that could be made a separate
project if there is a need for it, just like the Spark Cassandra
Connector. Why do we need to include Hadoop-specific classes but nothing
specific for other frameworks?

- - -- --- -  -
Jacek Lewandowski


czw., 9 mar 2023 o 18:08 Jacek Lewandowski 
napisał(a):

> I vote for removing it entirely.
>
> thanks
> - - -- --- -  -
> Jacek Lewandowski
>
>
> czw., 9 mar 2023 o 18:07 Miklosovic, Stefan 
> napisał(a):
>
>> Derek,
>>
>> I have couple more points ... I do not think that extracting it to a
>> separate repository is "win". That code is on Hadoop 1.0.3. We would be
>> spending a lot of work on extracting it just to extract 10 years old code
>> with occasional updates (in my humble opinion just to make it compilable
>> again if the code around changes). What good is in that? We would have one
>> more place to take care of ... Now we at least have it all in one place.
>>
>> I believe we have four options:
>>
>> 1) leave it there so it will be like this is for next years with
>> questionable and diminishing usage
>> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
>> 3) 2) and extract it to a separate repository but if we do 2) we can just
>> leave it there
>> 4) remove it
>>
>> 
>> From: Derek Chen-Becker 
>> Sent: Thursday, March 9, 2023 15:55
>> To: dev@cassandra.apache.org
>> Subject: Re: Role of Hadoop code in Cassandra 5.0
>>
>> NetApp Security WARNING: This is an external email. Do not click links or
>> open attachments unless you recognize the sender and know the content is
>> safe.
>>
>>
>>
>> I think the question isn't "Who ... is still using that?" but more "are
>> we actually going to support it?" If we're on a version that old it would
>> appear that we've basically abandoned it, although there do appear to have
>> been refactoring (for other things) commits in the last couple of years. I
>> would be in favor of removal from 5.0, but at the very least, could it be
>> moved into a separate repo/package so that it's not pulling a relatively
>> large dependency subtree from Hadoop into our main codebase?
>>
>> Cheers,
>>
>> Derek
>>
>> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan <
>> stefan.mikloso...@netapp.com> wrote:
>> Hi list,
>>
>> I stumbled upon Hadoop package again. I think there was some discussion
>> about the relevancy of Hadoop code some time ago but I would like to ask
>> this again.
>>
>> Do you think Hadoop code (1) is still relevant in 5.0? Who in the
>> industry is still using that?
>>
>> We might drop a lot of code and some Hadoop dependencies too (3) (even
>> their scope is "provided"). The version of Hadoop we build upon is 1.0.3
>> which was released 10 years ago. This code does not have any tests nor
>> documentation on the website.
>>
>> There seems to be issues like this (2) and it seems like the solution is
>> to, basically, use Spark Cassandra connector instead which I would say is
>> quite reasonable.
>>
>> Regards
>>
>> (1)
>> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
>> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
>> (3)
>> https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589
>>
>>
>> --
>> +---+
>> | Derek Chen-Becker |
>> | GPG Key available at https://keybase.io/dchenbecker and   |
>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>> +---+
>>
>>


Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Brandon Williams
This is my feeling too, but I think we should accomplish this by
deprecating it first.  I don't expect anything will change after the
deprecation period.

Kind Regards,
Brandon

On Thu, Mar 9, 2023 at 11:09 AM Jacek Lewandowski
 wrote:
>
> I vote for removing it entirely.
>
> thanks
> - - -- --- -  -
> Jacek Lewandowski
>
>
> czw., 9 mar 2023 o 18:07 Miklosovic, Stefan  
> napisał(a):
>>
>> Derek,
>>
>> I have couple more points ... I do not think that extracting it to a 
>> separate repository is "win". That code is on Hadoop 1.0.3. We would be 
>> spending a lot of work on extracting it just to extract 10 years old code 
>> with occasional updates (in my humble opinion just to make it compilable 
>> again if the code around changes). What good is in that? We would have one 
>> more place to take care of ... Now we at least have it all in one place.
>>
>> I believe we have four options:
>>
>> 1) leave it there so it will be like this is for next years with 
>> questionable and diminishing usage
>> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
>> 3) 2) and extract it to a separate repository but if we do 2) we can just 
>> leave it there
>> 4) remove it
>>
>> 
>> From: Derek Chen-Becker 
>> Sent: Thursday, March 9, 2023 15:55
>> To: dev@cassandra.apache.org
>> Subject: Re: Role of Hadoop code in Cassandra 5.0
>>
>> NetApp Security WARNING: This is an external email. Do not click links or 
>> open attachments unless you recognize the sender and know the content is 
>> safe.
>>
>>
>>
>> I think the question isn't "Who ... is still using that?" but more "are we 
>> actually going to support it?" If we're on a version that old it would 
>> appear that we've basically abandoned it, although there do appear to have 
>> been refactoring (for other things) commits in the last couple of years. I 
>> would be in favor of removal from 5.0, but at the very least, could it be 
>> moved into a separate repo/package so that it's not pulling a relatively 
>> large dependency subtree from Hadoop into our main codebase?
>>
>> Cheers,
>>
>> Derek
>>
>> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan 
>> mailto:stefan.mikloso...@netapp.com>> wrote:
>> Hi list,
>>
>> I stumbled upon Hadoop package again. I think there was some discussion 
>> about the relevancy of Hadoop code some time ago but I would like to ask 
>> this again.
>>
>> Do you think Hadoop code (1) is still relevant in 5.0? Who in the industry 
>> is still using that?
>>
>> We might drop a lot of code and some Hadoop dependencies too (3) (even their 
>> scope is "provided"). The version of Hadoop we build upon is 1.0.3 which was 
>> released 10 years ago. This code does not have any tests nor documentation 
>> on the website.
>>
>> There seems to be issues like this (2) and it seems like the solution is to, 
>> basically, use Spark Cassandra connector instead which I would say is quite 
>> reasonable.
>>
>> Regards
>>
>> (1) 
>> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
>> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
>> (3) 
>> https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589
>>
>>
>> --
>> +---+
>> | Derek Chen-Becker |
>> | GPG Key available at https://keybase.io/dchenbecker and   |
>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>> +---+
>>


Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Jacek Lewandowski
I vote for removing it entirely.

thanks
- - -- --- -  -
Jacek Lewandowski


czw., 9 mar 2023 o 18:07 Miklosovic, Stefan 
napisał(a):

> Derek,
>
> I have couple more points ... I do not think that extracting it to a
> separate repository is "win". That code is on Hadoop 1.0.3. We would be
> spending a lot of work on extracting it just to extract 10 years old code
> with occasional updates (in my humble opinion just to make it compilable
> again if the code around changes). What good is in that? We would have one
> more place to take care of ... Now we at least have it all in one place.
>
> I believe we have four options:
>
> 1) leave it there so it will be like this is for next years with
> questionable and diminishing usage
> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
> 3) 2) and extract it to a separate repository but if we do 2) we can just
> leave it there
> 4) remove it
>
> 
> From: Derek Chen-Becker 
> Sent: Thursday, March 9, 2023 15:55
> To: dev@cassandra.apache.org
> Subject: Re: Role of Hadoop code in Cassandra 5.0
>
> NetApp Security WARNING: This is an external email. Do not click links or
> open attachments unless you recognize the sender and know the content is
> safe.
>
>
>
> I think the question isn't "Who ... is still using that?" but more "are we
> actually going to support it?" If we're on a version that old it would
> appear that we've basically abandoned it, although there do appear to have
> been refactoring (for other things) commits in the last couple of years. I
> would be in favor of removal from 5.0, but at the very least, could it be
> moved into a separate repo/package so that it's not pulling a relatively
> large dependency subtree from Hadoop into our main codebase?
>
> Cheers,
>
> Derek
>
> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan <
> stefan.mikloso...@netapp.com> wrote:
> Hi list,
>
> I stumbled upon Hadoop package again. I think there was some discussion
> about the relevancy of Hadoop code some time ago but I would like to ask
> this again.
>
> Do you think Hadoop code (1) is still relevant in 5.0? Who in the industry
> is still using that?
>
> We might drop a lot of code and some Hadoop dependencies too (3) (even
> their scope is "provided"). The version of Hadoop we build upon is 1.0.3
> which was released 10 years ago. This code does not have any tests nor
> documentation on the website.
>
> There seems to be issues like this (2) and it seems like the solution is
> to, basically, use Spark Cassandra connector instead which I would say is
> quite reasonable.
>
> Regards
>
> (1)
> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
> (3)
> https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589
>
>
> --
> +---+
> | Derek Chen-Becker |
> | GPG Key available at https://keybase.io/dchenbecker and   |
> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> +---+
>
>


Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Miklosovic, Stefan
Derek,

I have a couple more points ... I do not think that extracting it to a separate 
repository is a "win". That code is on Hadoop 1.0.3. We would be spending a lot 
of work on extracting it, just to extract 10-year-old code with occasional 
updates (in my humble opinion, just to make it compilable again if the code 
around it changes). What good is there in that? We would have one more place to 
take care of ... Now we at least have it all in one place.

I believe we have four options:

1) leave it there, so it stays like this for the next years with questionable 
and diminishing usage
2) update it to Hadoop 3.3 (I wonder who is going to do that)
3) do 2) and also extract it to a separate repository, although if we do 2) we 
can just leave it where it is
4) remove it


From: Derek Chen-Becker 
Sent: Thursday, March 9, 2023 15:55
To: dev@cassandra.apache.org
Subject: Re: Role of Hadoop code in Cassandra 5.0

NetApp Security WARNING: This is an external email. Do not click links or open 
attachments unless you recognize the sender and know the content is safe.



I think the question isn't "Who ... is still using that?" but more "are we 
actually going to support it?" If we're on a version that old it would appear 
that we've basically abandoned it, although there do appear to have been 
refactoring (for other things) commits in the last couple of years. I would be 
in favor of removal from 5.0, but at the very least, could it be moved into a 
separate repo/package so that it's not pulling a relatively large dependency 
subtree from Hadoop into our main codebase?

Cheers,

Derek

On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan 
mailto:stefan.mikloso...@netapp.com>> wrote:
Hi list,

I stumbled upon Hadoop package again. I think there was some discussion about 
the relevancy of Hadoop code some time ago but I would like to ask this again.

Do you think Hadoop code (1) is still relevant in 5.0? Who in the industry is 
still using that?

We might drop a lot of code and some Hadoop dependencies too (3) (even their 
scope is "provided"). The version of Hadoop we build upon is 1.0.3 which was 
released 10 years ago. This code does not have any tests nor documentation on 
the website.

There seems to be issues like this (2) and it seems like the solution is to, 
basically, use Spark Cassandra connector instead which I would say is quite 
reasonable.

Regards

(1) 
https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
(2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
(3) 
https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589


--
+---+
| Derek Chen-Becker |
| GPG Key available at https://keybase.io/dchenbecker and   |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---+



Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Bowen Song via dev

Hi Jeremiah,

I'm fully aware of that, which is why I said that deleting the affected 
SSTable files is "less safe".


If the "bad blocks" logic is implemented and the node abort the current 
read query when hitting a bad block, it should remain safe, as the data 
in other SSTable files will not be used. The streamed data should 
contain the unexpired tombstones, and that's enough to keep the data 
consistent on the node.
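
For illustration only, that check could look roughly like the Java sketch below 
(all names are invented; this is not existing Cassandra code). A read that 
touches a token recorded as corrupted is aborted outright instead of being 
answered from the remaining SSTables:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    final class BadBlockGuard
    {
        // sstable identifier -> corrupted token ranges, keyed by range start,
        // with the value being the (inclusive) range end
        private final Map<String, NavigableMap<Long, Long>> badRanges = new HashMap<>();

        void markCorrupted(String sstable, long startToken, long endToken)
        {
            badRanges.computeIfAbsent(sstable, s -> new TreeMap<>()).put(startToken, endToken);
        }

        /** Abort (throw) instead of answering from the remaining sstables when a bad block is hit. */
        void checkReadable(String sstable, long token)
        {
            NavigableMap<Long, Long> ranges = badRanges.getOrDefault(sstable, Collections.emptyNavigableMap());
            Map.Entry<Long, Long> range = ranges.floorEntry(token);
            if (range != null && token <= range.getValue())
                throw new IllegalStateException("Aborting read: token " + token
                                                + " falls in a corrupted range of " + sstable);
        }
    }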


Cheers,
Bowen


On 09/03/2023 15:58, Jeremiah D Jordan wrote:
It is actually more complicated than just removing the sstable and 
running repair.


In the face of expired tombstones that might be covering data in other 
sstables the only safe way to deal with a bad sstable is wipe the 
token range in the bad sstable and rebuild/bootstrap that range (or 
wipe/rebuild the whole node which is usually the easier way).  If 
there are expired tombstones in play, it means they could have already 
been compacted away on the other replicas, but may not have compacted 
away on the current replica, meaning the data they cover could still 
be present in other sstables on this node.  Removing the sstable will 
mean resurrecting that data.  And pulling the range from other nodes 
does not help because they can have already compacted away the 
tombstone, so you won’t get it back.


Tl;DR you can’t just remove the one sstable you have to remove all 
data in the token range covered by the sstable (aka all data that 
sstable may have had a tombstone covering).  Then you can stream from 
the other nodes to get the data back.


-Jeremiah

On Mar 8, 2023, at 7:24 AM, Bowen Song via dev 
 wrote:


At the moment, when a read error, such as unrecoverable bit error or 
data corruption, occurs in the SSTable data files, regardless of the 
disk_failure_policy configuration, manual (or to be precise, 
external) intervention is required to recover from the error.


Commonly, there's two approach to recover from such error:

 1. The safer, but slower recover strategy: replace the entire node.
 2. The less safe, but faster recover strategy: shut down the node,
delete the affected SSTable file(s), and then bring the node back
online and run repair.

Based on my understanding of Cassandra, it should be possible to 
recover from such error by marking the affected token range in the 
existing SSTable as "corrupted" and stop reading from them (e.g. 
creating a "bad block" file or in memory), and then streaming the 
affected token range from the healthy replicas. The corrupted SSTable 
file can then be removed upon the next successful compaction 
involving it, or alternatively an anti-compaction is performed on it 
to remove the corrupted data.


The advantage of this strategy is:

  * Reduced node down time - node restart or replacement is not needed
  * Less data streaming is required - only the affected token range
  * Faster recovery time - less streaming and delayed compaction or
anti-compaction
  * No less safe than replacing the entire node
  * This process can be automated internally, removing the need for
operator inputs

The disadvantage is added complexity on the SSTable read path and it 
may mask disk failures from the operator who is not paying attention 
to it.


What do you think about this?
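
For illustration only, the flow described above might be wired together roughly as 
in this Java sketch; BadBlockRegistry, RangeStreamer and all method names are 
invented for the example and are not Cassandra classes:

    final class CorruptRangeRecovery
    {
        interface BadBlockRegistry
        {
            void markCorrupted(String sstable, long startToken, long endToken); // reads in this range now abort
            void forget(String sstable);                                        // entry no longer needed
        }

        interface RangeStreamer
        {
            void streamFromHealthyReplicas(long startToken, long endToken);     // re-fetch only the affected range
        }

        private final BadBlockRegistry registry;
        private final RangeStreamer streamer;

        CorruptRangeRecovery(BadBlockRegistry registry, RangeStreamer streamer)
        {
            this.registry = registry;
            this.streamer = streamer;
        }

        /** Called when a read hits an unrecoverable bit error in an sstable. */
        void onCorruptionDetected(String sstable, long startToken, long endToken)
        {
            registry.markCorrupted(sstable, startToken, endToken);
            streamer.streamFromHealthyReplicas(startToken, endToken);
        }

        /** Called once compaction or anti-compaction has rewritten the corrupted data away. */
        void onCorruptedSSTableGone(String sstable)
        {
            registry.forget(sstable);
        }
    }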



Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Jeremiah D Jordan
It is actually more complicated than just removing the sstable and running 
repair.

In the face of expired tombstones that might be covering data in other sstables 
the only safe way to deal with a bad sstable is wipe the token range in the bad 
sstable and rebuild/bootstrap that range (or wipe/rebuild the whole node which 
is usually the easier way).  If there are expired tombstones in play, it means 
they could have already been compacted away on the other replicas, but may not 
have compacted away on the current replica, meaning the data they cover could 
still be present in other sstables on this node.  Removing the sstable will 
mean resurrecting that data.  And pulling the range from other nodes does not 
help because they can have already compacted away the tombstone, so you won’t 
get it back.

Tl;DR you can’t just remove the one sstable you have to remove all data in the 
token range covered by the sstable (aka all data that sstable may have had a 
tombstone covering).  Then you can stream from the other nodes to get the data 
back.
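
A toy, self-contained Java example of the resurrection problem (not Cassandra 
code; the names and the simplified latest-timestamp-wins merge are invented purely 
for illustration):

    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    final class TombstoneResurrectionDemo
    {
        record Cell(String key, String value, long timestamp, boolean tombstone) {}

        // Latest-timestamp-wins merge across one replica's sstables.
        static Optional<String> read(String key, List<List<Cell>> sstables)
        {
            return sstables.stream()
                           .flatMap(List::stream)
                           .filter(c -> c.key().equals(key))
                           .max(Comparator.comparingLong(Cell::timestamp))
                           .filter(c -> !c.tombstone())
                           .map(Cell::value);
        }

        public static void main(String[] args)
        {
            List<Cell> dataSSTable      = List.of(new Cell("k", "old-value", 1, false));
            List<Cell> tombstoneSSTable = List.of(new Cell("k", null, 2, true)); // the delete, now past gc_grace

            // With both sstables present the tombstone shadows the data: nothing is returned.
            System.out.println(read("k", List.of(dataSSTable, tombstoneSSTable))); // Optional.empty

            // Other replicas may already have compacted both the tombstone and the data away.
            // If tombstoneSSTable is corrupted here and simply deleted, the old value
            // reappears locally -- and a subsequent repair would spread it back out.
            System.out.println(read("k", List.of(dataSSTable)));                   // Optional[old-value]
        }
    }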

-Jeremiah

> On Mar 8, 2023, at 7:24 AM, Bowen Song via dev  
> wrote:
> 
> At the moment, when a read error, such as unrecoverable bit error or data 
> corruption, occurs in the SSTable data files, regardless of the 
> disk_failure_policy configuration, manual (or to be precise, external) 
> intervention is required to recover from the error.
> 
> Commonly, there's two approach to recover from such error:
> 
> The safer, but slower recover strategy: replace the entire node.
> The less safe, but faster recover strategy: shut down the node, delete the 
> affected SSTable file(s), and then bring the node back online and run repair.
> Based on my understanding of Cassandra, it should be possible to recover from 
> such error by marking the affected token range in the existing SSTable as 
> "corrupted" and stop reading from them (e.g. creating a "bad block" file or 
> in memory), and then streaming the affected token range from the healthy 
> replicas. The corrupted SSTable file can then be removed upon the next 
> successful compaction involving it, or alternatively an anti-compaction is 
> performed on it to remove the corrupted data.
> 
> The advantage of this strategy is:
> 
> Reduced node down time - node restart or replacement is not needed
> Less data streaming is required - only the affected token range
> Faster recovery time - less streaming and delayed compaction or 
> anti-compaction
> No less safe than replacing the entire node
> This process can be automated internally, removing the need for operator 
> inputs
> The disadvantage is added complexity on the SSTable read path and it may mask 
> disk failures from the operator who is not paying attention to it.
> 
> What do you think about this?
> 



Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Miklosovic, Stefan
What about asking somebody from the Hadoop project to update it directly in 
Cassandra? I think these people have loads of experience with integrations like 
this. If we bumped the version to something like 3.3.x, refreshed the code and 
put some tests on top, I think we could just leave it there for a couple more 
years again.


From: Derek Chen-Becker 
Sent: Thursday, March 9, 2023 15:55
To: dev@cassandra.apache.org
Subject: Re: Role of Hadoop code in Cassandra 5.0

NetApp Security WARNING: This is an external email. Do not click links or open 
attachments unless you recognize the sender and know the content is safe.



I think the question isn't "Who ... is still using that?" but more "are we 
actually going to support it?" If we're on a version that old it would appear 
that we've basically abandoned it, although there do appear to have been 
refactoring (for other things) commits in the last couple of years. I would be 
in favor of removal from 5.0, but at the very least, could it be moved into a 
separate repo/package so that it's not pulling a relatively large dependency 
subtree from Hadoop into our main codebase?

Cheers,

Derek

On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan 
mailto:stefan.mikloso...@netapp.com>> wrote:
Hi list,

I stumbled upon Hadoop package again. I think there was some discussion about 
the relevancy of Hadoop code some time ago but I would like to ask this again.

Do you think Hadoop code (1) is still relevant in 5.0? Who in the industry is 
still using that?

We might drop a lot of code and some Hadoop dependencies too (3) (even their 
scope is "provided"). The version of Hadoop we build upon is 1.0.3 which was 
released 10 years ago. This code does not have any tests nor documentation on 
the website.

There seems to be issues like this (2) and it seems like the solution is to, 
basically, use Spark Cassandra connector instead which I would say is quite 
reasonable.

Regards

(1) 
https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
(2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
(3) 
https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589


--
+---+
| Derek Chen-Becker |
| GPG Key available at https://keybase.io/dchenbecker and   |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---+



Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Derek Chen-Becker
I think the question isn't "Who ... is still using that?" but more "are we
actually going to support it?" If we're on a version that old it would
appear that we've basically abandoned it, although there do appear to have
been refactoring (for other things) commits in the last couple of years. I
would be in favor of removal from 5.0, but at the very least, could it be
moved into a separate repo/package so that it's not pulling a relatively
large dependency subtree from Hadoop into our main codebase?

Cheers,

Derek

On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan <
stefan.mikloso...@netapp.com> wrote:

> Hi list,
>
> I stumbled upon the Hadoop package again. I think there was some discussion
> about the relevancy of the Hadoop code some time ago, but I would like to ask
> this again.
>
> Do you think the Hadoop code (1) is still relevant in 5.0? Who in the industry
> is still using that?
>
> We might drop a lot of code and some Hadoop dependencies too (3) (even though
> their scope is "provided"). The version of Hadoop we build upon is 1.0.3,
> which was released 10 years ago. This code does not have any tests or
> documentation on the website.
>
> There seem to be issues like this (2), and it seems the solution is,
> basically, to use the Spark Cassandra connector instead, which I would say is
> quite reasonable.
>
> Regards
>
> (1)
> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
> (3)
> https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589



-- 
+---+
| Derek Chen-Becker |
| GPG Key available at https://keybase.io/dchenbecker and   |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---+


Re: [EXTERNAL] Re: [DISCUSS] Next release date

2023-03-09 Thread Mick Semb Wever
>
> I've also found some useful Cassandra JIRA dashboards for previous
> releases to track progress and scope, but we don't have anything
> similar for the next release. Should we create one?
> Cassandra 4.0 GA Scope
> Cassandra 4.1 GA Scope
>


https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484


Re: [EXTERNAL] Re: [DISCUSS] Next release date

2023-03-09 Thread Maxim Muzafarov
When I was a release manager for another Apache project, I found it
useful to create Confluence pages for the upcoming release, both for
transparency of release dates and for benchmarks. Of course, the dates
can be updated once we have a better understanding of the scope
of the release.
Do we want something similar?

Here is an example:
https://cwiki.apache.org/confluence/display/IGNITE/Apache+Ignite+2.10

I've also found some useful Cassandra JIRA dashboards for previous
releases to track progress and scope, but we don't have anything
similar for the next release. Should we create one?
Cassandra 4.0 GA Scope
Cassandra 4.1 GA Scope

Example:
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=546

On Thu, 9 Mar 2023 at 10:13, Branimir Lambov  wrote:
>
> CEPs 25 (trie-indexed sstables) and 26 (unified compaction strategy) should 
> both be ready for review by mid-April.
>
> Both are around 10k LOC, fairly isolated, and in need of a committer to 
> review.
>
> Regards,
> Branimir
>
> On Mon, Mar 6, 2023 at 11:25 AM Benjamin Lerer  wrote:
>>
>> Sorry, I realized that when I started the discussion I probably did not
>> frame it well enough, as I see that it is now going in different directions.
>> The concerns I am seeing are:
>> 1) Too little time between releases is inefficient from both a development
>> perspective and a user perspective: from a development point of view because
>> we are missing time to deliver some features, and from a user perspective
>> because users cannot keep up with the upgrades.
>> 2) Some features are so anticipated (Accord being the one mentioned) that 
>> people would prefer to delay the release to make sure that it is available 
>> as soon as possible.
>> 3) We do not know how long we need to go from the freeze to GA. We hope for 
>> 2 months but our last experience was 6 months. So delaying the release could 
>> mean not releasing this year.
>> 4) For people doing marketing it is really hard to promote a product when 
>> you do not know when the release will come and what features might be there.
>>
>> All those concerns are probably made even worse by the fact that we do not
>> have clear visibility on where we are.
>>
>> Should we clarify that part first by getting an idea of the status of the 
>> different CEPs and other big pieces of work? From there we could agree on 
>> some timeline for the freeze. We could then discuss how to make predictable 
>> the time from freeze to GA.
>>
>>
>>
>>> On Sat, Mar 4, 2023 at 6:14 PM, Josh McKenzie wrote:
>>>
>>> (for convenience's sake, I'm referring to both Major and Minor semver
>>> releases as "major" in this email)
>>>
>>> The big feature from our perspective for 5.0 is ACCORD (CEP-15) and I would 
>>> advocate to delay until this has sufficient quality to be in production.
>>>
>>> This approach can be pretty unpredictable in this domain; often unforeseen 
>>> things come up in implementation that can give you a long tail on something 
>>> being production ready. For the record - I don't intend to single Accord 
>>> out at all on this front, quite the opposite given how much rigor's gone 
>>> into the design and implementation. I'm just thinking from my personal 
>>> experience: everything I've worked on, overseen, or followed closely on 
>>> this codebase always has a few tricks up its sleeve along the way to having 
>>> edge-cases stabilized.
>>>
>>> Much like on some other recent topics, I think there's a nuanced middle 
>>> ground where we take things on a case-by-case basis. Some factors that have 
>>> come up in this thread that resonated with me:
>>>
>>> For a given potential release date 'X':
>>> 1. How long has it been since the last release?
>>> 2. How long do we expect qualification to take from a "freeze" (i.e. no new 
>>> improvement or features, branch) point?
>>> 3. What body of merged production ready work is available?
>>> 4. What body of new work do we have high confidence will be ready within Y 
>>> time?
>>>
>>> I think it's worth defining a loose "minimum bound and upper bound" on 
>>> release cycles we want to try and stick with barring extenuating 
>>> circumstances. For instance: try not to release sooner than maybe 10 months 
>>> out from a prior major, and try not to release later than 18 months out 
>>> from a prior major. Make exceptions if truly exceptional things land, are 
>>> about to land, or bugs are discovered around those boundaries.
>>>
>>> Applying the above framework to what we have in flight, our last release 
>>> date, expectations on CI, etc - targeting an early fall freeze (pending CEP 
>>> status) and mid to late fall or December release "feels right" to me.
>>>
>>> With the exception, of course, that if something merges earlier, is stable, 
>>> and we feel is valuable enough to cut a major based on that, we do it.
>>>
>>> ~Josh
>>>
>>> On Fri, Mar 3, 2023, at 7:37 PM, German Eichberger via dev wrote:
>>>
>>> Hi,
>>>
>>> We shouldn't release just for release's sake ...

Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Miklosovic, Stefan
Hi list,

I stumbled upon the Hadoop package again. I think there was some discussion about
the relevancy of the Hadoop code some time ago, but I would like to ask this again.

Do you think the Hadoop code (1) is still relevant in 5.0? Who in the industry is
still using that?

We might drop a lot of code and some Hadoop dependencies too (3) (even though their
scope is "provided"). The version of Hadoop we build upon is 1.0.3, which was
released 10 years ago. This code does not have any tests or documentation on
the website.

There seem to be issues like this (2), and it seems the solution is, basically, to
use the Spark Cassandra connector instead, which I would say is quite reasonable.

Regards

(1) 
https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
(2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
(3) 
https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589
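
For readers unfamiliar with the Spark Cassandra connector alternative mentioned
above, the sketch below shows roughly what reading a table through the connector's
Java API looks like. The contact point, keyspace, and table names are placeholders,
and the exact connector version and API details should be checked against the
connector's own documentation:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

    public class SparkConnectorReadExample
    {
        public static void main(String[] args)
        {
            // Placeholder contact point; point this at a real cluster.
            SparkConf conf = new SparkConf()
                    .setAppName("cassandra-read-example")
                    .setMaster("local[*]")
                    .set("spark.cassandra.connection.host", "127.0.0.1");

            JavaSparkContext sc = new JavaSparkContext(conf);

            // Read a table as an RDD of rows and count them.
            long rows = javaFunctions(sc)
                    .cassandraTable("my_keyspace", "my_table")
                    .count();

            System.out.println("rows read: " + rows);
            sc.stop();
        }
    }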

Re: [RELEASE] Apache Cassandra 4.0.8 released

2023-03-09 Thread Brandon Williams
It was reported in CASSANDRA-18307 that the Debian and Redhat packages
for 4.0.8 did not make it to the jfrog repository - this has now been
corrected, sorry for any inconvenience.

Kind Regards,
Brandon

On Tue, Feb 14, 2023 at 3:39 PM Miklosovic, Stefan wrote:
>
> The Cassandra team is pleased to announce the release of Apache Cassandra 
> version 4.0.8.
>
> Apache Cassandra is a fully distributed database. It is the right choice when 
> you need scalability and high availability without compromising performance.
>
>  http://cassandra.apache.org/
>
> Downloads of source and binary distributions are listed in our download 
> section:
>
>  http://cassandra.apache.org/download/
>
> This version is a bug fix release[1] on the 4.0 series. As always, please pay
> attention to the release notes[2] and let us know[3] if you encounter
> any problems.
>
> [WARNING] Debian and RedHat package repositories have moved! Debian 
> /etc/apt/sources.list.d/cassandra.sources.list and RedHat 
> /etc/yum.repos.d/cassandra.repo files must be updated to the new repository 
> URLs. For Debian it is now https://debian.cassandra.apache.org . For RedHat 
> it is now https://redhat.cassandra.apache.org/40x/ .
>
> Enjoy!
>
> [1]: CHANGES.txt 
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/cassandra-4.0.8
> [2]: NEWS.txt 
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/cassandra-4.0.8
> [3]: https://issues.apache.org/jira/browse/CASSANDRA


Re: [EXTERNAL] Re: [DISCUSS] Next release date

2023-03-09 Thread Branimir Lambov
CEPs 25 (trie-indexed sstables) and 26 (unified compaction strategy) should
both be ready for review by mid-April.

Both are around 10k LOC, fairly isolated, and in need of a committer to
review.

Regards,
Branimir

On Mon, Mar 6, 2023 at 11:25 AM Benjamin Lerer  wrote:

> Sorry, I realized that when I started the discussion I probably did not
> frame it well enough, as I see that it is now going in different directions.
> The concerns I am seeing are:
> 1) Too little time between releases is inefficient from both a development
> perspective and a user perspective: from a development point of view because
> we are missing time to deliver some features, and from a user perspective
> because users cannot keep up with the upgrades.
> 2) Some features are so anticipated (Accord being the one mentioned) that
> people would prefer to delay the release to make sure that it is available
> as soon as possible.
> 3) We do not know how long we need to go from the freeze to GA. We hope
> for 2 months but our last experience was 6 months. So delaying the release
> could mean not releasing this year.
> 4) For people doing marketing it is really hard to promote a product when
> you do not know when the release will come and what features might be there.
>
> All those concerns are probably made even worse by the fact that we do not
> have clear visibility on where we are.
>
> Should we clarify that part first by getting an idea of the status of the
> different CEPs and other big pieces of work? From there we could agree on
> some timeline for the freeze. We could then discuss how to make predictable
> the time from freeze to GA.
>
>
>
> On Sat, Mar 4, 2023 at 6:14 PM, Josh McKenzie wrote:
>
>> (for convenience's sake, I'm referring to both Major and Minor semver
>> releases as "major" in this email)
>>
>> The big feature from our perspective for 5.0 is ACCORD (CEP-15) and I
>> would advocate to delay until this has sufficient quality to be in
>> production.
>>
>> This approach can be pretty unpredictable in this domain; often
>> unforeseen things come up in implementation that can give you a long tail
>> on something being production ready. For the record - I don't intend to
>> single Accord out *at all* on this front, quite the opposite given how
>> much rigor's gone into the design and implementation. I'm just thinking
>> from my personal experience: everything I've worked on, overseen, or
>> followed closely on this codebase always has a few tricks up its sleeve
>> along the way to having edge-cases stabilized.
>>
>> Much like on some other recent topics, I think there's a nuanced middle
>> ground where we take things on a case-by-case basis. Some factors that have
>> come up in this thread that resonated with me:
>>
>> For a given potential release date 'X':
>> 1. How long has it been since the last release?
>> 2. How long do we expect qualification to take from a "freeze" (i.e. no
>> new improvement or features, branch) point?
>> 3. What body of merged production ready work is available?
>> 4. What body of new work do we have high confidence will be ready within
>> Y time?
>>
>> I think it's worth defining a loose "minimum bound and upper bound" on
>> release cycles we want to try and stick with barring extenuating
>> circumstances. For instance: try not to release sooner than maybe 10 months
>> out from a prior major, and try not to release later than 18 months out
>> from a prior major. Make exceptions if truly exceptional things land, are
>> about to land, or bugs are discovered around those boundaries.
>>
>> Applying the above framework to what we have in flight, our last release
>> date, expectations on CI, etc - targeting an early fall freeze (pending CEP
>> status) and mid to late fall or December release "feels right" to me.
>>
>> With the exception, of course, that if something merges earlier, is
>> stable, and we feel is valuable enough to cut a major based on that, we do
>> it.
>>
>> ~Josh
>>
>> On Fri, Mar 3, 2023, at 7:37 PM, German Eichberger via dev wrote:
>>
>> Hi,
>>
>> We shouldn't release just for release's sake. Are there enough new
>> features and are they working well enough (quality!).
>>
>> The big feature from our perspective for 5.0 is ACCORD (CEP-15) and I
>> would advocate to delay until this has sufficient quality to be in
>> production.
>>
>> Just because something is released doesn't mean anyone is gonna use it.
>> To add some operator perspective: Every time there is a new release we need
>> to decide
>> 1) are we supporting it
>> 2) which other release can we deprecate
>>
>> and potentially migrate people - which is also a tough sell if there are
>> no significant features and/or breaking changes. So from my perspective,
>> less frequent releases are better - after all, we haven't gotten around to
>> supporting 4.1.
>>
>> The 5.0 release is also coupled with deprecating 3.11, which is what a
>> significant number of people are using - given 4.1 took longer, I am not
>> sure how many