[jira] [Closed] (TS-4242) Permanent disk failures are not handled gracefully

Bryan Call (JIRA) Wed, 17 Aug 2016 10:50:30 -0700

     [ 
https://issues.apache.org/jira/browse/TS-4242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Bryan Call closed TS-4242.
--------------------------
    Resolution: Won't Fix

We discussed this at the bug scrub and we don't want to manage bad-sector 
lists in Traffic Server.  The disk should ideally handle this itself, and if 
failures still reach the application layer, the disk should be replaced.

> Permanent disk failures are not handled gracefully
> --------------------------------------------------
>
>                 Key: TS-4242
>                 URL: https://issues.apache.org/jira/browse/TS-4242
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Cache
>            Reporter: Luca Bruno
>
> I'm simulating a disk failure of 1 sector with the following setup:
> {noformat}
> dd if=/dev/zero of=err.img bs=512 count=2097152
> losetup /dev/loop0 err.img
> dmsetup create err0 <<EOF
> 0 1024000 linear /dev/loop0 0
> 1024000 1 error
> 1024001 1073151 linear /dev/loop0 1024001
> EOF
> dmsetup mknodes err0
> {noformat}
> With the above commands, we create a 1 GiB disk and, at the 500 MiB mark, 
> simulate an error for a single 512-byte sector.
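
The geometry described above can be checked with a few lines of shell 
arithmetic. This is only a sketch: every number is copied from the dd and 
dmsetup commands in the report, while the variable names are made up here 
for illustration.

```shell
#!/bin/sh
# Sanity-check the device-mapper table from the report.
# Values come from the quoted dd/dmsetup commands; names are illustrative.
SECTOR_BYTES=512
TOTAL_SECTORS=2097152   # dd count=
BAD_START=1024000       # start sector of the "error" target
BAD_LEN=1               # length of the "error" target
TAIL_START=1024001      # start sector of the second linear segment
TAIL_LEN=1073151        # length of the second linear segment

# Size of the backing image and offset of the failing sector, in MiB.
image_mib=$(( TOTAL_SECTORS * SECTOR_BYTES / 1024 / 1024 ))
bad_mib=$(( BAD_START * SECTOR_BYTES / 1024 / 1024 ))

# The table should have no gap after the error target and should end
# exactly at the last sector of the backing image.
gap=$(( TAIL_START - (BAD_START + BAD_LEN) ))
table_end=$(( TAIL_START + TAIL_LEN ))

echo "image size:    ${image_mib} MiB"
echo "bad sector at: ${bad_mib} MiB"
echo "gap sectors:   ${gap}"
echo "table covers:  ${table_end} of ${TOTAL_SECTORS} sectors"
```

The three table segments are contiguous and together span the whole image, 
so only the single sector at the 500 MiB mark returns I/O errors.
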
> storage.config:
> {noformat}
> /dev/mapper/err0
> {noformat}
> Now I have a tool that randomly generates URLs, stores them, and requests 
> them back with a certain probability, so that I both write to and read from 
> the disk with a certain offered/expected hit ratio.
> Once I hit the 500 MiB mark, Traffic Server keeps emitting warnings about 
> disk errors. I fear it's because Traffic Server keeps writing to that bad 
> sector instead of skipping it.
> These are the errors/warnings I'm seeing in the log repeatedly:
> {noformat}
> [Feb 29 15:29:33.308] Server {0x2ac3f1cd4700} WARNING: <AIO.cc:410 
> (cache_op)> cache disk operation failed WRITE -1 5
> [Feb 29 15:29:33.309] Server {0x2ac3e56063c0} WARNING: <Cache.cc:2089 
> (handle_disk_failure)> Error accessing Disk /dev/mapper/err0 [1726/100000000]
> [Feb 29 15:29:33.320] Server {0x2ac3e56063c0} WARNING: <CacheRead.cc:1011 
> (openReadStartHead)> Head : Doc magic does not match for 
> 75B41B1A2C85AE637DD6CE368BF783D0
> [Feb 29 15:29:33.323] Server {0x2ac3eb480700} WARNING: <CacheRead.cc:1011 
> (openReadStartHead)> Head : Doc magic does not match for 
> 1075CEA6E2E47496BE190DBB448B0B64
> ...
> [Feb 29 15:29:33.284] Server {0x2ac3f28e0700} WARNING: <AIO.cc:410 
> (cache_op)> cache disk operation failed WRITE -1 5
> [Feb 29 15:29:33.287] Server {0x2ac3eb682700} WARNING: <Cache.cc:2089 
> (handle_disk_failure)> Error accessing Disk /dev/mapper/err0 [1725/100000000]
> [Feb 29 15:29:33.289] Server {0x2ac3eb682700} WARNING: <CacheRead.cc:1011 
> (openReadStartHead)> Head : Doc magic does not match for 
> 7E3325870F5488955118359E6C4B10F4
> [Feb 29 15:29:33.289] Server {0x2ac3eb27e700} WARNING: <CacheRead.cc:1011 
> (openReadStartHead)> Head : Doc magic does not match for 
> 7AE309F21ABF9B3774C67921018FCA0E
> ...
> {noformat}
> Summary: Traffic Server seems to treat I/O errors as temporary rather than 
> permanent. Is this true? If so, the only options are:
> 1. Replace the hard disk.
> 2. Use device-mapper to skip the bad sector.
> Both options mean throwing away a whole disk cache of terabytes for just one 
> bad sector.
> If this is what's really happening, is it feasible to skip the bad sector? If 
> so, I could work on a patch.
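
For reference, option 2 in the description could look roughly like the 
following. It is a sketch, not something taken from the issue: skip_table is 
a hypothetical helper, /dev/sdX and skip0 are placeholder names, and the 
script only prints a table without touching any device. It relies on the 
same device-mapper "linear" target used in the reproduction above.

```shell
#!/bin/sh
# skip_table DEVICE TOTAL_SECTORS BAD_SECTOR   (hypothetical helper)
# Print a device-mapper table that concatenates the sectors before and
# after BAD_SECTOR, so the bad sector is never addressed. The resulting
# device is one sector shorter than the backing device.
skip_table() {
    dev=$1
    total=$2
    bad=$3
    tail_len=$(( total - bad - 1 ))
    # Line format: logical_start length linear backing_device backing_offset
    if [ "$bad" -gt 0 ]; then
        echo "0 $bad linear $dev 0"
    fi
    if [ "$tail_len" -gt 0 ]; then
        echo "$bad $tail_len linear $dev $(( bad + 1 ))"
    fi
}

# Same geometry as the reproduction: bad sector 1024000 of 2097152.
skip_table /dev/sdX 2097152 1024000

# To apply it (requires root; placeholder names):
#   skip_table /dev/sdX 2097152 1024000 | dmsetup create skip0
```

Everything after the bad sector shifts by one sector on the new device, so 
data written there before the remap no longer sits at the same offsets. That 
is consistent with the reporter's remark that this route also means starting 
over with an empty cache.
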



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
