Re: [gentoo-user] Re: OT: Fighting bit rot
On Wed, 9 Jan 2013 02:47:07 +0000 (UTC) Grant Edwards
<grant.b.edwa...@gmail.com> wrote:

>> ZFS is designed to deal with this problem by checksumming fs blocks
>> continually; it does this at the filesystem level, not at the disk
>> firmware level.
>
> I don't understand. If you're worried about media failure, what good
> does checksumming at the file level do when failing media produces
> seek/read errors rather than erroneous data? When the media fails,
> there is no data to checksum.

Not file level - it's filesystem level. It checksums filesystem blocks.

And we are not talking about failing media either; we are talking about
media corruption. You appear to have conflated them.

The data on a medium can corrupt, and it can corrupt silently for a long
time. At some point it may deteriorate to where it passes a cusp, and
then you will get your first visible sign - a read failure. You did not
see anything that happened prior, as it was silent.

--
Alan McKinnon
alan.mckin...@gmail.com
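[The filesystem-level block checksumming described above can be sketched in miniature. This is an illustrative toy in Python, not ZFS's actual design (ZFS stores checksums in parent metadata blocks, not beside the data): each block is stored with a SHA-256 digest, and every read re-hashes the block and compares, so silent corruption surfaces as an error instead of bad data.]

```python
import hashlib

# Toy block store: keep a SHA-256 digest next to every block so that
# silent corruption is detected on read. Purely illustrative -- real
# ZFS keeps checksums in parent metadata blocks, forming a Merkle tree.

class ChecksummedStore:
    def __init__(self):
        self._blocks = {}  # block number -> (digest, data)

    def write(self, blkno, data):
        self._blocks[blkno] = (hashlib.sha256(data).hexdigest(), data)

    def read(self, blkno):
        digest, data = self._blocks[blkno]
        if hashlib.sha256(data).hexdigest() != digest:
            raise IOError("checksum mismatch in block %d" % blkno)
        return data

store = ChecksummedStore()
store.write(0, b"important data")
print(store.read(0))

# Simulate silent media corruption: flip bytes behind the store's back.
digest, data = store._blocks[0]
store._blocks[0] = (digest, b"importent data")
try:
    store.read(0)
except IOError as e:
    print("caught:", e)
```

[The point the sketch makes: the drive's own ECC never sees anything wrong here, because the corrupted bytes are returned "successfully"; only the filesystem-level checksum catches it.]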
Re: [gentoo-user] Re: OT: Fighting bit rot
On Jan 9, 2013 10:41 PM, Holger Hoffstaette
<holger.hoffstae...@googlemail.com> wrote:

> On Wed, 09 Jan 2013 14:48:33 +0000, Grant Edwards wrote:
>
>> On 2013-01-09, Alan McKinnon <alan.mckin...@gmail.com> wrote:
>>> On Wed, 9 Jan 2013 02:47:07 +0000 (UTC)
>>> The data on a medium can corrupt, and it can corrupt silently for a
>>> long time.
>>
>> And I'm saying I've never seen that happen.
>
> Well, that's the point.
>
> http://storagemojo.com/2007/09/19/cerns-data-corruption-research/
>
> Things are much worse than you think.
>
> -h

But that link also shows a bright spot: RAID arrays are in general more
reliable, with a BER less than one-third of their spec.

Rgds,
--
Re: [gentoo-user] Re: OT: Fighting bit rot
On Wednesday, 9 January 2013, 07:17:25, walt wrote:

> On 01/08/2013 08:40 PM, Volker Armin Hemmann wrote:
>
>> On Tuesday, 8 January 2013, 19:11:19, James wrote:
>>
>>> Volker Armin Hemmann <volkerarmin at googlemail.com> writes:
>>> Comments/guidance on ZFS vs BTRFS are welcome. I never used ZFS;
>>> googling suggests lots of disdain for ZFS? Maybe someone knows a
>>> good article or wiki discussion where the various merits of the
>>> currently available file systems are presented?
>>
>> does btrfs support raid levels other than 1? zfs does. It's freaking
>> easy to set up and to use, can handle swap files, supports dedup,
>> and is not Linux-only.
>
> Are you using the gentoo zfs and zfs-kmod packages to get zfs support?

yes.

> Are they ready for prime time?

they work for me. I don't use latest-and-greatest kernels and I use
vanilla kernel.org sources.

  # zpool status
    pool: zfstank
   state: ONLINE
  status: One or more devices has experienced an error resulting in
          data corruption. Applications may be affected.
  action: Restore the file in question if possible. Otherwise restore
          the entire pool from backup.
     see: http://zfsonlinux.org/msg/ZFS-8000-8A
    scan: scrub repaired 0 in 3h23m with 9 errors on Sat Jan 5 05:47:34 2013
  config:

        NAME                                          STATE  READ WRITE CKSUM
        zfstank                                       ONLINE    0     0     0
          raidz1-0                                    ONLINE    0     0     0
            ata-Hitachi_HDS5C3020ALA632_ML4230FA17X6EK  ONLINE  0   0     0
            ata-Hitachi_HDS5C3020ALA632_ML4230FA17X6HK  ONLINE  0   0     0
            ata-Hitachi_HDS5C3020ALA632_ML4230FA17X7YK  ONLINE  0   0     0

  errors: 9 data errors, use '-v' for a list

those errors were caused by a memory glitch (and are video files...
so... I don't even care about them - also they are still on two
different backup media...), but zfs caught these errors. ext4? I really
doubt it.

--
#163933
Re: [gentoo-user] Re: OT: Fighting bit rot
On Wed, 9 Jan 2013 14:48:33 +0000 (UTC) Grant Edwards
<grant.b.edwa...@gmail.com> wrote:

> On 2013-01-09, Alan McKinnon <alan.mckin...@gmail.com> wrote:
>> On Wed, 9 Jan 2013 02:47:07 +0000 (UTC)
>> The data on a medium can corrupt, and it can corrupt silently for a
>> long time.
>
> And I'm saying I've never seen that happen. So you're saying that the
> data on a medium can corrupt without being detected by the block
> encodings and CRCs used by the disk controller?

No, I'm not saying that *at*all*.

I've been saying all along that data which you never go near can
corrupt silently while you are not using it. When you do eventually get
around to reading it, the electronics will do what they are designed to
do and properly detect a problem that already happened.

>> At some point it may deteriorate to where it passes a cusp and then
>> you will get your first visible sign
>
> No, the first visible sign in the scenario you're describing would be
> a read returning erroneous data.

That's what I said. The first VISIBLE sign is an error. You want to
catch it before then.

Analogy time: a murderer plans to do Grant in. By observing Grant and
only observing Grant, the first visible sign of an issue is the death
of Grant. Obviously this is sub-optimal, and we should be looking at a
few more things than just Grant in order to preserve Grant.

>> - read failure. You did not see anything that happened prior as it
>> was silent.
>
> If a read successfully returns correct data, how is it silent?

I never used those words and never said "successfully returns correct
data". At best I said something equivalent to "if a read returns".

The point I'm trying hard to make is that all our fancy hardware merely
gives an *appearance* of reliable results that are totally right or
totally wrong. It looks that way because the IT industry has spent much
time creating wrappers and APIs to give that effect. Under the covers,
where the actual storage happens, it is not like that, and errors can
happen. They are rare.

Lucky for us, these days we have precision machinery and clever
mathematics that reduce the problem vastly. I know that in my own case
the electronics offer a reliability that far exceeds what I need, so I
can afford to ignore rare problems. Other people have different needs.

--
Alan McKinnon
alan.mckin...@gmail.com
Re: [gentoo-user] Re: OT: Fighting bit rot
On Tue, Jan 8, 2013 at 10:29 AM, Grant Edwards
<grant.b.edwa...@gmail.com> wrote:

> On 2013-01-08, Florian Philipp <li...@binarywings.net> wrote:
>
>> On 08.01.2013 00:20, Alan McKinnon wrote:
>>
>>> On Mon, 07 Jan 2013 21:11:35 +0100 Florian Philipp
>>> <li...@binarywings.net> wrote:
>>>
>>>> Hi list! I have a use case where I am seriously concerned about
>>>> bit rot [1] and I thought it might be a good idea to start looking
>>>> for it in my own private stuff, too. [...]
>>>>
>>>> [1] http://en.wikipedia.org/wiki/Bit_rot
>>>>
>>>> Regards, Florian Philipp
>>>
>>> You are using a very peculiar definition of bitrot. Bits do not
>>> rot; they are not apples in a barrel. Bitrot usually refers to code
>>> that goes unmaintained and no longer works in the system it was
>>> installed in. What definition are you using?
>>
>> That's why I referred to Wikipedia, not the Jargon File ;-)
>
> The Wikipedia page to which you refer has _two_ definitions. The
> uncommon one you're using:
> http://en.wikipedia.org/wiki/Bit_rot#Decay_of_storage_media
> and the common one:
> http://en.wikipedia.org/wiki/Bit_rot#Problems_with_software
>
> I've heard the term "bit rot" for decades, but I've never heard the
> "decay of storage media" usage. It's always referred to unmaintained
> code that no longer works because of changes to tools or the
> surrounding environment.

Frankly, I'd heard of bitrot first as applying to decay of storage
media. But this was back when your average storage medium (floppies and
early hard disks) was expected to decay within months, if not weeks.
The term's application to software utility being damaged by assumptions
about its platform is a far, far newer usage. I still think of crappy
media and errors in transmission before I think of platform
compatibility decay.

--
:wq
Re: [gentoo-user] Re: OT: Fighting bit rot
On 08.01.2013 16:42, Michael Mol wrote:

> On Tue, Jan 8, 2013 at 10:29 AM, Grant Edwards
> <grant.b.edwa...@gmail.com> wrote:
>
>> On 2013-01-08, Florian Philipp <li...@binarywings.net> wrote:
>>
>>> On 08.01.2013 00:20, Alan McKinnon wrote:
>>>
>>>> On Mon, 07 Jan 2013 21:11:35 +0100 Florian Philipp
>>>> <li...@binarywings.net> wrote:
>>>>
>>>>> Hi list! I have a use case where I am seriously concerned about
>>>>> bit rot [1] and I thought it might be a good idea to start
>>>>> looking for it in my own private stuff, too. [...]
>>>>>
>>>>> [1] http://en.wikipedia.org/wiki/Bit_rot
>>>>>
>>>>> Regards, Florian Philipp
>>>>
>>>> You are using a very peculiar definition of bitrot. Bits do not
>>>> rot; they are not apples in a barrel. Bitrot usually refers to
>>>> code that goes unmaintained and no longer works in the system it
>>>> was installed in. What definition are you using?
>>>
>>> That's why I referred to Wikipedia, not the Jargon File ;-)
>>
>> The Wikipedia page to which you refer has _two_ definitions. The
>> uncommon one you're using:
>> http://en.wikipedia.org/wiki/Bit_rot#Decay_of_storage_media
>> and the common one:
>> http://en.wikipedia.org/wiki/Bit_rot#Problems_with_software
>>
>> I've heard the term "bit rot" for decades, but I've never heard the
>> "decay of storage media" usage. It's always referred to unmaintained
>> code that no longer works because of changes to tools or the
>> surrounding environment.
>
> Frankly, I'd heard of bitrot first as applying to decay of storage
> media. But this was back when your average storage medium (floppies
> and early hard disks) was expected to decay within months, if not
> weeks. The term's application to software utility being damaged by
> assumptions about its platform is a far, far newer usage. I still
> think of crappy media and errors in transmission before I think of
> platform compatibility decay.
>
> --
> :wq

Google Scholar and Google Search both have both usages on the first
page of their search results for "bit rot". So let's agree that both
forms are common, depending on the context.

Next time, when I write about "Fighting bugs", I'll make it clear
whether I'm dealing with an infestation of critters.

Regards,
Florian Philipp
Re: [gentoo-user] Re: OT: Fighting bit rot
On 08.01.2013 20:53, Grant Edwards wrote:

> On 2013-01-08, Pandu Poluan <pa...@poluan.info> wrote:
>
>> On Jan 8, 2013 11:20 PM, Florian Philipp <li...@binarywings.net>
>> wrote:
>>
>>> [-- snip --]
>>>
>>> Hmm, good idea, albeit similar to `md5sum -c`. Either tool leaves
>>> you with the problem of distinguishing between legitimate changes
>>> (i.e. a user wrote to the file) and decay. When you have completely
>>> static content, md5sum, rsync and friends are sufficient. But if
>>> you have content that changes from time to time, the number of
>>> false positives would be too high. In this case, I think you could
>>> easily distinguish by comparing both file content and time stamps.
>>> Now, that of course introduces the problem that decay could occur
>>> in the same time frame as a legitimate change, thus masking the
>>> decay. To reduce this risk, you have to reduce the checking
>>> interval.
>>>
>>> Regards, Florian Philipp
>>
>> IMO, we're all barking up the wrong tree here... Before a file's
>> content can change without user involvement, bit rot must first get
>> through the checksum (CRC?) of the hard disk itself. There will be
>> no 'gradual degradation of data', just 'catastrophic data loss'.
>
> When a hard drive starts to fail, you don't unknowingly get back
> rotten data with some bits flipped. You get either a seek error or a
> read error, and no data at all. IIRC, the same is true for attempts
> to read a failing CD.
>
> However, if you've got failing RAM that doesn't have hardware ECC,
> that often appears as corrupted data in files. If a bit gets
> erroneously flipped in a RAM page that's being used to cache file
> data, and that page is marked as dirty, then the erroneous bits will
> get written back to disk just like the rest of them.

Related: The guys in [1] observed md5sums of data and noticed all kinds
of issues: bit rot, temporary controller issues and so on.

[1] Schwarz et al.: Disk Failure Investigations at the Internet Archive
http://www.hpl.hp.com/personal/Mary_Baker/publications/wip.pdf

Regards,
Florian Philipp
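[Florian's checksum-plus-timestamp heuristic could look roughly like this. A hypothetical sketch, not any existing tool: keep a manifest of (mtime, SHA-256) per file; on recheck, a changed digest with an unchanged mtime suggests decay, while a changed mtime suggests a legitimate edit.]

```python
import hashlib
import os

# Hypothetical bit-rot check, sketching the heuristic from the thread:
#   content changed + mtime unchanged -> suspected silent corruption
#   mtime changed                     -> treat as a legitimate edit

def file_digest(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def snapshot(paths):
    """Record (mtime_ns, digest) for each file -- the 'known good' state."""
    return {p: (os.stat(p).st_mtime_ns, file_digest(p)) for p in paths}

def recheck(manifest):
    suspect, edited = [], []
    for path, (mtime, digest) in manifest.items():
        if os.stat(path).st_mtime_ns != mtime:
            edited.append(path)    # timestamp moved: probably a real write
        elif file_digest(path) != digest:
            suspect.append(path)   # bytes moved, timestamp didn't: decay?
    return suspect, edited
```

[As the thread notes, decay that lands in the same window as a legitimate edit is masked, so the interval between rechecks bounds the exposure.]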
Re: [gentoo-user] Re: OT: Fighting bit rot
On Tue, 8 Jan 2013 19:53:41 +0000 (UTC) Grant Edwards
<grant.b.edwa...@gmail.com> wrote:

> On 2013-01-08, Pandu Poluan <pa...@poluan.info> wrote:
>
>> On Jan 8, 2013 11:20 PM, Florian Philipp <li...@binarywings.net>
>> wrote:
>>
>>> [-- snip --]
>>>
>>> Hmm, good idea, albeit similar to `md5sum -c`. Either tool leaves
>>> you with the problem of distinguishing between legitimate changes
>>> (i.e. a user wrote to the file) and decay. When you have completely
>>> static content, md5sum, rsync and friends are sufficient. But if
>>> you have content that changes from time to time, the number of
>>> false positives would be too high. In this case, I think you could
>>> easily distinguish by comparing both file content and time stamps.
>>> Now, that of course introduces the problem that decay could occur
>>> in the same time frame as a legitimate change, thus masking the
>>> decay. To reduce this risk, you have to reduce the checking
>>> interval.
>>>
>>> Regards, Florian Philipp
>>
>> IMO, we're all barking up the wrong tree here... Before a file's
>> content can change without user involvement, bit rot must first get
>> through the checksum (CRC?) of the hard disk itself. There will be
>> no 'gradual degradation of data', just 'catastrophic data loss'.
>
> When a hard drive starts to fail, you don't unknowingly get back
> rotten data with some bits flipped. You get either a seek error or a
> read error, and no data at all. IIRC, the same is true for attempts
> to read a failing CD.

I see what Florian is getting at here, and he's perfectly correct.

We techie types often like to think our storage is purely binary: the
cells are either on or off, and they never change unless we
deliberately make them change. We think this way because we wrap our
storage in layers to make it look that way, in the style of an API.

The truth is that our storage is subject to decay. Hard drives are
magnetic at heart, and atoms have to align and stay aligned for the
drive to work. Floppies are infinitely worse at this, but hard drives
are not immune.

Writeable CDs do not have physical pits and lands like factory-pressed
discs; they use chemicals to make reflective and non-reflective spots.

The list of points of corruption is long, and they all happen after the
data has been committed to physical storage. Worse, you only find out
about the corruption by reading the medium; there is no other way to
discover whether the medium and the data are still OK. He wants to read
the medium occasionally and verify it while the backups are still
usable, and not wait for the point of no return - the read error from a
medium that long since failed.

Maybe Florian's data is valuable enough to warrant the effort. I know
mine isn't, but his might be.

--
Alan McKinnon
alan.mckin...@gmail.com
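[The "read the medium occasionally" idea needs nothing fancier than forcing every sector of every file back through the drive's own error detection. A hypothetical sketch of such a scrub pass:]

```python
import os

# Hypothetical scrub: read every file end to end so the drive's own
# ECC/CRC gets a chance to report trouble while backups are still good.
# A read error (OSError) is the drive telling us a sector has gone bad.

def scrub(root, chunk=1 << 20):
    bad = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    while f.read(chunk):
                        pass
            except OSError as e:
                bad.append((path, e))
    return bad
```

[Run from cron, say monthly. An empty result means every sector was still readable; as the thread stresses, that proves readability, not integrity, so pair it with checksums to catch data the drive returns "successfully" but wrong.]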
Re: [gentoo-user] Re: OT: Fighting bit rot
On Tue, 8 Jan 2013 22:15:15 +0000 (UTC) Grant Edwards
<grant.b.edwa...@gmail.com> wrote:

> IMO, having backup data _is_ very valuable, but regularly reading
> files and comparing them to backup copies isn't a useful way to
> detect failing media.

He doesn't suggest you compare the live data to a backup. He suggests
you compare the current checksum to the last known (presumed or
verified as good) checksum, and if they differ, deal with it. "Deal
with it" likely involves a restore after some kind of verification
process.

I agree that comparing current data with a backup is pretty pointless -
you don't know which one is bad if they differ.

ZFS is designed to deal with this problem by checksumming fs blocks
continually; it does this at the filesystem level, not at the disk
firmware level. Pity about the license incompatibility; it's a great fs.

--
Alan McKinnon
alan.mckin...@gmail.com
Re: [gentoo-user] Re: OT: Fighting bit rot
On Tuesday, 8 January 2013, 19:11:19, James wrote:

> Volker Armin Hemmann <volkerarmin at googlemail.com> writes:
>
> Comments/guidance on ZFS vs BTRFS are welcome. I never used ZFS;
> googling suggests lots of disdain for ZFS? Maybe someone knows a good
> article or wiki discussion where the various merits of the currently
> available file systems are presented?

does btrfs support raid levels other than 1? zfs does. It's freaking
easy to set up and to use, can handle swap files, supports dedup, and
is not Linux-only.

--
#163933