Re: [gentoo-user] Re: OT: Fighting bit rot

2013-01-09 Thread Alan McKinnon
On Wed, 9 Jan 2013 02:47:07 + (UTC)
Grant Edwards grant.b.edwa...@gmail.com wrote:

  ZFS is designed to deal with this problem by checksumming fs blocks
  continually; it does this at the filesystem level, not at the disk
  firmware level.  
 
 I don't understand.  If you're worried about media failure, what good
 does checksumming at the file level do when failing media produces
 seek/read errors rather than erroneous data?  When the media fails,
 there is no data to checksum.

Not file level - it's filesystem level. It checksums filesystem blocks.

And we are not talking about failing media either, we are talking about
media corruption. You appear to have conflated them.

The data on a medium can corrupt, and it can corrupt silently for a
long time. At some point it may deteriorate past a cusp, and then you
get your first visible sign: a read failure. You saw nothing of what
happened before that, because it was silent.
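The filesystem-level checksumming described above can be sketched roughly as follows (a minimal illustration only, assuming SHA-256 and a hypothetical 4 KiB block size; ZFS's actual on-disk format and hash choices differ): a digest is recorded per block at write time and re-checked at read or scrub time.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size, for illustration


def checksum_blocks(data: bytes):
    """Split data into fixed-size blocks and record one digest per block,
    as a checksumming filesystem would at write time."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    sums = [hashlib.sha256(b).hexdigest() for b in blocks]
    return blocks, sums


def find_corrupt(blocks, sums):
    """Re-hash each block, as at read/scrub time; return indices that
    no longer match their stored checksum."""
    return [i for i, (b, s) in enumerate(zip(blocks, sums))
            if hashlib.sha256(b).hexdigest() != s]


blocks, sums = checksum_blocks(b"\xab" * 10000)
print(find_corrupt(blocks, sums))    # → []  (nothing corrupted yet)
blocks[1] = b"\x00" + blocks[1][1:]  # simulate a silently flipped byte
print(find_corrupt(blocks, sums))    # → [1] (the damaged block is detected)
```

The point of doing this above the disk firmware is that the check catches corruption regardless of where it crept in: on the platter, in the cable, or in the controller.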

-- 
Alan McKinnon
alan.mckin...@gmail.com




Re: [gentoo-user] Re: OT: Fighting bit rot

2013-01-09 Thread Pandu Poluan
On Jan 9, 2013 10:41 PM, Holger Hoffstaette 
holger.hoffstae...@googlemail.com wrote:

 On Wed, 09 Jan 2013 14:48:33 +, Grant Edwards wrote:

  On 2013-01-09, Alan McKinnon alan.mckin...@gmail.com wrote:
  On Wed, 9 Jan 2013 02:47:07 + (UTC)
 
  The data on a medium can corrupt, and it can corrupt silently for a
  long time.
 
  And I'm saying I've never seen that happen.

 Well, that's the point.

 http://storagemojo.com/2007/09/19/cerns-data-corruption-research/

 Things are much worse than you think.

 -h


But that link also shows a bright spot: RAID arrays are in general more
reliable, with a BER less than one-third of their spec.

Rgds,
--


Re: [gentoo-user] Re: OT: Fighting bit rot

2013-01-09 Thread Volker Armin Hemmann
On Wednesday, 9 January 2013 at 07:17:25, walt wrote:
 On 01/08/2013 08:40 PM, Volker Armin Hemmann wrote:
  On Tuesday, 8 January 2013 at 19:11:19, James wrote:
  Volker Armin Hemmann volkerarmin at googlemail.com writes:
  
  Comments/guidance on ZFS vs BTRFS are welcome. I never used ZFS; googling
  suggests lots of disdain for ZFS? Maybe someone knows a good article
  or wiki discussion where the various merits of the currently available
  file
  systems are presented?
  
  does btrfs support raid levels other than 1?
  
  ZFS does. It is freaking easy to set up and use. It can handle swap
  files and supports dedup.
  And it is not Linux-only.
 
 Are you using the gentoo zfs and zfs-kmod packages to get zfs support?

yes.

 Are
 they ready for prime time?

they work for me. I don't use latest-and-greatest kernels and I use vanilla 
kernel.org sources. 

zpool status
  pool: zfstank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 3h23m with 9 errors on Sat Jan  5 05:47:34 2013
config:

        NAME                                            STATE     READ WRITE CKSUM
        zfstank                                         ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            ata-Hitachi_HDS5C3020ALA632_ML4230FA17X6EK  ONLINE       0     0     0
            ata-Hitachi_HDS5C3020ALA632_ML4230FA17X6HK  ONLINE       0     0     0
            ata-Hitachi_HDS5C3020ALA632_ML4230FA17X7YK  ONLINE       0     0     0

errors: 9 data errors, use '-v' for a list

Those errors were caused by a memory glitch (and they are video files, so I
don't even care about them - they are also still on two different backup
media), but ZFS caught these errors. Would ext4 have? I really doubt it.

-- 
#163933



Re: [gentoo-user] Re: OT: Fighting bit rot

2013-01-09 Thread Alan McKinnon
On Wed, 9 Jan 2013 14:48:33 + (UTC)
Grant Edwards grant.b.edwa...@gmail.com wrote:

 On 2013-01-09, Alan McKinnon alan.mckin...@gmail.com wrote:
  On Wed, 9 Jan 2013 02:47:07 + (UTC)
 
  The data on a medium can corrupt, and it can corrupt silently for a
  long time.
 
 And I'm saying I've never seen that happen.
 
 So you're saying that the data on a medium can corrupt without being
 detected by the block encodings and CRCs used by the disk controller?

No, I'm not saying that *at all*.

I've been saying all along that data which you never go near can
corrupt silently while you are not using it. When you do eventually get
around to reading it, the electronics will do what they are designed to
do and properly detect a problem that already happened.

 
  At some point it may deteriorate to where it passes a cusp
  and then you will get your first visible sign
 
 No, the first visible sign in the scenario you're describing would be
 a read returning erroneous data. 

That's what I said. The first VISIBLE sign is an error. You want to
catch it before then.

Analogy time: a murderer plans to do Grant in. By observing Grant and
only Grant, the first visible sign of an issue is the death of Grant.
Obviously this is suboptimal, and we should be looking at a few more
things than just Grant in order to preserve Grant.

 
  - read failure. You did not see anything that happened prior as it
  was silent.
 
 If a read successfully returns correct data, how is it silent?
 
I never used those words and never said "successfully returns correct
data". At best I said something equivalent to "if a read returns".

The point I'm trying hard to make is that all our fancy hardware merely
gives an *appearance* of reliable results that are totally right or
totally wrong. It looks that way because the IT industry spent much
time creating wrappers and APIs to give that effect. Under the covers,
where the actual storage happens, it is not like that, and errors can
happen. They are rare.

Lucky for us, these days we have precision machinery and clever
mathematics that reduce the problem vastly. I know in my own case the
electronics offer a reliability that far exceeds what I need so I can
afford to ignore rare problems. Other people have different needs.


-- 
Alan McKinnon
alan.mckin...@gmail.com




Re: [gentoo-user] Re: OT: Fighting bit rot

2013-01-08 Thread Michael Mol
On Tue, Jan 8, 2013 at 10:29 AM, Grant Edwards
grant.b.edwa...@gmail.com wrote:
 On 2013-01-08, Florian Philipp li...@binarywings.net wrote:
 On 08.01.2013 00:20, Alan McKinnon wrote:
 On Mon, 07 Jan 2013 21:11:35 +0100
 Florian Philipp li...@binarywings.net wrote:

 Hi list!

 I have a use case where I am seriously concerned about bit rot [1]
 and I thought it might be a good idea to start looking for it in my
 own private stuff, too.
 [...]
 [1] http://en.wikipedia.org/wiki/Bit_rot

 Regards,
 Florian Philipp


 You are using a very peculiar definition of bitrot.

  bits do not rot, they are not apples in a barrel. Bitrot usually
  refers to code that goes unmaintained and no longer works in the
  system in which it was installed. What definition are you using?

 That's why I referred to wikipedia, not the jargon file ;-)

 The wikipedia page to which you refer has _two_ definitions.  The
 uncommon one you're using:

  http://en.wikipedia.org/wiki/Bit_rot#Decay_of_storage_media

   and the common one:

  http://en.wikipedia.org/wiki/Bit_rot#Problems_with_software

 I've heard the term "bit rot" for decades, but I've never heard the
 decay-of-storage-media usage.  It's always referred to unmaintained
 code that no longer works because of changes to tools or the
 surrounding environment.

Frankly, I'd heard of bitrot first as applying to decay of storage
media. But this was back when your average storage media decay
(floppies and early hard disks) was expected to happen within months,
if not weeks.

The term's application to software utility being damaged by assumptions
about its platform is a far, far newer usage. I still think of crappy
media and errors in transmission before I think of platform
compatibility decay.

--
:wq



Re: [gentoo-user] Re: OT: Fighting bit rot

2013-01-08 Thread Florian Philipp
On 08.01.2013 16:42, Michael Mol wrote:
 On Tue, Jan 8, 2013 at 10:29 AM, Grant Edwards
 grant.b.edwa...@gmail.com wrote:
 On 2013-01-08, Florian Philipp li...@binarywings.net wrote:
  On 08.01.2013 00:20, Alan McKinnon wrote:
 On Mon, 07 Jan 2013 21:11:35 +0100
 Florian Philipp li...@binarywings.net wrote:

 Hi list!

 I have a use case where I am seriously concerned about bit rot [1]
 and I thought it might be a good idea to start looking for it in my
 own private stuff, too.
 [...]
 [1] http://en.wikipedia.org/wiki/Bit_rot

 Regards,
 Florian Philipp


 You are using a very peculiar definition of bitrot.

  bits do not rot, they are not apples in a barrel. Bitrot usually
  refers to code that goes unmaintained and no longer works in the
  system in which it was installed. What definition are you using?

 That's why I referred to wikipedia, not the jargon file ;-)

  The wikipedia page to which you refer has _two_ definitions.  The
  uncommon one you're using:

   http://en.wikipedia.org/wiki/Bit_rot#Decay_of_storage_media

    and the common one:

   http://en.wikipedia.org/wiki/Bit_rot#Problems_with_software

  I've heard the term "bit rot" for decades, but I've never heard the
  decay-of-storage-media usage.  It's always referred to unmaintained
  code that no longer works because of changes to tools or the
  surrounding environment.
 
 Frankly, I'd heard of bitrot first as applying to decay of storage
 media. But this was back when your average storage media decay
 (floppies and early hard disks) was expected to happen within months,
 if not weeks.
 
 The term's application to software utility being damaged by assumptions
 about its platform is a far, far newer usage. I still think of crappy
 media and errors in transmission before I think of platform
 compatibility decay.
 
 --
 :wq
 

Google Scholar and Google Search have both usages on the first page of
their search results for "bit rot". So let's agree that both forms are
common depending on the context. Next time, when I write about "Fighting
bugs", I'll make it clear whether I'm dealing with an infestation of critters.

Regards,
Florian Philipp



signature.asc
Description: OpenPGP digital signature


Re: [gentoo-user] Re: OT: Fighting bit rot

2013-01-08 Thread Florian Philipp
On 08.01.2013 20:53, Grant Edwards wrote:
 On 2013-01-08, Pandu Poluan pa...@poluan.info wrote:
 On Jan 8, 2013 11:20 PM, Florian Philipp li...@binarywings.net wrote:


 -- snip --


 Hmm, good idea, albeit similar to `md5sum -c`. Either tool leaves
 you with the problem of distinguishing between legitimate changes (i.e.
 a user wrote to the file) and decay.

 When you have completely static content, md5sum, rsync and friends are
 sufficient. But if you have content that changes from time to time, the
 number of false-positives would be too high. In this case, I think you
 could easily distinguish by comparing both file content and time stamps.

 Now, that of course introduces the problem that decay could occur in the
 same time frame as a legitimate change, thus masking the decay. To
 reduce this risk, you have to reduce the checking interval.
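The heuristic described above can be sketched as follows (a minimal illustration assuming SHA-256 and POSIX mtimes; the function names are made up): a legitimate write updates the file's mtime, while silent decay changes the content but leaves the mtime untouched.

```python
import hashlib
import os


def snapshot(paths):
    """Record (content digest, mtime) per file, e.g. from a nightly cron job."""
    state = {}
    for p in paths:
        with open(p, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        state[p] = (digest, os.stat(p).st_mtime)
    return state


def suspect_decay(old_state):
    """Flag files whose content changed although the mtime did not -
    the signature of silent corruption rather than a legitimate write."""
    suspects = []
    for p, (old_sum, old_mtime) in old_state.items():
        with open(p, "rb") as f:
            new_sum = hashlib.sha256(f.read()).hexdigest()
        if new_sum != old_sum and os.stat(p).st_mtime == old_mtime:
            suspects.append(p)
    return suspects
```

As noted above, decay that lands in the same interval as a legitimate write is masked by the fresh mtime, which is why the checking interval matters.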

 Regards,
 Florian Philipp

 IMO, we're all barking up the wrong tree here...

 Before a file's content can change without user involvement, bit rot must
 first get through the checksum (CRC?) of the hard disk itself. There will
 be no 'gradual degradation of data', just 'catastrophic data loss'.
 
 When a hard drive starts to fail, you don't unknowingly get back
 rotten data with some bits flipped.  You get either a seek error
 or read error, and no data at all.  IIRC, the same is true for
 attempts to read a failing CD.
 
 However, if you've got failing RAM that doesn't have hardware ECC,
 that often appears as corrupted data in files.  If a bit gets
 erroneously flipped in a RAM page that's being used to cache file
 data, and that page is marked as dirty, then the erroneous bits will
 get written back to disk just like the rest of them.
 

Related: The guys in [1] observed md5sums of data and noticed all kinds
of issues: bit rot, temporary controller issues, and so on.

[1] Schwarz et al.: Disk Failure Investigations at the Internet Archive
http://www.hpl.hp.com/personal/Mary_Baker/publications/wip.pdf

Regards,
Florian Philipp




signature.asc
Description: OpenPGP digital signature


Re: [gentoo-user] Re: OT: Fighting bit rot

2013-01-08 Thread Alan McKinnon
On Tue, 8 Jan 2013 19:53:41 + (UTC)
Grant Edwards grant.b.edwa...@gmail.com wrote:

 On 2013-01-08, Pandu Poluan pa...@poluan.info wrote:
  On Jan 8, 2013 11:20 PM, Florian Philipp li...@binarywings.net
  wrote:
 
 
  -- snip --
 
 
  Hmm, good idea, albeit similar to the `md5sum -c`. Either tool
  leaves you with the problem of distinguishing between legitimate
  changes (i.e. a user wrote to the file) and decay.
 
  When you have completely static content, md5sum, rsync and friends
  are sufficient. But if you have content that changes from time to
  time, the number of false-positives would be too high. In this
  case, I think you could easily distinguish by comparing both file
  content and time stamps.
 
  Now, that of course introduces the problem that decay could occur
  in the same time frame as a legitimate change, thus masking the
  decay. To reduce this risk, you have to reduce the checking
  interval.
 
  Regards,
  Florian Philipp
 
  IMO, we're all barking up the wrong tree here...
 
  Before a file's content can change without user involvement, bit
  rot must first get through the checksum (CRC?) of the hard disk
  itself. There will be no 'gradual degradation of data', just
  'catastrophic data loss'.
 
 When a hard drive starts to fail, you don't unknowingly get back
 rotten data with some bits flipped.  You get either a seek error
 or read error, and no data at all.  IIRC, the same is true for
 attempts to read a failing CD.

I see what Florian is getting at here, and he's perfectly correct.

We techie types often like to think our storage is purely binary: the
cells are either on or off, and they never change unless we
deliberately make them change. We think this way because we wrap our
storage in layers to make it look that way, in the style of an API.


The truth is that our storage is subject to decay. Hard drives are
magnetic at heart, and atoms have to align and stay aligned for the
drive to work. Floppies are infinitely worse at this, but hard drives
are not immune. Writable CDs do not have physical pits and lands like
factory-pressed discs have; they use chemicals to make reflective and
non-reflective spots. The list of points of corruption is long, and
they all happen after the data has been committed to physical storage.

Worse, you only learn about the corruption by reading it; there is no
other way to discover whether the medium and the data are still OK. He
wants to read the medium occasionally and verify it while the backups
are still usable, not wait for the point of no return - a read error
from a medium that failed long ago.

Maybe Florian's data is valuable enough to warrant the effort. I
know mine isn't, but his might be.


-- 
Alan McKinnon
alan.mckin...@gmail.com




Re: [gentoo-user] Re: OT: Fighting bit rot

2013-01-08 Thread Alan McKinnon
On Tue, 8 Jan 2013 22:15:15 + (UTC)
Grant Edwards grant.b.edwa...@gmail.com wrote:

 IMO, having backup data _is_ very valuable, but regularly reading
 files and comparing them to backup copies isn't a useful way to detect
 failing media.

He doesn't suggest you compare the live data to a backup. He suggests
you compare the current checksum to the last known (presumed or
verified as good) checksum, and if they are different then deal with it.

"Deal with it" likely involves a restore after some kind of verify
process.
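That workflow - compare the current checksum against the last known-good one and, on mismatch, restore from a verified backup - can be sketched like this (hypothetical helper names, for illustration only):

```python
import hashlib
import shutil


def file_sha256(path):
    """Digest a file's current content."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def check_and_restore(path, known_good_sum, backup_path):
    """Verify a file against its last known-good checksum; on mismatch,
    'deal with it' by restoring from a backup that is verified first."""
    if file_sha256(path) == known_good_sum:
        return "ok"
    # Verify the backup before trusting it - otherwise we don't know
    # which copy is the bad one.
    if file_sha256(backup_path) != known_good_sum:
        raise RuntimeError("backup does not verify either - escalate")
    shutil.copy2(backup_path, path)
    return "restored"
```

Keeping an independently stored known-good checksum is what avoids the problem Alan points out next: with only two differing copies and no checksum, you cannot tell which one went bad.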

I agree that comparing current data with a backup is pretty pointless -
you don't know which is the bad one if they differ.

ZFS is designed to deal with this problem by checksumming fs blocks
continually; it does this at the filesystem level, not at the disk
firmware level. Pity about the license incompatibility, it's a great fs.

-- 
Alan McKinnon
alan.mckin...@gmail.com




Re: [gentoo-user] Re: OT: Fighting bit rot

2013-01-08 Thread Volker Armin Hemmann
On Tuesday, 8 January 2013 at 19:11:19, James wrote:
 Volker Armin Hemmann volkerarmin at googlemail.com writes:

 Comments/guidance on ZFS vs BTRFS are welcome. I never used ZFS; googling
 suggests lots of disdain for ZFS? Maybe someone knows a good article
 or wiki discussion where the various merits of the currently available file
 systems are presented?

does btrfs support raid levels other than 1?

ZFS does. It is freaking easy to set up and use. It can handle swap files and
supports dedup.
And it is not Linux-only.
-- 
#163933