There is a small but measurable bit error rate in copying data around: some RAM chips see a bit error per GB per century, others per GB per hour. HDFS itself has (I believe) a checksum, but moving gigabytes in and out of it is still vulnerable. At this point, new file systems should include optional checksums for all files.

http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
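If you want to rule this out yourself, one quick check is to pull the file back out with `hadoop fs -get` and compare digests of the original and the round-tripped copy. A minimal sketch (the `FileDigest` class name and buffer size are just illustrative; any digest or CRC would do the job):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Illustrative helper: compute an MD5 digest of a local file so the
    // original can be compared against a copy pulled back out of HDFS
    // (e.g. with "hadoop fs -get").
    public class FileDigest {
        public static String md5Hex(String path)
                throws IOException, NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] buf = new byte[64 * 1024];
            FileInputStream in = new FileInputStream(path);
            try {
                int n;
                // Feed the whole file through the digest in chunks.
                while ((n = in.read(buf)) != -1) {
                    md.update(buf, 0, n);
                }
            } finally {
                in.close();
            }
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest()) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        }

        public static void main(String[] args) throws Exception {
            // Usage: java FileDigest original.txt roundtrip.txt
            System.out.println(args[0] + "  " + md5Hex(args[0]));
            System.out.println(args[1] + "  " + md5Hex(args[1]));
        }
    }

If the two hex strings differ, the corruption happened somewhere along the copy path rather than inside your map code.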
On Sat, Oct 16, 2010 at 1:31 PM, Raymond Jennings III <[email protected]> wrote:
> I am curious if your data got corrupted when you transferred your file into
> HDFS? I recently had a very similar situation to yours where I had about 5
> lines of decimal points getting corrupted. When I transferred the file back
> out of HDFS and compared it to the original is when I finally figured out
> what was wrong. I don't have an answer for your specific question but am
> just curious if you had experienced the same thing that I did.
>
> ________________________________
> From: Boyu Zhang <[email protected]>
> To: [email protected]; [email protected]
> Sent: Fri, October 15, 2010 5:02:08 PM
> Subject: Corrupted input data to map
>
> Hi all,
>
> I am running a program with input 1 million lines of data, among the 1
> million, 5 or 6 lines data are corrupted. The way the are corrupted is: in
> the position which a float number is expected, like 3.4, instead of a float
> number, something like this is there: 3.4.5.6. So when the map runs, it
> throws a multiple point in num exception.
>
> My question is: the map tasks that have the exception are marked failure,
> how about the data processed by the same map before the exception, do they
> reach the reduce task? or they are treated like garbage? Thank you very much
> any help is appreciated.
>
> Boyu

--
Lance Norskog
[email protected]
