There is a small but measurable bit error rate in copying data around:
some RAM chips see one bit error per GB per century, others one per GB
per hour. HDFS itself has (I believe) a checksum, but moving the
gigabytes around is still vulnerable. At this point new file systems
should include optional checksums for all files.

http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
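
A cheap sanity check is to checksum the file before copying it into HDFS
and again after copying it back out; if the digests differ, something in
the path corrupted it. A minimal sketch in plain Java (the command-line
file names are just an example; md5sum from the shell works too):

import java.io.FileInputStream;
import java.security.MessageDigest;

// Compute an MD5 digest of a local file so the copy pulled back out of
// HDFS can be compared against the original.
public class FileDigest {
  public static String md5(String path) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    byte[] buf = new byte[64 * 1024];
    FileInputStream in = new FileInputStream(path);
    try {
      int n;
      while ((n = in.read(buf)) != -1) {
        md.update(buf, 0, n);          // feed each chunk into the digest
      }
    } finally {
      in.close();
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : md.digest()) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

  public static void main(String[] args) throws Exception {
    // e.g. java FileDigest original.txt copied-back-from-hdfs.txt
    for (String path : args) {
      System.out.println(md5(path) + "  " + path);
    }
  }
}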

On Sat, Oct 16, 2010 at 1:31 PM, Raymond Jennings III
<[email protected]> wrote:
> I am curious whether your data got corrupted when you transferred your file
> into HDFS. I recently had a very similar situation to yours, where I had
> about 5 lines of decimal numbers getting corrupted. It was only when I
> transferred the file back out of HDFS and compared it to the original that I
> finally figured out what was wrong. I don't have an answer for your specific
> question, but am just curious whether you experienced the same thing that I
> did.
>
>
>
>
> ________________________________
> From: Boyu Zhang <[email protected]>
> To: [email protected]; [email protected]
> Sent: Fri, October 15, 2010 5:02:08 PM
> Subject: Corrupted input data to map
>
> Hi all,
>
> I am running a program with an input of 1 million lines of data; among the 1
> million, 5 or 6 lines are corrupted. The way they are corrupted is: in a
> position where a float number is expected, like 3.4, something like 3.4.5.6
> appears instead. So when the map runs, it throws a multiple-points-in-number
> exception.
>
> My question is: the map tasks that hit the exception are marked as failed,
> but what about the data processed by the same map task before the exception?
> Does it reach the reduce task, or is it treated like garbage? Thank you very
> much, any help is appreciated.
>
> Boyu
>
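
On the question quoted above: as far as I know, reducers only copy output
from map attempts that completed successfully, so anything a failed
attempt emitted is thrown away when the task is retried; it never reaches
the reduce side. If you want a handful of bad lines not to fail the task
at all, one common approach is to catch the parse error in the mapper,
count the record, and skip it. A rough sketch against the 0.20 API (the
tab-separated layout with the float in column 2 is made up for
illustration):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Skips malformed records instead of letting a NumberFormatException
// kill the whole task attempt; bad lines show up as a job counter.
public class TolerantMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t");
    try {
      double v = Double.parseDouble(fields[1]);      // e.g. "3.4"
      context.write(new Text(fields[0]), new DoubleWritable(v));
    } catch (NumberFormatException e) {              // e.g. "3.4.5.6"
      context.getCounter("TolerantMapper", "MALFORMED_RECORDS").increment(1);
    } catch (ArrayIndexOutOfBoundsException e) {
      context.getCounter("TolerantMapper", "SHORT_RECORDS").increment(1);
    }
  }
}

Hadoop also has a skip-bad-records facility (SkipBadRecords), but for a
few known-bad lines the explicit catch is usually simpler.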



-- 
Lance Norskog
[email protected]
