Hello,

If I have two SequenceFiles (f1 and f2) that were converted from the same
text file, I would expect them to contain the same content
(i.e., to be "semantically equivalent").  In fact, if I run -text on f1 and
f2 and diff the textual output, the two are identical.

But when I run md5sum on each block of f1 and f2 (as stored on the local
file system), I get md5sum(f1.block) != md5sum(f2.block) for every block.
I understand that there must be some magic numbers / metadata embedded in
each block, so the md5sum of the raw data won't match.

So my question is: is there a way to tell whether the contents of two blocks
(or FileInputSplit for mappers) are the same?
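
To make the question concrete, here is a rough sketch of the kind of
comparison I have in mind. It does not use the real SequenceFile format:
the toy layout, the 16-byte random marker, and the names
write_toy_container, read_toy_container, and content_digest are all made up
for illustration. The idea is to hash the decoded records instead of the
raw bytes.

```python
import hashlib
import os
import struct

# Assumption: a per-file random marker, used here only to mimic
# file-specific metadata. This is NOT the real SequenceFile layout.
MARKER_SIZE = 16


def write_toy_container(records):
    """Serialize (key, value) byte pairs into a hypothetical container:
    a random marker followed by length-prefixed records."""
    out = bytearray(os.urandom(MARKER_SIZE))
    for key, value in records:
        out += struct.pack(">II", len(key), len(value))
        out += key + value
    return bytes(out)


def read_toy_container(blob):
    """Yield (key, value) pairs back, skipping the random marker."""
    pos = MARKER_SIZE
    while pos < len(blob):
        key_len, value_len = struct.unpack_from(">II", blob, pos)
        pos += 8
        key = blob[pos:pos + key_len]
        pos += key_len
        value = blob[pos:pos + value_len]
        pos += value_len
        yield key, value


def content_digest(records):
    """MD5 over decoded records; lengths are hashed too, so that record
    boundaries affect the result."""
    md5 = hashlib.md5()
    for key, value in records:
        md5.update(struct.pack(">II", len(key), len(value)))
        md5.update(key)
        md5.update(value)
    return md5.hexdigest()


records = [(b"1", b"first line"), (b"2", b"second line")]
f1 = write_toy_container(records)
f2 = write_toy_container(records)

# Raw bytes differ because of the random marker; decoded records do not.
raw_equal = hashlib.md5(f1).hexdigest() == hashlib.md5(f2).hexdigest()
content_equal = (content_digest(read_toy_container(f1))
                 == content_digest(read_toy_container(f2)))
print("raw md5 equal:", raw_equal)
print("content md5 equal:", content_equal)
```

For real blocks or splits I would presumably have to read the records
through the SequenceFile reader and feed the keys and values to the digest
in the same way, but I am not sure whether that is the recommended approach.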

Thanks!

Wei
