Hello,

If I have two SequenceFiles (f1 and f2) that were converted from the same text file, I would expect them to contain the same content (i.e., to be "semantically equivalent"). Indeed, if I run -text (e.g. hadoop fs -text) on f1 and f2 and diff the two textual outputs, they are identical.
But when I run md5sum on each block (as stored on the local file system) of f1 and f2, I get md5sum(f1.block) != md5sum(f2.block) for every block. I understand that there must be some magic numbers / metadata embedded in each block, so the md5sum of the raw data won't match. So my question is: is there a way to tell whether the contents of two blocks (or of the FileSplits handed to mappers) are the same?

Thanks!
Wei
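
One possible direction, offered only as a sketch: hash the decoded records instead of the raw block bytes. As far as I know, each SequenceFile writer generates its own sync marker, which is stored in the header and repeated between records, so the raw bytes differ even when the records are identical. The Python below illustrates the idea with a toy container. The names logical_digest and toy_container are hypothetical, and the toy layout is not the real SequenceFile format.

```python
import hashlib
import os
import struct


def logical_digest(records):
    """Return an MD5 hex digest over (key, value) byte pairs only."""
    h = hashlib.md5()
    for key, value in records:
        # Length-prefix each field so (b"ab", b"c") and (b"a", b"bc")
        # do not hash to the same value.
        h.update(struct.pack(">I", len(key)))
        h.update(key)
        h.update(struct.pack(">I", len(value)))
        h.update(value)
    return h.hexdigest()


def toy_container(records):
    """Hypothetical stand-in for a container file (NOT the SequenceFile layout).

    A random 16-byte marker is written first and again before every record,
    which is enough to make the raw bytes differ between two writes.
    """
    marker = os.urandom(16)
    body = b"".join(marker + k + b"\t" + v + b"\n" for k, v in records)
    return marker + body


records = [(b"0", b"first line"), (b"11", b"second line")]

# Raw-byte digests of two containers built from the same records.
raw1 = hashlib.md5(toy_container(records)).hexdigest()
raw2 = hashlib.md5(toy_container(records)).hexdigest()

# Record-level digests of the same logical content.
logical1 = logical_digest(records)
logical2 = logical_digest(list(records))

print("raw digests equal:    ", raw1 == raw2)
print("logical digests equal:", logical1 == logical2)
```

In a real job the records would come from reading each split with a SequenceFile reader (for example inside a map task) rather than from an in-memory list. Note that a per-split comparison is only meaningful if both files put the same records into corresponding splits, which is an assumption and may not hold if the files differ in compression or layout.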