Is the checksum pluggable? CRC-32 is good for error detection, but not for duplicate detection, which is what I need it for.
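For duplicate detection a cryptographic digest such as the SHA-1 mentioned earlier in the thread is the usual choice, since CRC-32 collides far too easily. A minimal sketch in plain Java (independent of Hadoop's checksum machinery; the class and method names here are illustrative, not part of any Hadoop API):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

public class BlockDedup {
    private final Set<String> seen = new HashSet<>();

    // Returns true if a block with identical content was seen before.
    public boolean isDuplicate(byte[] block) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(block);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return !seen.add(hex.toString()); // add() returns false if already present
    }

    public static void main(String[] args) throws Exception {
        BlockDedup dedup = new BlockDedup();
        byte[] a = "some block".getBytes(StandardCharsets.UTF_8);
        byte[] b = "other block".getBytes(StandardCharsets.UTF_8);
        System.out.println(dedup.isDuplicate(a)); // false: first sighting
        System.out.println(dedup.isDuplicate(b)); // false
        System.out.println(dedup.isDuplicate(a)); // true: same content as before
    }
}
```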
On Sun, Jul 29, 2012 at 8:41 PM, Brock Noland <br...@cloudera.com> wrote:

> Also note that HDFS already does checksums, which I believe you can retrieve:
>
> http://hadoop.apache.org/common/docs/r1.0.3/api/org/apache/hadoop/fs/FileSystem.html#getFileChecksum(org.apache.hadoop.fs.Path)
> http://hadoop.apache.org/common/docs/r1.0.3/hdfs_design.html#Data+Integrity
>
> Brock
>
> On Sun, Jul 29, 2012 at 12:35 PM, Yaron Gonen <yaron.go...@gmail.com> wrote:
>
>> Thanks!
>> I'll dig into those classes to figure out my next step.
>>
>> Anyway, I just realized that block-level compression has nothing to do with
>> HDFS blocks. An HDFS block can contain an unknown number of compressed
>> blocks, which makes my efforts kind of worthless.
>>
>> Thanks again!
>>
>> On Sun, Jul 29, 2012 at 6:40 PM, Tim Broberg <tim.brob...@exar.com> wrote:
>>
>>> What if you wrote a CompressionOutputStream class that wraps around the
>>> existing ones and outputs a hash per <n> bytes, and a
>>> CompressionInputStream that checks them? ...and a Codec that wraps your
>>> compressors around arbitrary existing codecs.
>>>
>>> Sounds like a bunch of work, and I'm not sure where you would store the
>>> hashes, but it would get the data into your clutches the instant it's
>>> available.
>>>
>>> - Tim.
>>>
>>> On Jul 29, 2012, at 7:41 AM, "Yaron Gonen" <yaron.go...@gmail.com> wrote:
>>>
>>> Hi,
>>> I've created a SequenceFile.Writer with block-level compression. I'd like
>>> to create a SHA1 hash for each block written. How do I do that? I didn't
>>> see any way to take control of the compression in order to know when a
>>> block is over.
>>>
>>> Thanks,
>>> Yaron
>
> --
> Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
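Tim's suggestion of a wrapping stream that emits a hash per <n> bytes can be sketched with plain java.io streams. This is only an illustration of the wrapping idea, assuming SHA-1 and a fixed block size; in practice you would wrap Hadoop's CompressionOutputStream the same way, and the class name here is made up:

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

// Illustrative wrapper: records one SHA-1 digest per blockSize bytes written.
public class HashingOutputStream extends FilterOutputStream {
    private final int blockSize;
    private final MessageDigest sha1;
    private final List<byte[]> blockHashes = new ArrayList<>();
    private int count = 0;

    public HashingOutputStream(OutputStream out, int blockSize) throws NoSuchAlgorithmException {
        super(out);
        this.blockSize = blockSize;
        this.sha1 = MessageDigest.getInstance("SHA-1");
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        sha1.update((byte) b);
        if (++count == blockSize) rollBlock();
    }

    private void rollBlock() {
        blockHashes.add(sha1.digest()); // digest() also resets the MessageDigest
        count = 0;
    }

    @Override
    public void close() throws IOException {
        if (count > 0) rollBlock(); // hash the final partial block
        super.close();
    }

    public List<byte[]> getBlockHashes() { return blockHashes; }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        HashingOutputStream hos = new HashingOutputStream(sink, 4);
        hos.write(new byte[10]); // 10 bytes at blockSize 4 -> 2 full blocks + 1 partial
        hos.close();
        System.out.println(hos.getBlockHashes().size()); // prints 3
    }
}
```

As Tim notes, the open question is where to store the hashes; this sketch just keeps them in memory, leaving placement (a side file, a header, etc.) to the caller.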