Re: CompressionCodec in MapReduce

Grzegorz Gunia Wed, 11 Apr 2012 02:24:06 -0700

I think we misunderstood each other here.

I'll base my question upon an example:

Let's say I want each of the files stored on my HDFS to be encrypted prior to being physically stored on the cluster. For that I'll write a custom CompressionCodec that performs the encryption, and use it during any edits/creations of files in the HDFS. Then, to make it more secure, I'll make it use different keys for different files, and supply the keys to the codec during its instantiation.

Now I'd like to run a MapReduce job on those files. That would require instantiating the codec and supplying it with the filename, to determine the key used. Is it possible to do so with the current implementation of Hadoop?


--
Greg

On 2012-04-11 10:44, Zizon Qiu wrote:

If all of the following hold:
1. you are using TextInputFormat,
2. all input files end with a certain suffix such as ".gz",
3. the custom CompressionCodec is already registered in the configuration and its getDefaultExtension returns the same suffix as described in 2,

then there is nothing else you need to do.
Hadoop will deal with it automatically.

That means the input key and value in the map method are already decompressed.
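
As a rough illustration of the suffix matching described above, here is a simplified standalone model. It is not Hadoop's actual code: the class SuffixCodecLookup, the ".enc" extension and com.example.EncryptionCodec are hypothetical names made up for this sketch. Only the ".gz" and ".bz2" extensions of GzipCodec and BZip2Codec are taken from Hadoop.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Simplified model of suffix-based codec selection; NOT Hadoop's implementation.
class SuffixCodecLookup {
    // Maps each codec's default extension to its class name.
    private final Map<String, String> codecsBySuffix = new LinkedHashMap<>();

    // Mirrors "registering" a codec: remember what its getDefaultExtension() reports.
    void register(String defaultExtension, String codecClassName) {
        codecsBySuffix.put(defaultExtension, codecClassName);
    }

    // Returns the first registered codec whose extension the file name ends with.
    Optional<String> codecFor(String fileName) {
        for (Map.Entry<String, String> entry : codecsBySuffix.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty(); // no match: the file would be read as-is
    }

    static SuffixCodecLookup withExampleCodecs() {
        SuffixCodecLookup lookup = new SuffixCodecLookup();
        lookup.register(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        lookup.register(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        // Hypothetical custom codec and extension, for illustration only.
        lookup.register(".enc", "com.example.EncryptionCodec");
        return lookup;
    }

    public static void main(String[] args) {
        SuffixCodecLookup lookup = withExampleCodecs();
        System.out.println(lookup.codecFor("part-00000.gz"));
        System.out.println(lookup.codecFor("data.enc"));
        System.out.println(lookup.codecFor("plain.txt"));
    }
}
```

Note that the lookup sees only the suffix, so every file with the same extension gets the same codec class. That is why a per-file key needs the file path from somewhere else.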

But if the original files do not end with a certain suffix, you need to write your own InputFormat or subclass TextInputFormat and override the createRecordReader method to return your own RecordReader. The InputSplit passed to the InputFormat is actually a FileSplit, from which you can retrieve the input file path.

You may also take a look at the isSplitable method declared in FileInputFormat, if your files are not splittable.


For more detail, refer to the TextInputFormat class implementation.
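
To make the per-file idea concrete, here is a minimal standalone sketch of what a custom RecordReader could do once it knows the file name: look up that file's key and wrap the raw stream in a decrypting stream. It uses only the JDK, not Hadoop. The class PerFileDecryption, the in-memory key map, the AES/CTR choice and the "16-byte IV stored at the start of the file" layout are all assumptions made for this example. In a real job the file name would come from the split's path, and the raw stream from opening that path on the file system.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper: per-file keys, selected by file name at read time.
class PerFileDecryption {
    private static final String TRANSFORMATION = "AES/CTR/NoPadding";
    private static final int IV_LENGTH = 16;

    // Assumed key store: file name -> 16-byte AES key.
    private final Map<String, byte[]> keysByFileName = new HashMap<>();

    void addKey(String fileName, byte[] key) {
        keysByFileName.put(fileName, key.clone());
    }

    // Writer side: output is a random IV followed by the ciphertext.
    byte[] encrypt(String fileName, byte[] plain) throws GeneralSecurityException {
        byte[] iv = new byte[IV_LENGTH];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance(TRANSFORMATION);
        cipher.init(Cipher.ENCRYPT_MODE, keyFor(fileName), new IvParameterSpec(iv));
        byte[] body = cipher.doFinal(plain);
        byte[] stored = new byte[IV_LENGTH + body.length];
        System.arraycopy(iv, 0, stored, 0, IV_LENGTH);
        System.arraycopy(body, 0, stored, IV_LENGTH, body.length);
        return stored;
    }

    // Reader side: consume the IV, then return a stream that decrypts the rest.
    InputStream open(String fileName, InputStream raw)
            throws IOException, GeneralSecurityException {
        byte[] iv = new byte[IV_LENGTH];
        new DataInputStream(raw).readFully(iv);
        Cipher cipher = Cipher.getInstance(TRANSFORMATION);
        cipher.init(Cipher.DECRYPT_MODE, keyFor(fileName), new IvParameterSpec(iv));
        return new CipherInputStream(raw, cipher);
    }

    private SecretKeySpec keyFor(String fileName) {
        byte[] key = keysByFileName.get(fileName);
        if (key == null) {
            throw new IllegalArgumentException("No key registered for " + fileName);
        }
        return new SecretKeySpec(key, "AES");
    }

    // Encrypts and then decrypts the text under the key registered for fileName.
    static String roundTrip(String fileName, String text) {
        try {
            PerFileDecryption files = new PerFileDecryption();
            // Demo key only; a real setup would fetch keys from a protected store.
            files.addKey(fileName, "0123456789abcdef".getBytes(StandardCharsets.US_ASCII));
            byte[] stored = files.encrypt(fileName, text.getBytes(StandardCharsets.UTF_8));
            ByteArrayOutputStream plain = new ByteArrayOutputStream();
            try (InputStream in = files.open(fileName, new ByteArrayInputStream(stored))) {
                byte[] buffer = new byte[4096];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    plain.write(buffer, 0, n);
                }
            }
            return new String(plain.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException | GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("logs-2012-04-11.enc", "hello"));
    }
}
```

Because this sketch reads the IV from the start of the file and decrypts sequentially from there, each file has to be processed as a whole. In a job that would correspond to isSplitable returning false for these inputs.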

On Wed, Apr 11, 2012 at 4:16 PM, Grzegorz Gunia <sawt...@student.agh.edu.pl> wrote:


    Thanks for your reply! That clears some things up.
    There is but one problem... My CompressionCodec has to be
    instantiated on a per-file basis, meaning it needs to know the
    name of the file it is to compress/decompress. I'm guessing that
    would not be possible with the current implementation?

    Or if so, how would I proceed with injecting it with the file name?
    --
    Greg

    On 2012-04-11 10:12, Zizon Qiu wrote:

    Append your custom codec's full class name to
    "io.compression.codecs", either in mapred-site.xml or in the
    Configuration object passed to the Job constructor.

    The MapReduce framework will try to guess the compression algorithm
    from the input file's suffix.

    If the getDefaultExtension() of any CompressionCodec registered in the
    configuration matches the suffix, Hadoop will try to instantiate the
    codec and, if that succeeds, decompress the input for you automatically.

    The default value for "io.compression.codecs" is
    "org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec"

    On Wed, Apr 11, 2012 at 3:55 PM, Grzegorz Gunia
    <sawt...@student.agh.edu.pl> wrote:

        Hello,
        I am trying to apply a custom CompressionCodec to work with
        MapReduce jobs, but I haven't found a way to inject it during
        the reading of input data, or during the write of the job
        results.
        Am I missing something, or is there no support for compressed
        files in the filesystem?

        I am well aware of how to set it up to work during the
        intermediate phases of the MapReduce operation, but I just
        can't find a way to apply it BEFORE the job takes place...
        Is there any other way except simply uncompressing the files
        I need prior to scheduling a job?

        Huge thanks for any help you can give me!
        --
        Greg
