It is possible, but a little tricky. As I mentioned before, write a custom InputFormat and the associated RecordReader.
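Something along these lines should work (an untested sketch against the new mapreduce API; MyEncryptionCodec and the "encryption.codec.current.file" property are placeholders for however your codec actually receives the file name and key, they are not existing Hadoop classes or properties):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.ReflectionUtils;

public class EncryptedTextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    // An encrypted file cannot be cut at arbitrary byte offsets,
    // so give the whole file to a single mapper.
    return false;
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new EncryptedLineRecordReader();
  }

  public static class EncryptedLineRecordReader
      extends RecordReader<LongWritable, Text> {

    private BufferedReader in;
    private long lineNo = 0;
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
        throws IOException {
      FileSplit fileSplit = (FileSplit) split;   // the split knows which file it covers
      Path path = fileSplit.getPath();
      Configuration conf = new Configuration(context.getConfiguration());
      // Placeholder property: your codec would read it in setConf() to pick the key.
      conf.set("encryption.codec.current.file", path.toString());
      // ReflectionUtils calls setConf() on the codec, so it sees the property above.
      CompressionCodec codec =
          ReflectionUtils.newInstance(MyEncryptionCodec.class, conf);
      FileSystem fs = path.getFileSystem(conf);
      in = new BufferedReader(new InputStreamReader(
          codec.createInputStream(fs.open(path))));
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      String line = in.readLine();
      if (line == null) {
        return false;
      }
      key.set(lineNo++);   // line number instead of byte offset, for simplicity
      value.set(line);
      return true;
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() { return 0.0f; }  // could track bytes consumed
    @Override public void close() throws IOException { in.close(); }
  }
}

Then point the job at it with job.setInputFormatClass(EncryptedTextInputFormat.class). Registering the codec in "io.compression.codecs" (as in the quoted mails below) only helps when TextInputFormat can pick the codec by file suffix; it gives you no hook to pass a per-file key, which is why the custom RecordReader is needed here.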
On Wed, Apr 11, 2012 at 5:23 PM, Grzegorz Gunia <sawt...@student.agh.edu.pl> wrote:
> I think we misunderstood each other here.
>
> I'll base my question on an example:
> Let's say I want each of the files stored on my HDFS to be encrypted prior
> to being physically stored on the cluster.
> For that I'll write a custom CompressionCodec that performs the
> encryption, and use it during any edits/creations of files in the HDFS.
> Then, to make it more secure, I'll have it use different keys for
> different files, and supply the keys to the codec during its instantiation.
>
> Now I'd like to run a MapReduce job on those files. That would require
> instantiating the codec and supplying it with the filename, to determine
> the key used. Is it possible to do so with the current implementation of
> Hadoop?
>
> --
> Greg
>
> On 2012-04-11 10:44, Zizon Qiu wrote:
>
> If you are:
> 1. using TextInputFormat,
> 2. all input files end with a certain suffix like ".gz", and
> 3. the custom CompressionCodec is already registered in the configuration
> and getDefaultExtension() returns the same suffix as described in 2,
>
> then there is nothing else you need to do.
> Hadoop will deal with it automatically.
>
> That means the input key & value in the map method are already decompressed.
>
> But if the original files do not end with a certain suffix, you need
> to write your own InputFormat or subclass TextInputFormat, overriding the
> createRecordReader method so it returns your own RecordReader.
> The InputSplit passed to the InputFormat is actually a FileSplit, from which
> you can retrieve the input file path.
>
> You may also take a look at the isSplitable method declared
> in InputFormat, if your files are not splitable.
>
> For more detail, refer to the TextInputFormat class implementation.
>
> On Wed, Apr 11, 2012 at 4:16 PM, Grzegorz Gunia <sawt...@student.agh.edu.pl> wrote:
>
>> Thanks for your reply! That clears some things up.
>> There is but one problem... My CompressionCodec has to be instantiated on
>> a per-file basis, meaning it needs to know the name of the file it is to
>> compress/decompress. I'm guessing that would not be possible with the
>> current implementation?
>>
>> Or if so, how would I proceed with injecting it with the file name?
>> --
>> Greg
>>
>> On 2012-04-11 10:12, Zizon Qiu wrote:
>>
>> Append your custom codec's full class name to "io.compression.codecs",
>> either in mapred-site.xml or in the Configuration object passed to the Job
>> constructor.
>>
>> The MapReduce framework will try to guess the compression algorithm from
>> the input files' suffix.
>>
>> If the getDefaultExtension() of any CompressionCodec registered in the
>> configuration matches the suffix, Hadoop will instantiate the codec and,
>> if it succeeds, decompress for you automatically.
>>
>> The default value for "io.compression.codecs" is
>> "org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec"
>>
>> On Wed, Apr 11, 2012 at 3:55 PM, Grzegorz Gunia <sawt...@student.agh.edu.pl> wrote:
>>
>>> Hello,
>>> I am trying to apply a custom CompressionCodec to work with MapReduce
>>> jobs, but I haven't found a way to inject it during the reading of input
>>> data, or during the write of the job results.
>>> Am I missing something, or is there no support for compressed files in
>>> the filesystem?
>>>
>>> I am well aware of how to set it up to work during the intermediate
>>> phases of the MapReduce operation, but I just can't find a way to apply it
>>> BEFORE the job takes place...
>>> Is there any other way except simply uncompressing the files I need
>>> prior to scheduling a job?
>>>
>>> Huge thanks for any help you can give me!
>>> --
>>> Greg