native support for gzipped text files
--------------------------------------
Key: HADOOP-374
URL: http://issues.apache.org/jira/browse/HADOOP-374
Project: Hadoop
Issue Type: New Feature
Components: mapred
Reporter: Yoram Arnon
In many cases it is convenient to store text files in DFS as gzip-compressed
files.
It would be good to have built-in support for processing these files in a
MapReduce job.
Because a gzip stream cannot be decompressed starting from an arbitrary
offset, the getSplits implementation should return a single split per input
file, ignoring the numSplits parameter.
One could subclass InputFormatBase and have getSplits simply call
listPaths(), then construct and return one split per path returned.
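To see why one split per file is the right choice, here is a self-contained illustration using only java.util.zip (no Hadoop APIs; the class name GzipNoSeek is made up for this sketch): decompression works from byte 0, where the gzip header sits, but fails from any mid-stream offset, which is exactly what a second map task reading the middle of the file would attempt.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipNoSeek {

    // Try to decompress the data starting at the given byte offset.
    // Gzip has a single header followed by a back-referencing DEFLATE
    // stream, so only offset 0 can succeed.
    public static boolean decodableFrom(byte[] data, int offset) {
        try (GZIPInputStream gz = new GZIPInputStream(
                new ByteArrayInputStream(
                    Arrays.copyOfRange(data, offset, data.length)))) {
            while (gz.read() != -1) { }  // drain the whole stream
            return true;
        } catch (IOException e) {        // ZipException: not in GZIP format
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write("some text that would span several splits\n".getBytes());
        }
        byte[] data = buf.toByteArray();
        System.out.println(decodableFrom(data, 0));               // from the start
        System.out.println(decodableFrom(data, data.length / 2)); // from mid-stream
    }
}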
The code for reading would look something like this (courtesy of Vijay Murthy):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

public RecordReader getRecordReader(FileSystem fs, FileSplit split,
                                    JobConf job, Reporter reporter)
    throws IOException {
  // Decompress the whole file as one stream; gzip cannot be entered mid-file.
  final BufferedReader in =
      new BufferedReader(new InputStreamReader(
          new GZIPInputStream(fs.open(split.getPath()))));
  return new RecordReader() {
    long position;  // uncompressed bytes consumed so far

    public synchronized boolean next(Writable key, Writable value)
        throws IOException {
      String line = in.readLine();
      if (line != null) {
        position += line.length() + 1;  // +1 for the newline readLine() strips
        ((UTF8)value).set(line);
        return true;
      }
      return false;  // end of stream
    }

    public synchronized long getPos() throws IOException {
      return position;
    }

    public synchronized void close() throws IOException {
      in.close();
    }
  };
}
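The decompression side of that record reader can be exercised outside Hadoop. The following is a minimal standalone sketch (plain java.util.zip; the class name GzipLines and the in-memory gzip helper are illustrative, not part of the proposed patch): each line of the gzipped stream becomes one record, and position tracks the uncompressed offset including the newline the reader consumed, which is what getPos() would report.

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipLines {

    // Read every line from a gzip-compressed stream, mirroring the
    // record reader's next() loop and its position accounting.
    public static List<String> readGzipLines(InputStream raw) throws IOException {
        List<String> records = new ArrayList<>();
        long position = 0;
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(raw),
                                      StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                position += line.length() + 1;  // +1 for the stripped newline
                records.add(line);
            }
        }
        return records;
    }

    // Helper to gzip some text in memory for the demo.
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = gzip("first record\nsecond record\n");
        System.out.println(readGzipLines(new ByteArrayInputStream(data)));
    }
}

In the real InputFormat the raw stream would come from fs.open(split.getPath()) instead of an in-memory buffer, but the line-to-record loop is the same.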
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira