Yes, you have to handle the compression yourself. Usually you load the compression codec in your RecordReader. For an example, see how TextInputFormat's LineRecordReader does it:
https://github.com/apache/hadoop-common/blob/release-1.0.1/src/mapred/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java

-Joey

On Wed, Mar 14, 2012 at 11:08 AM, Tony Burton <tbur...@sportingindex.com> wrote:
> Hi - sorry to bump this, but I'm having trouble resolving this.
>
> Essentially the question is: If I create my own InputFormat by subclassing
> TextInputFormat, does the subclass have to handle its own streaming of
> compressed data? If so, can anyone point me at an example where this is done?
>
> Thanks!
>
> Tony
>
> -----Original Message-----
> From: Tony Burton [mailto:tbur...@sportingindex.com]
> Sent: 12 March 2012 18:05
> To: common-user@hadoop.apache.org
> Subject: decompressing bzip2 data with a custom InputFormat
>
> Hi,
>
> I'm setting up a map-only job that reads large bzip2-compressed data files,
> parses the XML and writes out the same data in plain text format. My XML
> InputFormat extends TextInputFormat and has a RecordReader based upon the one
> you can see at http://xmlandhadoop.blogspot.com/ (my version of it works
> great for uncompressed XML input data). For compressed data, I've added
> io.compression.codecs to my core-site.xml and set it to
> o.a.h.io.compress.BZip2Codec. I'm using Hadoop 0.20.2.
>
> Have I forgotten something basic when running a Hadoop job to read compressed
> data? Or, given that I've written my own InputFormat, should I be using an
> InputStream that can carry out the decompression itself?
>
> Thanks
>
> Tony
>
> **********************************************************************
> This email and any attachments are confidential, protected by copyright and
> may be legally privileged. If you are not the intended recipient, then the
> dissemination or copying of this email is prohibited. If you have received
> this in error, please notify the sender by replying by email and then delete
> the email completely from your system.
> Neither Sporting Index nor the sender
> accepts responsibility for any virus, or any other defect which might affect
> any computer or IT system into which the email is received and/or opened. It
> is the responsibility of the recipient to scan the email and no
> responsibility is accepted for any loss or damage arising in any way from
> receipt or use of this email. Sporting Index Ltd is a company registered in
> England and Wales with company number 2636842, whose registered office is at
> Brookfield House, Green Lane, Ivinghoe, Leighton Buzzard, LU7 9ES. Sporting
> Index Ltd is authorised and regulated by the UK Financial Services Authority
> (reg. no. 150404). Any financial promotion contained herein has been issued
> and approved by Sporting Index Ltd.
>
> Outbound email has been scanned for viruses and SPAM
> www.sportingindex.com
> Inbound Email has been scanned for viruses and SPAM

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
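
To illustrate the pattern a RecordReader follows when it handles compression itself, here is a minimal sketch using only the JDK. It is not Hadoop code: the class CodecAwareLineReader and its methods are hypothetical names, and gzip stands in for bzip2 because the JDK ships no bzip2 decoder. In the linked LineRecordReader, the corresponding steps are, as far as I recall, a CompressionCodecFactory lookup by file path followed by wrapping the file stream with the codec's createInputStream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical stand-in for a RecordReader that deals with compression itself. */
class CodecAwareLineReader {

    /** Pick a decompressor from the file extension and wrap the raw stream with it. */
    static InputStream open(String fileName, InputStream raw) throws IOException {
        if (fileName.endsWith(".gz")) {
            return new GZIPInputStream(raw); // known codec: decompress on the fly
        }
        return raw;                          // no codec matched: read the bytes as-is
    }

    /** Read every line ("record") from the possibly compressed stream. */
    static List<String> readLines(String fileName, InputStream raw) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(open(fileName, raw), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    /** Helper for trying the sketch out: gzip a string in memory. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

Calling readLines with a name ending in .gz and the bytes from gzip(...) returns the original lines, while any other name is read without decompression. The point is only that the reader, not the framework, decides whether to wrap the stream; a custom XML RecordReader would need the same step before it starts scanning for tags.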