Yes, you have to handle the decompression yourself. Usually you load the
compression codec in your RecordReader. For an example, see how
TextInputFormat's LineRecordReader does it:

https://github.com/apache/hadoop-common/blob/release-1.0.1/src/mapred/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java
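
For a feel of the pattern without Hadoop on the classpath, here is a rough
stand-alone sketch. It is not the Hadoop code: the class CodecAwareReader and
its methods are made up for illustration, and gzip from java.util.zip stands
in for bzip2 because the JDK ships no bzip2 stream. In a real RecordReader the
lookup would go through CompressionCodecFactory.getCodec(path), and the
wrapping through CompressionCodec.createInputStream(in), as in the file linked
above.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in for a RecordReader that picks a codec by file suffix.
public class CodecAwareReader {

    // Plays the role of the codec lookup: wrap the raw stream in a
    // decompressor if the suffix is recognised, otherwise return it as-is.
    static InputStream open(String fileName, InputStream raw) throws IOException {
        if (fileName.endsWith(".gz")) {
            return new GZIPInputStream(raw);
        }
        return raw; // no matching codec: treat the input as uncompressed
    }

    // Reads line-based records from whichever stream open() returned.
    static List<String> readRecords(String fileName, InputStream raw) throws IOException {
        List<String> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(open(fileName, raw), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                records.add(line);
            }
        }
        return records;
    }

    // Helper used only to build compressed sample input in memory.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String xml = "<a>1</a>\n<a>2</a>\n";
        byte[] plain = xml.getBytes(StandardCharsets.UTF_8);
        // Both calls should yield the same two records.
        System.out.println(readRecords("data.xml", new ByteArrayInputStream(plain)));
        System.out.println(readRecords("data.xml.gz", new ByteArrayInputStream(gzip(xml))));
    }
}
```

The point is only that the record-parsing loop never needs to know whether the
input was compressed; the reader decides once, when it opens the stream, and
everything downstream sees plain bytes.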

-Joey

On Wed, Mar 14, 2012 at 11:08 AM, Tony Burton <tbur...@sportingindex.com> wrote:
> Hi - sorry to bump this, but I'm having trouble resolving this.
>
> Essentially the question is: If I create my own InputFormat by subclassing 
> TextInputFormat, does the subclass have to handle its own streaming of 
> compressed data? If so, can anyone point me at an example where this is done?
>
> Thanks!
>
> Tony
>
> -----Original Message-----
> From: Tony Burton [mailto:tbur...@sportingindex.com]
> Sent: 12 March 2012 18:05
> To: common-user@hadoop.apache.org
> Subject: decompressing bzip2 data with a custom InputFormat
>
>  Hi,
>
> I'm setting up a map-only job that reads large bzip2-compressed data files, 
> parses the XML and writes out the same data in plain text format. My XML 
> InputFormat extends TextInputFormat and has a RecordReader based upon the one 
> you can see at http://xmlandhadoop.blogspot.com/ (my version of it works 
> great for uncompressed XML input data). For compressed data, I've added 
> io.compression.codecs to my core-site.xml and set it to 
> o.a.h.io.compress.BZip2Codec. I'm using Hadoop 0.20.2.
>
> Have I forgotten something basic when running a Hadoop job to read compressed 
> data? Or, given that I've written my own InputFormat, should I be using an 
> InputStream that can carry out the decompression itself?
>
> Thanks
>
> Tony
>
> **********************************************************************
> This email and any attachments are confidential, protected by copyright and 
> may be legally privileged.  If you are not the intended recipient, then the 
> dissemination or copying of this email is prohibited. If you have received 
> this in error, please notify the sender by replying by email and then delete 
> the email completely from your system.  Neither Sporting Index nor the sender 
> accepts responsibility for any virus, or any other defect which might affect 
> any computer or IT system into which the email is received and/or opened.  It 
> is the responsibility of the recipient to scan the email and no 
> responsibility is accepted for any loss or damage arising in any way from 
> receipt or use of this email.  Sporting Index Ltd is a company registered in 
> England and Wales with company number 2636842, whose registered office is at 
> Brookfield House, Green Lane, Ivinghoe, Leighton Buzzard, LU7 9ES.  Sporting 
> Index Ltd is authorised and regulated by the UK Financial Services Authority 
> (reg. no. 150404). Any financial promotion contained herein has been issued
> and approved by Sporting Index Ltd.
>
> Outbound email has been scanned for viruses and SPAM
> www.sportingindex.com
> Inbound Email has been scanned for viruses and SPAM
> **********************************************************************



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434
