Zeng,

Thanks for your quick reply, but there is only 1 mapper for my job on a 300
MB bz2 file.

I added the following in my core-site.xml

<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>

My table definition:

create table test_bzip2
(
col1 string,
.
.

col20 string
)
row format delimited
fields terminated by '\t'
stored as textfile;

I ran a simple grouping/count query, and the following is the query's plan:
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        test_bzip2
          TableScan
            alias: test_bzip2
            Select Operator
              expressions:
                    expr: siteid
                    type: string
              outputColumnNames: siteid
              Reduce Output Operator
                key expressions:
                      expr: siteid
                      type: string
                sort order: +
                Map-reduce partition columns:
                      expr: siteid
                      type: string
                tag: -1
                value expressions:
                      expr: 1
                      type: int
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: complete
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
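For reference, the plan above corresponds to a query along these lines (reconstructed from the plan, which shows siteid as the grouping column; this is a sketch, not the exact statement I ran):

```sql
-- Inferred from the plan above: group by siteid and count rows
SELECT siteid, COUNT(1)
FROM test_bzip2
GROUP BY siteid;
```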


I just verified that bz2 splitting works in my cluster using a simple Pig
script; the Pig script creates 3 mappers for its M/R job.
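Something like the following minimal Pig script is enough to exercise the splitting (the input path here is a placeholder, not the actual file I used):

```pig
-- Load the bz2 text file and count its rows; with splittable bzip2,
-- the LOAD should fan out to multiple mappers on a 300 MB input.
A = LOAD '/user/youngwoo/test.bz2' USING PigStorage('\t');
B = GROUP A ALL;
C = FOREACH B GENERATE COUNT(A);
DUMP C;
```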

What should I check further? Job config info?

- Youngwoo

2010/4/22 Zheng Shao <[email protected]>

> It should be automatically supported. You don't need to do anything
> except add the bzip2 codec to io.compression.codecs in the Hadoop
> configuration files (core-site.xml).
>
> Zheng
>
> On Wed, Apr 21, 2010 at 10:15 PM, 김영우 <[email protected]> wrote:
> > Hi,
> >
> > HADOOP-4012, https://issues.apache.org/jira/browse/HADOOP-4012, has been
> > committed, and CDH3 supports bzip2 splitting.
> > I'm wondering if Hive supports input splitting for bzip2-compressed text
> > files (*.bz2). If not, should I implement a custom SerDe for bzip2
> compressed
> > files?
> >
> > Thanks,
> > Youngwoo
> >
>
>
>
> --
> Yours,
> Zheng
> http://www.linkedin.com/in/zshao
>
