Re: HADOOP-4012 and bzip2 input splitting

김영우 Thu, 22 Apr 2010 00:11:00 -0700

Zheng,

It's 'org.apache.hadoop.hive.ql.io.HiveInputFormat'. I don't know for
certain whether MAPREDUCE-830 is in CDH3; I could not find any clues.


Thanks for your help.

- Youngwoo

2010/4/22 Zheng Shao <[email protected]>

> Can you take a look at the "job.xml" link in your map-reduce job
> created by Hive and let me know the mapred.input.format.class?
> Is it HiveInputFormat or CombineHiveInputFormat?
>
> It should work if you set it to
> org.apache.hadoop.hive.ql.io.HiveInputFormat
>
> Also, can you verify if
> https://issues.apache.org/jira/browse/MAPREDUCE-830 is in your hadoop
> distribution or not?
>
> Zheng
>
> On Wed, Apr 21, 2010 at 11:31 PM, 김영우 <[email protected]> wrote:
> > Zheng,
> >
> > Thanks for your quick reply, but there is only 1 mapper for my job with a
> > 300 MB bz2 file.
> >
> > I added the following in my core-site.xml
> >
> > <property>
> > <name>io.compression.codecs</name>
> > <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
> > </property>
> >
> > My table definition:
> >
> > create table test_bzip2
> > (
> > col1 string,
> > .
> > .
> >
> > col20 string
> > )
> > row format delimited
> > fields terminated by '\t'
> > stored as textfile;
> >
> > I ran a simple grouping/count query; the following is the query's plan:
> > STAGE PLANS:
> >   Stage: Stage-1
> >     Map Reduce
> >       Alias -> Map Operator Tree:
> >         test_bzip2
> >           TableScan
> >             alias: test_bzip2
> >             Select Operator
> >               expressions:
> >                     expr: siteid
> >                     type: string
> >               outputColumnNames: siteid
> >               Reduce Output Operator
> >                 key expressions:
> >                       expr: siteid
> >                       type: string
> >                 sort order: +
> >                 Map-reduce partition columns:
> >                       expr: siteid
> >                       type: string
> >                 tag: -1
> >                 value expressions:
> >                       expr: 1
> >                       type: int
> >       Reduce Operator Tree:
> >         Group By Operator
> >           aggregations:
> >                 expr: count(VALUE._col0)
> >           bucketGroup: false
> >           keys:
> >                 expr: KEY._col0
> >                 type: string
> >           mode: complete
> >           outputColumnNames: _col0, _col1
> >           Select Operator
> >             expressions:
> >                   expr: _col0
> >                   type: string
> >                   expr: _col1
> >                   type: bigint
> >             outputColumnNames: _col0, _col1
> >             File Output Operator
> >               compressed: false
> >               GlobalTableId: 0
> >               table:
> >                   input format: org.apache.hadoop.mapred.TextInputFormat
> >                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> >
> >   Stage: Stage-0
> >     Fetch Operator
> >       limit: -1
> >
> >
> > I just verified that bz2 splitting works in my cluster using a simple Pig
> > script. The Pig script creates 3 mappers for the M/R job.
> >
> > What should I check further? Job config info?
> >
> > - Youngwoo
> >
> > 2010/4/22 Zheng Shao <[email protected]>
> >>
> >> It should be automatically supported. You don't need to do anything
> >> except add the bzip2 codec to io.compression.codecs in the Hadoop
> >> configuration files (core-site.xml).
> >>
> >> Zheng
> >>
> >> On Wed, Apr 21, 2010 at 10:15 PM, 김영우 <[email protected]> wrote:
> >> > Hi,
> >> >
> >> > HADOOP-4012, https://issues.apache.org/jira/browse/HADOOP-4012 has been
> >> > committed, and CDH3 supports bzip2 splitting.
> >> > I'm wondering if Hive supports input splitting for bzip2-compressed text
> >> > files (*.bz2). If not, should I implement a custom SerDe for
> >> > bzip2-compressed files?
> >> >
> >> > Thanks,
> >> > Youngwoo
> >> >
> >>
> >>
> >>
> >> --
> >> Yours,
> >> Zheng
> >> http://www.linkedin.com/in/zshao
> >
> >
>
>
>
> --
> Yours,
> Zheng
> http://www.linkedin.com/in/zshao
>
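
The thread asks which mapred.input.format.class the job ran with and whether
the bzip2 codec is listed in io.compression.codecs. Below is a minimal sketch
of how those two values could be read from a saved job.xml, assuming the file
uses the usual Hadoop <configuration>/<property> layout. The sample XML and
the helper names read_properties and summarize are hypothetical, made up for
illustration; they are not part of Hive or Hadoop.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the Hadoop configuration XML layout. A real job.xml
# would be saved from the "job.xml" link on the job's page instead.
SAMPLE_JOB_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.input.format.class</name>
    <value>org.apache.hadoop.hive.ql.io.HiveInputFormat</value>
  </property>
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
  </property>
</configuration>
"""

BZIP2_CODEC = "org.apache.hadoop.io.compress.BZip2Codec"


def read_properties(xml_text):
    """Return a {name: value} dict built from <property> elements."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def summarize(props):
    """Pick out the two settings discussed in this thread."""
    raw = props.get("io.compression.codecs", "")
    codecs = [c.strip() for c in raw.split(",") if c.strip()]
    return {
        "input_format": props.get("mapred.input.format.class"),
        "bzip2_codec_listed": BZIP2_CODEC in codecs,
    }


if __name__ == "__main__":
    print(summarize(read_properties(SAMPLE_JOB_XML)))
```

To check a real job, replace SAMPLE_JOB_XML with the contents of the saved
job.xml. This only reports what the job was configured with; it does not show
whether the Hadoop build itself can split bzip2 input.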
