Re: HADOOP-4012 and bzip2 input splitting

Zheng Shao Wed, 21 Apr 2010 23:50:45 -0700

Can you open the "job.xml" link for the map-reduce job that Hive created
and tell me the value of mapred.input.format.class?
Is it HiveInputFormat or CombineHiveInputFormat?


It should work if you set it to org.apache.hadoop.hive.ql.io.HiveInputFormat
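
If it is easier than reading job.xml in the browser, a small script can pull
the property out of a saved copy. This is only a sketch: it assumes the usual
Hadoop configuration layout (property elements with name and value children),
and the sample XML and the helper name get_property are made up for
illustration, not taken from your job.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of a job.xml; a real file has many more properties.
SAMPLE_JOB_XML = """<configuration>
  <property>
    <name>mapred.input.format.class</name>
    <value>org.apache.hadoop.hive.ql.io.CombineHiveInputFormat</value>
  </property>
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
  </property>
</configuration>
"""


def get_property(job_xml_text, name):
    """Return the value of the named property, or None if it is absent."""
    root = ET.fromstring(job_xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


# Report which input format the job was submitted with.
print(get_property(SAMPLE_JOB_XML, "mapred.input.format.class"))
```

For a real job, read the downloaded file's contents and pass them in place of
SAMPLE_JOB_XML.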

Also, can you verify whether
https://issues.apache.org/jira/browse/MAPREDUCE-830 is in your Hadoop
distribution?

Zheng

On Wed, Apr 21, 2010 at 11:31 PM, 김영우 <warwit...@gmail.com> wrote:
> Zheng,
>
> Thanks for your quick reply, but there is only 1 mapper for my job with a
> 300 MB bz2 file.
>
> I added the following to my core-site.xml:
>
> <property>
> <name>io.compression.codecs</name>
> <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
> </property>
>
> My table definition:
>
> create table test_bzip2
> (
> col1 string,
> .
> .
>
> col20 string
> )
> row format delimited
> fields terminated by '\t'
> stored as textfile;
>
> I ran a simple grouping/count query; the following is the query's plan:
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         test_bzip2
>           TableScan
>             alias: test_bzip2
>             Select Operator
>               expressions:
>                     expr: siteid
>                     type: string
>               outputColumnNames: siteid
>               Reduce Output Operator
>                 key expressions:
>                       expr: siteid
>                       type: string
>                 sort order: +
>                 Map-reduce partition columns:
>                       expr: siteid
>                       type: string
>                 tag: -1
>                 value expressions:
>                       expr: 1
>                       type: int
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>                 expr: count(VALUE._col0)
>           bucketGroup: false
>           keys:
>                 expr: KEY._col0
>                 type: string
>           mode: complete
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0, _col1
>             File Output Operator
>               compressed: false
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>
>
> I just verified that bz2 splitting works in my cluster using a simple Pig
> script. The Pig script creates 3 mappers for the M/R job.
>
> What should I check further? Job config info?
>
> - Youngwoo
>
> 2010/4/22 Zheng Shao <zsh...@gmail.com>
>>
>> It should be supported automatically. You don't need to do anything
>> except add the bzip2 codec to io.compression.codecs in the Hadoop
>> configuration file (core-site.xml).
>>
>> Zheng
>>
>> On Wed, Apr 21, 2010 at 10:15 PM, 김영우 <warwit...@gmail.com> wrote:
>> > Hi,
>> >
>> > HADOOP-4012 (https://issues.apache.org/jira/browse/HADOOP-4012) has been
>> > committed, and CDH3 supports bzip2 splitting.
>> > I'm wondering if Hive supports input splitting for bzip2-compressed text
>> > files (*.bz2). If not, should I implement a custom SerDe for
>> > bzip2-compressed files?
>> >
>> > Thanks,
>> > Youngwoo
>> >
>>
>>
>>
>> --
>> Yours,
>> Zheng
>> http://www.linkedin.com/in/zshao
>
>



-- 
Yours,
Zheng
http://www.linkedin.com/in/zshao
