Hi Saurabh,

Can you help put that information into an appropriate place on the wiki (wherever you see fit)? Thanks for the help.
By the way, I guess we need to debug what went wrong with the "count(1)" queries. There is definitely something going wrong.

For the timing, how many mapper slots do you have in your cluster?

I think you might want to consider this:

Approach #3:
a) Import the gzip files into a TextFile table.
b) Set hive.exec.compress.output to true.
c) Insert into a SequenceFile table.

This will create bigger SequenceFiles, which will help reduce the overhead. This is better than Approach #2 because jobs on the SequenceFile tables will have more mappers.

Zheng

On Sat, Jul 25, 2009 at 3:48 AM, Saurabh Nanda <[email protected]> wrote:
>
>> TextFile means a plain text file (records delimited by "\n").
>> Compressed TextFiles are just text files compressed by the gzip or bzip2
>> utility. SequenceFile is a special file format that only Hadoop can
>> understand.
>> Since your files are compressed TextFiles, you have to create a table
>> with TextFile format in order to load the data without any
>> conversion.
>> (Compression is detected automatically for both TextFile and
>> SequenceFile - you don't need to specify it when creating a table.)
>
> This really clears things up. I guess adding a note in the wiki will put an
> end to the confusion permanently. A little note on the approach (compressed
> textfile vs compressed sequencefile) with the best performance would also be
> appreciated.
>
> Saurabh.
> --
> http://nandz.blogspot.com
> http://foodieforlife.blogspot.com

--
Yours,
Zheng
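
[Editor's note: the three steps of Approach #3 above could be sketched in HiveQL roughly as follows. Table names, column names, and the input path are hypothetical, not from the thread; this is a sketch under those assumptions, not a definitive recipe.]

```sql
-- a) Create a TextFile table and load the existing gzip files into it.
--    Compression is detected automatically, so nothing extra is declared.
CREATE TABLE logs_text (line STRING)
STORED AS TEXTFILE;

LOAD DATA INPATH '/data/logs/part-00000.gz' INTO TABLE logs_text;

-- b) Tell Hive to compress the output of the conversion job.
SET hive.exec.compress.output=true;

-- c) Copy the data into a SequenceFile table, producing larger files.
CREATE TABLE logs_seq (line STRING)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE logs_seq
SELECT line FROM logs_text;
```

Because SequenceFiles are splittable (unlike gzip-compressed text files), later queries against `logs_seq` can be divided among more mappers, which is the benefit Zheng describes over Approach #2.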
