There is no special setting for bz2. Can you get the debug log?
Zheng

On Wed, Feb 17, 2010 at 9:02 PM, prasenjit mukherjee
<pmukher...@quattrowireless.com> wrote:
> So I tried the same with .gz files and it worked. I am using the following
> hadoop version: Hadoop 0.20.1+169.56 with Cloudera's ami-2359bf4a. I thought
> that Hadoop 0.20 does support bz2 compression, hence the same should work
> with Hive as well.
>
> An interesting note is that Pig works fine on the same bz2 data. Is there any
> tweaking/config setup I need to do for Hive to take bz2 files as input?
>
> On Thu, Feb 18, 2010 at 8:31 AM, prasenjit mukherjee
> <pmukher...@quattrowireless.com> wrote:
>>
>> I have a similar issue with bz2 files. I have the hadoop directories:
>>
>> /ip/data/ : containing unzipped text files (foo1.txt, foo2.txt)
>> /ip/datacompressed/ : containing the same files bzipped (foo1.bz2, foo2.bz2)
>>
>> CREATE EXTERNAL TABLE tx_log(id1 string, id2 string, id3 string)
>> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002'
>> LOCATION '/ip/datacompressed/';
>> SELECT * FROM tx_log LIMIT 1;
>>
>> The command works fine with LOCATION '/ip/data/' but doesn't work with
>> LOCATION '/ip/datacompressed/'.
>>
>> Any pointers? I thought that (like Pig) Hive automatically detects the .bz2
>> extension and applies the appropriate decompression. Am I wrong?
>>
>> -Prasen
>>
>> On Thu, Feb 18, 2010 at 3:04 AM, Zheng Shao <zsh...@gmail.com> wrote:
>>>
>>> I just corrected the wiki page. It would also be a good idea to support
>>> case-insensitive boolean values in the code.
>>>
>>> Zheng
>>>
>>> On Wed, Feb 17, 2010 at 9:27 AM, Brent Miller <brentalanmil...@gmail.com>
>>> wrote:
>>> > Thanks Adam, that works for me as well.
>>> > It seems that the property hive.exec.compress.output is case sensitive,
>>> > and when it is set to TRUE (as it was on the compressed storage page on
>>> > the wiki) it is ignored by Hive.
>>> >
>>> > -Brent
>>> >
>>> > On Tue, Feb 16, 2010 at 4:24 PM, Adam O'Donnell <a...@immunet.com>
>>> > wrote:
>>> >>
>>> >> Adding these to my hive-site.xml file worked fine:
>>> >>
>>> >> <property>
>>> >>   <name>hive.exec.compress.output</name>
>>> >>   <value>true</value>
>>> >>   <description>Compress output</description>
>>> >> </property>
>>> >>
>>> >> <property>
>>> >>   <name>mapred.output.compression.type</name>
>>> >>   <value>BLOCK</value>
>>> >>   <description>Block compression</description>
>>> >> </property>
>>> >>
>>> >> On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller
>>> >> <brentalanmil...@gmail.com> wrote:
>>> >> > Hello, I've seen issues similar to this one come up once or twice
>>> >> > before, but I haven't ever seen a solution to the problem that I'm
>>> >> > having. I was following the Compressed Storage page on the Hive
>>> >> > wiki (http://wiki.apache.org/hadoop/CompressedStorage) and realized
>>> >> > that the sequence files created in the warehouse directory are
>>> >> > actually uncompressed and larger than the originals.
>>> >> > For example, I have a table 'test1' whose input data looks something
>>> >> > like:
>>> >> > 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
>>> >> > 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
>>> >> > 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
>>> >> > ...
>>> >> > And after creating a second table 'test1_comp' that was created with
>>> >> > the STORED AS SEQUENCEFILE directive and the compression options SET
>>> >> > as described in the wiki, I can look at the resultant sequence files
>>> >> > and see that they're just plain (uncompressed) text:
>>> >> > SEQ "org.apache.hadoop.io.BytesWritable org.apache.hadoop.io.Text+�c�!Y�M �� Z^��=
>>> >> > 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43=
>>> >> > 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43=
>>> >> > 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341=
>>> >> > 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141=
>>> >> > ...
>>> >> > I've tried messing around with different
>>> >> > org.apache.hadoop.io.compress.* options, but the sequence files
>>> >> > always come out uncompressed. Has anybody ever seen this or know a
>>> >> > way to keep the data compressed? Since the input text is so uniform,
>>> >> > we get huge space savings from compression and would like to store
>>> >> > the data this way if possible. I'm using Hadoop 0.20.1 and Hive that
>>> >> > I checked out from SVN about a week ago.
>>> >> > Thanks,
>>> >> > Brent
>>> >>
>>> >> --
>>> >> Adam J. O'Donnell, Ph.D.
>>> >> Immunet Corporation
>>> >> Cell: +1 (267) 251-0070
>>>
>>> --
>>> Yours,
>>> Zheng

--
Yours,
Zheng
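[Editor's note] One way to confirm what Brent is seeing, without eyeballing raw bytes, is to read the compression flags recorded in the SequenceFile header itself: after the `SEQ` magic and a version byte, the header stores the key and value class names (each as a single vint length byte followed by UTF-8, for names under 128 bytes) and then two booleans for record and block compression. The stray `"` right after `SEQ` in Brent's dump is plausibly that first length byte (0x22 = 34, the length of `org.apache.hadoop.io.BytesWritable`). The Python sketch below parses just enough of the header to report the compression mode; it is a diagnostic sketch under those layout assumptions, not Hadoop's actual SequenceFile reader.

```python
def sequencefile_compression(header_bytes):
    """Report the compression mode recorded in a Hadoop SequenceFile header.

    Assumes a modern header (version >= 4) and class names under 128 bytes,
    so each length prefix is a single vint byte.
    Returns one of: 'record', 'block', 'none'.
    """
    if header_bytes[:3] != b"SEQ":
        raise ValueError("not a SequenceFile: missing SEQ magic")
    pos = 4  # skip the 3-byte magic and the version byte
    # Skip the key class name, then the value class name:
    # each is a vint length byte followed by that many UTF-8 bytes.
    for _ in range(2):
        name_len = header_bytes[pos]
        pos += 1 + name_len
    compressed = header_bytes[pos] != 0            # record-compression flag
    block_compressed = header_bytes[pos + 1] != 0  # block-compression flag
    if block_compressed:
        return "block"
    if compressed:
        return "record"
    return "none"

# Quick check against a tiny synthetic header built the same way
# (an uncompressed file with BytesWritable keys and Text values):
key_cls = b"org.apache.hadoop.io.BytesWritable"
val_cls = b"org.apache.hadoop.io.Text"
header = (b"SEQ" + bytes([6])
          + bytes([len(key_cls)]) + key_cls
          + bytes([len(val_cls)]) + val_cls
          + bytes([0, 0]))  # compressed=false, blockCompressed=false
print(sequencefile_compression(header))  # -> none
```

Run against the files in Brent's warehouse directory, a result of `none` would confirm that the compression settings were never applied at write time.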
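[Editor's note] The case-sensitivity pitfall Brent hit (hive.exec.compress.output=TRUE being silently ignored) is exactly what Zheng's suggested fix, accepting case-insensitive boolean values, would avoid. A minimal sketch of that lenient parsing, in Python rather than Hive's actual Java code, with `parse_bool` as a hypothetical name:

```python
def parse_bool(value, default=False):
    """Parse a config value leniently: accept 'true'/'TRUE'/'True'
    (and the 'false' variants) regardless of case or surrounding
    whitespace, falling back to a default for anything unrecognized."""
    if value is None:
        return default
    normalized = value.strip().lower()
    if normalized == "true":
        return True
    if normalized == "false":
        return False
    return default  # unrecognized values fall back to the default

# With strict, case-sensitive matching, "TRUE" is silently treated as
# unset; with lenient parsing it enables compression as intended.
print(parse_bool("TRUE"))   # True
print(parse_bool("true"))   # True
print(parse_bool("FALSE"))  # False
```

The key design point is that a config typo degrades to the default rather than raising, which matches how the setting failed silently in the thread.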