Re: Help with Compressed Storage

Yongqiang He Tue, 16 Feb 2010 14:33:21 -0800

Like Zheng said, try:
set hive.exec.compress.output=true;
"set hive.exec.compress.intermediate=true" is not recommended because of the
CPU cost.


Also, in some cases, set hive.merge.mapfiles=false; will help get
better compression.
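
For reference, a minimal sketch of how these settings might be combined in
one session. The table names test1 and test1_comp come from the original
question. The io.seqfile.compression.type line follows the Compressed
Storage wiki page, and the codec line assumes the stock Hadoop property
mapred.output.compression.codec with GzipCodec; neither was suggested in
this thread, so treat both as assumptions to check against your build.

```sql
-- Compress the final query output (the setting recommended above).
SET hive.exec.compress.output=true;

-- Assumption: block-level SequenceFile compression, as on the wiki page.
SET io.seqfile.compression.type=BLOCK;

-- Assumption: stock Hadoop property and codec; any installed codec would do.
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- Optional: skip the map-file merge step, which may help compression in some cases.
SET hive.merge.mapfiles=false;

-- Rewrite the data so the new files are produced with the settings above.
INSERT OVERWRITE TABLE test1_comp SELECT * FROM test1;
```

The settings only affect files written after they are set, so the table
has to be repopulated. If your CLI build does not accept "--" comment
lines, strip them before running.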


On 2/16/10 2:04 PM, "Zheng Shao" <[email protected]> wrote:

> Try googling "Hive compression":
> 
> See
> http://svn.apache.org/viewvc/hadoop/hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java?p2=/hadoop/hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java&p1=/hadoop/hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java&r1=723687&r2=723686&view=diff&pathrev=723687
> 
>     COMPRESSRESULT("hive.exec.compress.output", false),
>     COMPRESSINTERMEDIATE("hive.exec.compress.intermediate", false),
> 
> Hive uses different compression parameters than Hadoop.
> 
> Also, Hive supports using a different compression for intermediate
> results. See https://issues.apache.org/jira/browse/HIVE-759
> 
> 
> Zheng
> 
> On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller <[email protected]>
> wrote:
>> Hello, I've seen issues similar to this one come up once or twice before,
>> but I haven't ever seen a solution to the problem that I'm having. I was
>> following the Compressed Storage page on the Hive
>> Wiki http://wiki.apache.org/hadoop/CompressedStorage and realized that the
>> sequence files that are created in the warehouse directory are actually
>> uncompressed and larger than the originals.
>> For example, I have a table 'test1' whose input data looks something like:
>> 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
>> 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
>> 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
>> ...
>> And after creating a second table 'test1_comp' that was created with the
>> STORED AS SEQUENCEFILE directive and the compression options SET as
>> described in the wiki, I can look at the resultant sequence files and see
>> that they're just plain (uncompressed) text:
>> SEQ "org.apache.hadoop.io.BytesWritable org.apache.hadoop.io.Text+�c�!Y�M ��
>> Z^��= 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43=
>> 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43=
>> 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341=
>> 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141=
>> ...
>> I've tried messing around with different org.apache.hadoop.io.compress.*
>> options, but the sequence files always come out uncompressed. Has anybody
>> ever seen this or know a way to keep the data compressed? Since the input
>> text is so uniform, we get huge space savings from compression and would
>> like to store the data this way if possible. I'm using Hadoop 20.1 and Hive
>> that I checked out from SVN about a week ago.
>> Thanks,
>> Brent
> 
>
