So I tried the same with .gz files and it worked. I am using the following Hadoop version: Hadoop 0.20.1+169.56 with Cloudera's ami-2359bf4a. I thought that Hadoop 0.20 supports bz2 compression, hence the same should work with Hive as well.
Interesting note: Pig works fine on the same bz2 data. Is there any tweaking/config setup I need to do for Hive to take bz2 files as input? (A sketch of one config property worth checking follows below, after the quoted thread.)

On Thu, Feb 18, 2010 at 8:31 AM, prasenjit mukherjee <[email protected]> wrote:
> I have a similar issue with bz2 files. I have the Hadoop directories:
>
> /ip/data/ : containing unzipped text files (foo1.txt, foo2.txt)
> /ip/datacompressed/ : containing the same files bzipped (foo1.bz2, foo2.bz2)
>
> CREATE EXTERNAL TABLE tx_log(id1 string, id2 string, id3 string)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002'
> LOCATION '/ip/datacompressed/';
> SELECT * FROM tx_log LIMIT 1;
>
> The command works fine with LOCATION '/ip/data/' but doesn't work with
> LOCATION '/ip/datacompressed/'.
>
> Any pointers? I thought (like Pig) Hive automatically detects .bz2
> extensions and applies the appropriate decompression. Am I wrong?
>
> -Prasen
>
> On Thu, Feb 18, 2010 at 3:04 AM, Zheng Shao <[email protected]> wrote:
>> I just corrected the wiki page. It would also be a good idea to support
>> case-insensitive boolean values in the code.
>>
>> Zheng
>>
>> On Wed, Feb 17, 2010 at 9:27 AM, Brent Miller <[email protected]> wrote:
>>> Thanks Adam, that works for me as well.
>>> It seems that the hive.exec.compress.output property is case sensitive,
>>> and when it is set to TRUE (as it is on the compressed storage page on
>>> the wiki) it is ignored by Hive.
>>>
>>> -Brent
>>>
>>> On Tue, Feb 16, 2010 at 4:24 PM, Adam O'Donnell <[email protected]> wrote:
>>>> Adding these to my hive-site.xml file worked fine:
>>>>
>>>> <property>
>>>>   <name>hive.exec.compress.output</name>
>>>>   <value>true</value>
>>>>   <description>Compress output</description>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>mapred.output.compression.type</name>
>>>>   <value>BLOCK</value>
>>>>   <description>Block compression</description>
>>>> </property>
>>>>
>>>> On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller <[email protected]> wrote:
>>>>> Hello, I've seen issues similar to this one come up once or twice
>>>>> before, but I haven't ever seen a solution to the problem that I'm
>>>>> having. I was following the Compressed Storage page on the Hive wiki
>>>>> (http://wiki.apache.org/hadoop/CompressedStorage) and realized that the
>>>>> sequence files that are created in the warehouse directory are actually
>>>>> uncompressed and larger than the originals.
>>>>> For example, I have a table 'test1' whose input data looks something like:
>>>>> 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
>>>>> 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
>>>>> 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
>>>>> ...
>>>>> After creating a second table 'test1_comp' with the STORED AS
>>>>> SEQUENCEFILE directive and the compression options SET as described in
>>>>> the wiki, I can look at the resultant sequence files and see that they
>>>>> are just plain (uncompressed) text:
>>>>> SEQ "org.apache.hadoop.io.BytesWritable org.apache.hadoop.io.Text+�c�!Y�M ��Z^��=
>>>>> 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43=
>>>>> 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43=
>>>>> 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341=
>>>>> 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141=
>>>>> ...
>>>>> I've tried messing around with different org.apache.hadoop.io.compress.*
>>>>> options, but the sequence files always come out uncompressed. Has anybody
>>>>> ever seen this or know a way to keep the data compressed? Since the input
>>>>> text is so uniform, we get huge space savings from compression and would
>>>>> like to store the data this way if possible. I'm using Hadoop 0.20.1 and
>>>>> Hive that I checked out from SVN about a week ago.
>>>>> Thanks,
>>>>> Brent
>>>>
>>>> --
>>>> Adam J. O'Donnell, Ph.D.
>>>> Immunet Corporation
>>>> Cell: +1 (267) 251-0070
>>
>> --
>> Yours,
>> Zheng
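PS: On the bz2 question at the top of this mail, one knob worth double-checking (an assumption on my part, not something anyone has confirmed in this thread): Hadoop's CompressionCodecFactory only recognises file extensions of codecs listed in io.compression.codecs, and Hive passes session-level SET values into the configuration of the jobs it runs. The sketch below just makes sure BZip2Codec is on that list before re-running the query from the quoted mail; tx_log is the table defined there, and the codec list shown is the stock Hadoop 0.20 default, which may differ on a given cluster.

-- Assumption: the cluster's codec list might be missing BZip2Codec.
-- The value below is the stock Hadoop 0.20 default; adjust to match the cluster.
SET io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec;

-- Re-run the query against the bzipped location; if the codec list was the
-- problem, the rows should now come back decompressed.
SELECT * FROM tx_log LIMIT 1;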
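And for the compressed-output question further down the thread, here is roughly the per-session equivalent of the hive-site.xml properties Adam posted, as the wiki describes them after Zheng's correction. The table and column names are placeholders rather than the real test1 schema, and the codec choice is only an example; the two points that matter are that the boolean has to be lowercase true, and that the settings only apply to data Hive itself writes.

-- Session-level equivalents of the hive-site.xml properties quoted above.
-- The value must be lowercase "true"; TRUE is ignored (per Brent's note).
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
-- Codec is an example; any codec installed on the cluster can be used here.
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- Placeholder schema: the real test1 table has more columns than this.
CREATE TABLE test1_comp (c1 STRING, c2 STRING, c3 STRING)
STORED AS SEQUENCEFILE;

-- Populate the table with an INSERT so the compression settings take effect;
-- copying files into the warehouse directory by hand bypasses them.
INSERT OVERWRITE TABLE test1_comp
SELECT c1, c2, c3 FROM test1;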
