There is no special setting for bz2. Can you get the debug log?
Zheng

On Wed, Feb 17, 2010 at 9:02 PM, prasenjit mukherjee
<pmukher...@quattrowireless.com> wrote:
> So I tried the same with .gz files and it worked. I am using the following
> hadoop version: Hadoop 0.20.1+169.56 with Cloudera's ami-2359bf4a. I thought
> that Hadoop 0.20 does support bz2 compression, hence the same should work
> with Hive as well.
>
> An interesting note is that Pig works fine on the same bz2 data. Is there any
> tweaking/config setup I need to do for Hive to take bz2 files as input?
>
> On Thu, Feb 18, 2010 at 8:31 AM, prasenjit mukherjee
> <pmukher...@quattrowireless.com> wrote:
>>
>> I have a similar issue with bz2 files. I have the hadoop directories:
>>
>> /ip/data/ : containing unzipped text files (foo1.txt, foo2.txt)
>> /ip/datacompressed/ : containing the same files bzipped (foo1.bz2, foo2.bz2)
>>
>> CREATE EXTERNAL TABLE tx_log(id1 string, id2 string, id3 string)
>> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002'
>> LOCATION '/ip/datacompressed/';
>> SELECT * FROM tx_log LIMIT 1;
>>
>> The command works fine with LOCATION '/ip/data/' but doesn't work with
>> LOCATION '/ip/datacompressed/'.
>>
>> Any pointers? I thought that (like Pig) Hive automatically detects the .bz2
>> extension and applies the appropriate decompression. Am I wrong?
>>
>> -Prasen
>>
>> On Thu, Feb 18, 2010 at 3:04 AM, Zheng Shao <zsh...@gmail.com> wrote:
>>>
>>> I just corrected the wiki page. It would also be a good idea to support
>>> case-insensitive boolean values in the code.
>>>
>>> Zheng
>>>
>>> On Wed, Feb 17, 2010 at 9:27 AM, Brent Miller <brentalanmil...@gmail.com>
>>> wrote:
>>> > Thanks Adam, that works for me as well.
>>> > It seems that the property hive.exec.compress.output is case sensitive,
>>> > and when it is set to TRUE (as it was on the compressed storage page on
>>> > the wiki) it is ignored by Hive.
>>> >
>>> > -Brent
>>> >
>>> > On Tue, Feb 16, 2010 at 4:24 PM, Adam O'Donnell <a...@immunet.com>
>>> > wrote:
>>> >>
>>> >> Adding these to my hive-site.xml file worked fine:
>>> >>
>>> >> <property>
>>> >>   <name>hive.exec.compress.output</name>
>>> >>   <value>true</value>
>>> >>   <description>Compress output</description>
>>> >> </property>
>>> >>
>>> >> <property>
>>> >>   <name>mapred.output.compression.type</name>
>>> >>   <value>BLOCK</value>
>>> >>   <description>Block compression</description>
>>> >> </property>
>>> >>
>>> >> On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller
>>> >> <brentalanmil...@gmail.com> wrote:
>>> >> > Hello, I've seen issues similar to this one come up once or twice
>>> >> > before, but I haven't ever seen a solution to the problem that I'm
>>> >> > having. I was following the Compressed Storage page on the Hive
>>> >> > wiki (http://wiki.apache.org/hadoop/CompressedStorage) and realized
>>> >> > that the sequence files created in the warehouse directory are
>>> >> > actually uncompressed and larger than the originals.
>>> >> > For example, I have a table 'test1' whose input data looks something
>>> >> > like:
>>> >> > 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
>>> >> > 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
>>> >> > 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
>>> >> > ...
>>> >> > And after creating a second table 'test1_comp' that was created with
>>> >> > the STORED AS SEQUENCEFILE directive and the compression options SET
>>> >> > as described in the wiki, I can look at the resultant sequence files
>>> >> > and see that they're just plain (uncompressed) text:
>>> >> > SEQ "org.apache.hadoop.io.BytesWritable org.apache.hadoop.io.Text+�c�!Y�M �� Z^��=
>>> >> > 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43=
>>> >> > 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43=
>>> >> > 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341=
>>> >> > 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141=
>>> >> > ...
>>> >> > I've tried messing around with different
>>> >> > org.apache.hadoop.io.compress.* options, but the sequence files
>>> >> > always come out uncompressed. Has anybody ever seen this or know a
>>> >> > way to keep the data compressed? Since the input text is so uniform,
>>> >> > we get huge space savings from compression and would like to store
>>> >> > the data this way if possible. I'm using Hadoop 0.20.1 and Hive that
>>> >> > I checked out from SVN about a week ago.
>>> >> > Thanks,
>>> >> > Brent
>>> >>
>>> >> --
>>> >> Adam J. O'Donnell, Ph.D.
>>> >> Immunet Corporation
>>> >> Cell: +1 (267) 251-0070
>>>
>>> --
>>> Yours,
>>> Zheng

--
Yours,
Zheng
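[Editor's note] One way to confirm what Brent is seeing, without eyeballing raw bytes, is to read the compression flags recorded in the SequenceFile header itself: after the `SEQ` magic and a version byte, the header stores the key and value class names (each as a single vint length byte followed by UTF-8, for names under 128 bytes) and then two booleans for record and block compression. The stray `"` right after `SEQ` in Brent's dump is plausibly that first length byte (0x22 = 34, the length of `org.apache.hadoop.io.BytesWritable`). The Python sketch below parses just enough of the header to report the compression mode; it is a diagnostic sketch under those layout assumptions, not Hadoop's actual SequenceFile reader.

```python
def sequencefile_compression(header_bytes):
    """Report the compression mode recorded in a Hadoop SequenceFile header.

    Assumes a modern header (version >= 4) and class names under 128 bytes,
    so each length prefix is a single vint byte.
    Returns one of: 'record', 'block', 'none'.
    """
    if header_bytes[:3] != b"SEQ":
        raise ValueError("not a SequenceFile: missing SEQ magic")
    pos = 4  # skip the 3-byte magic and the version byte
    # Skip the key class name, then the value class name:
    # each is a vint length byte followed by that many UTF-8 bytes.
    for _ in range(2):
        name_len = header_bytes[pos]
        pos += 1 + name_len
    compressed = header_bytes[pos] != 0            # record-compression flag
    block_compressed = header_bytes[pos + 1] != 0  # block-compression flag
    if block_compressed:
        return "block"
    if compressed:
        return "record"
    return "none"

# Quick check against a tiny synthetic header built the same way
# (an uncompressed file with BytesWritable keys and Text values):
key_cls = b"org.apache.hadoop.io.BytesWritable"
val_cls = b"org.apache.hadoop.io.Text"
header = (b"SEQ" + bytes([6])
          + bytes([len(key_cls)]) + key_cls
          + bytes([len(val_cls)]) + val_cls
          + bytes([0, 0]))  # compressed=false, blockCompressed=false
print(sequencefile_compression(header))  # -> none
```

Run against the files in Brent's warehouse directory, a result of `none` would confirm that the compression settings were never applied at write time.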
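[Editor's note] The case-sensitivity pitfall Brent hit (hive.exec.compress.output=TRUE being silently ignored) is exactly what Zheng's suggested fix, accepting case-insensitive boolean values, would avoid. A minimal sketch of that lenient parsing, in Python rather than Hive's actual Java code, with `parse_bool` as a hypothetical name:

```python
def parse_bool(value, default=False):
    """Parse a config value leniently: accept 'true'/'TRUE'/'True'
    (and the 'false' variants) regardless of case or surrounding
    whitespace, falling back to a default for anything unrecognized."""
    if value is None:
        return default
    normalized = value.strip().lower()
    if normalized == "true":
        return True
    if normalized == "false":
        return False
    return default  # unrecognized values fall back to the default

# With strict, case-sensitive matching, "TRUE" is silently treated as
# unset; with lenient parsing it enables compression as intended.
print(parse_bool("TRUE"))   # True
print(parse_bool("true"))   # True
print(parse_bool("FALSE"))  # False
```

The key design point is that a config typo degrades to the default rather than raising, which matches how the setting failed silently in the thread.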