I have a similar issue with bz2 files. I have the Hadoop directories:

/ip/data/           : containing unzipped text files (foo1.txt, foo2.txt)
/ip/datacompressed/ : containing the same files bzipped (foo1.bz2, foo2.bz2)
CREATE EXTERNAL TABLE tx_log(id1 string, id2 string, id3 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002'
LOCATION '/ip/datacompressed/';

SELECT * FROM tx_log LIMIT 1;

The command works fine with LOCATION '/ip/data/' but doesn't work with
LOCATION '/ip/datacompressed/'. Any pointers? I thought (like Pig) Hive
automatically detects .bz2 extensions and applies the appropriate
decompression. Am I wrong?

-Prasen

On Thu, Feb 18, 2010 at 3:04 AM, Zheng Shao <[email protected]> wrote:
> I just corrected the wiki page. It will also be a good idea to support
> case-insensitive boolean values in the code.
>
> Zheng
>
> On Wed, Feb 17, 2010 at 9:27 AM, Brent Miller <[email protected]> wrote:
> > Thanks Adam, that works for me as well. It seems that the property
> > hive.exec.compress.output is case sensitive, and when it is set to TRUE
> > (as it is on the compressed storage page on the wiki) it is ignored by
> > Hive.
> >
> > -Brent
> >
> > On Tue, Feb 16, 2010 at 4:24 PM, Adam O'Donnell <[email protected]> wrote:
> >> Adding these to my hive-site.xml file worked fine:
> >>
> >> <property>
> >>   <name>hive.exec.compress.output</name>
> >>   <value>true</value>
> >>   <description>Compress output</description>
> >> </property>
> >>
> >> <property>
> >>   <name>mapred.output.compression.type</name>
> >>   <value>BLOCK</value>
> >>   <description>Block compression</description>
> >> </property>
> >>
> >> On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller <[email protected]> wrote:
> >> > Hello, I've seen issues similar to this one come up once or twice
> >> > before, but I haven't ever seen a solution to the problem that I'm
> >> > having. I was following the Compressed Storage page on the Hive Wiki
> >> > (http://wiki.apache.org/hadoop/CompressedStorage) and realized that
> >> > the sequence files that are created in the warehouse directory are
> >> > actually uncompressed and larger than the originals.
> >> > For example, I have a table 'test1' whose input data looks something
> >> > like:
> >> > 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
> >> > 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
> >> > 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
> >> > ...
> >> > And after creating a second table 'test1_comp' with the STORED AS
> >> > SEQUENCEFILE directive and the compression options SET as described
> >> > in the wiki, I can look at the resultant sequence files and see that
> >> > they're just plain (uncompressed) text:
> >> > SEQ "org.apache.hadoop.io.BytesWritable org.apache.hadoop.io.Text+�c�!Y�M ��Z^��=
> >> > 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43=
> >> > 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43=
> >> > 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341=
> >> > 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141=
> >> > ...
> >> > I've tried messing around with different
> >> > org.apache.hadoop.io.compress.* options, but the sequence files
> >> > always come out uncompressed. Has anybody ever seen this or know a
> >> > way to keep the data compressed? Since the input text is so uniform,
> >> > we get huge space savings from compression and would like to store
> >> > the data this way if possible. I'm using Hadoop 0.20.1 and Hive that
> >> > I checked out from SVN about a week ago.
> >> > Thanks,
> >> > Brent
> >>
> >> --
> >> Adam J. O'Donnell, Ph.D.
> >> Immunet Corporation
> >> Cell: +1 (267) 251-0070
>
> --
> Yours,
> Zheng
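
For reference, the session-level equivalent of Adam's hive-site.xml entries,
with the boolean lowercased as Zheng's wiki correction requires, looks roughly
like this. A sketch only: the thread never shows test1's real schema, so the
column list below is a placeholder.

  -- Must be lowercase 'true'; 'TRUE' was silently ignored at the time.
  SET hive.exec.compress.output=true;
  SET mapred.output.compression.type=BLOCK;

  -- Placeholder schema; match it to the actual columns of test1.
  CREATE TABLE test1_comp (line STRING) STORED AS SEQUENCEFILE;

  -- The MapReduce job writing this output is what produces the
  -- block-compressed SequenceFiles in the warehouse directory.
  INSERT OVERWRITE TABLE test1_comp SELECT * FROM test1;

BLOCK compression compresses runs of records together rather than each record
individually, which is where the large savings on uniform text like Brent's
come from.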

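On the bz2 question: Hadoop's TextInputFormat (which Hive uses for plain
delimited tables) chooses a decompression codec by file extension from the
codecs registered under io.compression.codecs, so Prasen's expectation is
reasonable. A first check from the Hive CLI, assuming a stock Hadoop 0.20
setup (which lists BZip2Codec by default), is whether the codec is actually
registered:

  -- Print the registered codec list; org.apache.hadoop.io.compress.BZip2Codec
  -- must appear for .bz2 files to be decompressed transparently.
  SET io.compression.codecs;

If BZip2Codec is missing, adding it to io.compression.codecs in core-site.xml
should make the external table over /ip/datacompressed/ readable.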