Just remember that we need to have the BZip2Codec class listed in the following Hadoop configuration property. Can you check?

io.compression.codecs

Zheng
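For reference, that property is normally set in core-site.xml. A minimal sketch, assuming the stock Hadoop 0.20 codec class names; if io.compression.codecs is already set on the cluster, BZip2Codec should be appended to the existing list rather than replacing it:

<property>
  <name>io.compression.codecs</name>
  <!-- BZip2Codec must appear in this list for .bz2 input to be decoded -->
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
  <description>Comma-separated list of compression codec classes that can be used for compression/decompression.</description>
</property>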
On Wed, Feb 17, 2010 at 11:21 PM, prasenjit mukherjee <[email protected]> wrote:
> So this is the command I ran, first with small.gz (which worked fine)
> and then with small.bz2 (which didn't work):
>
> drop table small_table;
> CREATE TABLE small_table(id1 string, id2 string, id3 string) ROW FORMAT
> DELIMITED FIELDS TERMINATED BY ',';
> LOAD DATA LOCAL INPATH '/root/data/small.gz' OVERWRITE INTO TABLE small_table;
> select * from small_table limit 1;
>
> For gz files I do see the following lines in hive_debug:
> 10/02/18 01:59:23 DEBUG ipc.RPC: Call: getBlockLocations 1
> 10/02/18 01:59:23 DEBUG util.NativeCodeLoader: Trying to load the custom-built native-hadoop library...
> 10/02/18 01:59:23 DEBUG util.NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
> 10/02/18 01:59:23 DEBUG util.NativeCodeLoader: java.library.path=/usr/java/jdk1.6.0_14/jre/lib/amd64/server:/usr/java/jdk1.6.0_14/jre/lib/amd64:/usr/java/jdk1.6.0_14/jre/../lib/amd64:/usr/java/packages/lib/amd64:/lib:/usr/lib
> 10/02/18 01:59:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 10/02/18 01:59:23 DEBUG fs.FSInputChecker: DFSClient readChunk got seqno 0 offsetInBlock 0 lastPacketInBlock true packetLen 88
> aid1 bid2 cid3
>
> But for bzip files there is none:
> 10/02/18 01:57:18 DEBUG ipc.RPC: Call: getBlockLocations 2
> 10/02/18 01:57:18 DEBUG fs.FSInputChecker: DFSClient readChunk got seqno 0 offsetInBlock 0 lastPacketInBlock true packetLen 85
> 10/02/18 01:57:18 WARN lazy.LazyStruct: Missing fields! Expected 3 fields but only got 1! Ignoring similar problems.
> BZh91AY&SY... (the raw bzip2 bytes echoed as a single garbled field)
> NULL NULL
>
> Let me know if you still need the debug files. Attached are the small.gz and small.bz2 files.
>
> Thanks and appreciate,
> -Prasen
>
> On Thu, Feb 18, 2010 at 11:52 AM, Zheng Shao <[email protected]> wrote:
>>
>> There is no special setting for bz2.
>>
>> Can you get the debug log?
>>
>> Zheng
>>
>> On Wed, Feb 17, 2010 at 9:02 PM, prasenjit mukherjee <[email protected]> wrote:
>> > So I tried the same with .gz files and it worked. I am using the following
>> > Hadoop version: Hadoop 0.20.1+169.56 with Cloudera's ami-2359bf4a. I thought
>> > that Hadoop 0.20 does support bz2 compression, hence the same should work
>> > with Hive as well.
>> >
>> > The interesting note is that Pig works fine on the same bz2 data. Is there
>> > any tweaking/config setup I need to do for Hive to take bz2 files as input?
>> >
>> > On Thu, Feb 18, 2010 at 8:31 AM, prasenjit mukherjee <[email protected]> wrote:
>> >>
>> >> I have a similar issue with bz2 files. I have the following Hadoop directories:
>> >>
>> >> /ip/data/ : containing unzipped text files (foo1.txt, foo2.txt)
>> >> /ip/datacompressed/ : containing the same files bzipped (foo1.bz2, foo2.bz2)
>> >>
>> >> CREATE EXTERNAL TABLE tx_log(id1 string, id2 string, id3 string)
>> >> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002'
>> >> LOCATION '/ip/datacompressed/';
>> >> SELECT * FROM tx_log limit 1;
>> >>
>> >> The command works fine with LOCATION '/ip/data/' but doesn't work with
>> >> LOCATION '/ip/datacompressed/'.
>> >>
>> >> Any pointers? I thought that (like Pig) Hive automatically detects the .bz2
>> >> extension and applies the appropriate decompression. Am I wrong?
>> >>
>> >> -Prasen
>> >>
>> >> On Thu, Feb 18, 2010 at 3:04 AM, Zheng Shao <[email protected]> wrote:
>> >>>
>> >>> I just corrected the wiki page. It would also be a good idea to support
>> >>> case-insensitive boolean values in the code.
>> >>>
>> >>> Zheng
>> >>>
>> >>> On Wed, Feb 17, 2010 at 9:27 AM, Brent Miller <[email protected]> wrote:
>> >>> > Thanks Adam, that works for me as well.
>> >>> > It seems that the hive.exec.compress.output property is case sensitive,
>> >>> > and when it is set to TRUE (as it is on the compressed storage page on
>> >>> > the wiki) it is ignored by Hive.
>> >>> >
>> >>> > -Brent
>> >>> >
>> >>> > On Tue, Feb 16, 2010 at 4:24 PM, Adam O'Donnell <[email protected]> wrote:
>> >>> >>
>> >>> >> Adding these to my hive-site.xml file worked fine:
>> >>> >>
>> >>> >> <property>
>> >>> >>   <name>hive.exec.compress.output</name>
>> >>> >>   <value>true</value>
>> >>> >>   <description>Compress output</description>
>> >>> >> </property>
>> >>> >>
>> >>> >> <property>
>> >>> >>   <name>mapred.output.compression.type</name>
>> >>> >>   <value>BLOCK</value>
>> >>> >>   <description>Block compression</description>
>> >>> >> </property>
>> >>> >>
>> >>> >> On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller <[email protected]> wrote:
>> >>> >> > Hello, I've seen issues similar to this one come up once or twice before,
>> >>> >> > but I haven't ever seen a solution to the problem that I'm having. I was
>> >>> >> > following the Compressed Storage page on the Hive wiki
>> >>> >> > (http://wiki.apache.org/hadoop/CompressedStorage) and realized that the
>> >>> >> > sequence files that are created in the warehouse directory are actually
>> >>> >> > uncompressed and larger than the originals.
>> >>> >> >
>> >>> >> > For example, I have a table 'test1' whose input data looks something like:
>> >>> >> >
>> >>> >> > 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
>> >>> >> > 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
>> >>> >> > 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
>> >>> >> > ...
>> >>> >> >
>> >>> >> > After creating a second table 'test1_comp' with the STORED AS SEQUENCEFILE
>> >>> >> > directive and the compression options SET as described in the wiki, I can
>> >>> >> > look at the resultant sequence files and see that they're just plain
>> >>> >> > (uncompressed) text:
>> >>> >> >
>> >>> >> > SEQ "org.apache.hadoop.io.BytesWritable org.apache.hadoop.io.Text (binary header bytes)
>> >>> >> > 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
>> >>> >> > 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
>> >>> >> > 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
>> >>> >> > 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141
>> >>> >> > ...
>> >>> >> >
>> >>> >> > I've tried messing around with different org.apache.hadoop.io.compress.*
>> >>> >> > options, but the sequence files always come out uncompressed. Has anybody
>> >>> >> > ever seen this or know a way to keep the data compressed? Since the input
>> >>> >> > text is so uniform, we get huge space savings from compression and would
>> >>> >> > like to store the data this way if possible.
>> >>> >> >
>> >>> >> > I'm using Hadoop 0.20.1 and Hive that I checked out from SVN about a week ago.
>> >>> >> >
>> >>> >> > Thanks,
>> >>> >> > Brent
>> >>> >>
>> >>> >> --
>> >>> >> Adam J. O'Donnell, Ph.D.
>> >>> >> Immunet Corporation
>> >>> >> Cell: +1 (267) 251-0070
>> >>>
>> >>> --
>> >>> Yours,
>> >>> Zheng
>>
>> --
>> Yours,
>> Zheng

--
Yours,
Zheng
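On the compressed SEQUENCEFILE question quoted above, a minimal hive-site.xml sketch of the settings usually involved. The property names are the standard Hadoop/Hive ones, but treat the values as illustrative; in particular, mapred.output.compression.codec is an assumption about what was missing in Brent's setup and is not confirmed anywhere in this thread:

<property>
  <name>hive.exec.compress.output</name>
  <!-- must be lowercase "true"; an uppercase TRUE is silently ignored, as Brent observed -->
  <value>true</value>
</property>

<property>
  <name>mapred.output.compression.type</name>
  <!-- RECORD or BLOCK; BLOCK usually compresses uniform rows far better -->
  <value>BLOCK</value>
</property>

<property>
  <name>mapred.output.compression.codec</name>
  <!-- the codec used to compress the SequenceFile output -->
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>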
