Have not checked gzip out yet, but Hive is happy with .bz2 files. The
documentation on this is spotty. It seems that any Hadoop-supported
compression will work. The issue with .gz files is that they are not
splittable: a single map will process an entire file, so if your .gz
files are large and you have more map capacity than files, you will not
be able to make use of that capacity.
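For what it's worth, a compressed import can look roughly like this. A
minimal sketch, with hypothetical table and path names; Hive picks the
codec from the file extension and decompresses transparently on read:

    -- Hypothetical external table over a directory of compressed
    -- weblog files; Hive decompresses .gz/.bz2 transparently on read.
    CREATE EXTERNAL TABLE weblogs_raw (line STRING)
    LOCATION '/user/hive/weblogs';

    -- Or move a compressed file in as-is, no gunzip step needed:
    LOAD DATA INPATH '/staging/access_log.2009-07-24.gz'
    INTO TABLE weblogs_raw;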
On Jul 24, 2009 10:09am, Saurabh Nanda <[email protected]> wrote:
Please excuse my ignorance, but can I import gzip-compressed files
directly as Hive tables? I have a separate gzip file for each day's weblog
data. Right now I am gunzipping them and then importing them into a raw
table. Can I import the gzipped files directly into Hive?
Saurabh.
On Wed, Jul 22, 2009 at 1:07 AM, Ashish Thusoo <[email protected]>
wrote:
I don't think these are splittable. Compression on SequenceFiles is
splittable across SequenceFile blocks.
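A minimal sketch of that route, with a hypothetical table name; the set
commands are the standard Hive/Hadoop output-compression settings:

    -- Hypothetical: rewrite the data as a block-compressed
    -- SequenceFile table so the compressed output stays splittable.
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.type=BLOCK;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

    CREATE TABLE weblogs_seq (line STRING)
    STORED AS SEQUENCEFILE;

    INSERT OVERWRITE TABLE weblogs_seq
    SELECT line FROM weblogs_raw;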
Ashish
-----Original Message-----
From: Bill Craig [mailto:[email protected]]
Sent: Tuesday, July 21, 2009 8:06 AM
To: [email protected]
Subject: bz2 Splits.
I loaded 5 files of bzip2-compressed data into a table in Hive. Three are
small test files containing 10,000 records; two are large, ~8 GB
compressed.
When I run a query against the table I see three tasks that complete
almost immediately and two tasks that run for a very long time. It
appears to me that Hive/Hadoop is not splitting the input of the *.bz2
files. I have seen some old mails about this, but could not find any
resolution for the problem. I compressed the files using the Apache bz2
jar; the files are named *.bz2. I am using Hadoop 0.19.1 r745977.
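Until splitting works, one possible workaround is to have Hive rewrite
the two big archives into many smaller compressed files, so parallelism
comes from the file count rather than the file size. A sketch, with
hypothetical table names and an arbitrary reducer count:

    -- Hypothetical: fan the big .bz2 table out into 32 smaller
    -- compressed files; each file then gets its own map task in
    -- later queries.
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
    SET mapred.reduce.tasks=32;

    INSERT OVERWRITE TABLE weblogs_split
    SELECT line FROM weblogs_bz2
    DISTRIBUTE BY rand();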
--
http://nandz.blogspot.com
http://foodieforlife.blogspot.com