Have not checked gzip out yet, but Hive is happy with .bz2 files. The documentation on this is spotty, but it seems that any Hadoop-supported compression will work. The issue with .gz files is that they are not splittable: a single map task will process an entire file, so if your .gz files are large and you have more map capacity than files, you will not be able to make use of it.
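For example, Hive can read the compressed files in place, decompressing them at read time. A minimal sketch of loading one directly (the table name, schema, and file path are made up for illustration):

    -- Plain-text table; Hadoop picks the codec from the file extension.
    CREATE TABLE weblogs (line STRING)
    STORED AS TEXTFILE;

    -- The .bz2 file loads as-is; no manual decompression step.
    LOAD DATA LOCAL INPATH '/var/logs/access-2009-07-21.log.bz2'
    INTO TABLE weblogs;

Whether the file is then read by one map task or several depends on the codec and the Hadoop version, not on Hive.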

On Jul 24, 2009 10:09am, Saurabh Nanda <[email protected]> wrote:
Please excuse my ignorance, but can I import gzip-compressed files directly as Hive tables? I have a separate gzip file for each day's weblog data. Right now, I am gunzipping them and then importing the data into a raw table. Can I import the gzipped files directly into Hive?


Saurabh.
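For the per-day files described above, a partition per day is one common layout, and the gzipped files can be loaded without the gunzip step. A sketch with hypothetical table, column, and path names:

    -- One partition per day of weblogs; all names are made up.
    CREATE TABLE raw_weblogs (line STRING)
    PARTITIONED BY (dt STRING)
    STORED AS TEXTFILE;

    -- The gzipped file goes in directly.
    LOAD DATA LOCAL INPATH '/logs/access-2009-07-24.gz'
    INTO TABLE raw_weblogs PARTITION (dt='2009-07-24');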

On Wed, Jul 22, 2009 at 1:07 AM, Ashish Thusoo <[email protected]> wrote:

I don't think these are splittable. Compression on SequenceFiles is splittable across SequenceFile blocks.

Ashish
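One way to get splittable compressed data is to have Hive write block-compressed SequenceFiles itself. A sketch, assuming Hadoop 0.19-era property names and made-up table names:

    -- Compress query output into block-compressed SequenceFiles.
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.type=BLOCK;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

    CREATE TABLE weblogs_seq (line STRING)
    STORED AS SEQUENCEFILE;

    -- The result splits at SequenceFile block boundaries.
    INSERT OVERWRITE TABLE weblogs_seq
    SELECT line FROM raw_weblogs;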

-----Original Message-----
From: Bill Craig [mailto:[email protected]]
Sent: Tuesday, July 21, 2009 8:06 AM
To: [email protected]
Subject: bz2 Splits.

I loaded 5 files of bzip2-compressed data into a table in Hive. Three were small test files containing 10,000 records each; two were large, ~8 GB compressed.

When I run a query against the table, I see three tasks that complete almost immediately and two tasks that run for a very long time. It appears to me that Hive/Hadoop is not splitting the input of the *.bz2 files. I have seen some old mails about this, but could not find any resolution for the problem. I compressed the files using the Apache bz2 jar, and the files are named *.bz2. I am using Hadoop 0.19.1 (r745977).

--
http://nandz.blogspot.com
http://foodieforlife.blogspot.com
