Have not checked gzip out yet, but Hive is happy with .bz2 files. The documentation on this is spotty, but it seems that any Hadoop-supported compression will work. The issue with .gz files is that they are not splittable: a single map task will process an entire file, so if your .gz files are large and you have more map capacity than files, you will not be able to make use of it.
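For example, Hive can read the compressed files in place, decompressing them at read time. A minimal sketch of loading one directly (the table name, schema, and file path are made up for illustration):

    -- Plain-text table; Hadoop picks the codec from the file extension.
    CREATE TABLE weblogs (line STRING)
    STORED AS TEXTFILE;

    -- The .bz2 file loads as-is; no manual decompression step.
    LOAD DATA LOCAL INPATH '/var/logs/access-2009-07-21.log.bz2'
    INTO TABLE weblogs;

Whether the file is then read by one map task or several depends on the codec and the Hadoop version, not on Hive.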

On Jul 24, 2009 10:09am, Saurabh Nanda <[email protected]> wrote:
Please excuse my ignorance, but can I import gzip-compressed files directly as Hive tables? I have a separate gzip file for each day's weblog data. Right now, I am gunzipping them and then importing the data into a raw table. Can I import the gzipped files directly into Hive?


Saurabh.
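For the per-day files described above, a partition per day is one common layout, and the gzipped files can be loaded without the gunzip step. A sketch with hypothetical table, column, and path names:

    -- One partition per day of weblogs; all names are made up.
    CREATE TABLE raw_weblogs (line STRING)
    PARTITIONED BY (dt STRING)
    STORED AS TEXTFILE;

    -- The gzipped file goes in directly.
    LOAD DATA LOCAL INPATH '/logs/access-2009-07-24.gz'
    INTO TABLE raw_weblogs PARTITION (dt='2009-07-24');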

On Wed, Jul 22, 2009 at 1:07 AM, Ashish Thusoo <[email protected]> wrote:

I don't think these are splittable. Compression on SequenceFiles is splittable across SequenceFile blocks.

Ashish
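One way to get splittable compressed data is to have Hive write block-compressed SequenceFiles itself. A sketch, assuming Hadoop 0.19-era property names and made-up table names:

    -- Compress query output into block-compressed SequenceFiles.
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.type=BLOCK;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

    CREATE TABLE weblogs_seq (line STRING)
    STORED AS SEQUENCEFILE;

    -- The result splits at SequenceFile block boundaries.
    INSERT OVERWRITE TABLE weblogs_seq
    SELECT line FROM raw_weblogs;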

-----Original Message-----
From: Bill Craig [mailto:[email protected]]
Sent: Tuesday, July 21, 2009 8:06 AM
To: [email protected]
Subject: bz2 Splits.

I loaded 5 files of bzip2-compressed data into a table in Hive. Three were small test files containing 10,000 records each; two were large, ~8 GB compressed.

When I run a query against the table, I see three tasks that complete almost immediately and two tasks that run for a very long time. It appears to me that Hive/Hadoop is not splitting the input of the *.bz2 files. I have seen some old mails about this, but could not find any resolution for the problem. I compressed the files using the Apache bz2 jar, and the files are named *.bz2. I am using Hadoop 0.19.1 (r745977).

--
http://nandz.blogspot.com
http://foodieforlife.blogspot.com
