gz files work fine.  We're attaching daily directories of gzipped logs
in S3 as Hive table partitions.

Best to have your log rotator do hourly rotation so you end up with lots
of gz files, which map better across tasks.  Or you could use zcat,
split, and gzip to divide the data into smaller chunks if you really
only have one gz file per partition.
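A minimal sketch of that zcat/split/gzip pipeline (the file name
`access.log.gz` and the `chunk_` prefix are hypothetical; adjust the
lines-per-chunk to taste):

```shell
# Re-chunk one large .gz file into several smaller ones so Hadoop can
# run one map per chunk. Names here are made up for illustration.
set -e

# Build a sample input file so this sketch is self-contained.
seq 1 200000 | gzip > access.log.gz

# Decompress, split on line boundaries (so no log record is cut in
# half), then recompress each piece independently.
zcat access.log.gz | split -l 50000 - chunk_
for f in chunk_*; do
  gzip "$f"
done

ls chunk_*.gz
```

Splitting with `-l` (lines) rather than `-b` (bytes) keeps each log
record intact, so every chunk remains a well-formed input file on its own.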

On Fri, Jul 24, 2009 at 9:48 AM, <[email protected]> wrote:
> Have not checked gzip out yet, but Hive is happy with .bz2 files. The
> documentation on this is spotty. It seems that any Hadoop-supported
> compression will work. The issue with .gz files is that they are not
> splittable: one map will process an entire file, so if your .gz files
> are large and you have more map capacity than files, you will not be
> able to make use of it.
>
> On Jul 24, 2009 10:09am, Saurabh Nanda <[email protected]> wrote:
>> Please excuse my ignorance, but can I import gzip-compressed files
>> directly as Hive tables? I have a separate gzip file for each day's
>> weblog data. Right now I am gunzipping them and then importing into a
>> raw table. Can I import the gzipped files directly into Hive?
>>
>>
>> Saurabh.
>>
>> On Wed, Jul 22, 2009 at 1:07 AM, Ashish Thusoo <[email protected]>
>> wrote:
>>
>> I don't think these are splittable. Compression on SequenceFiles is
>> splittable across SequenceFile blocks.
>>
>>
>> Ashish
>>
>> -----Original Message-----
>> From: Bill Craig [mailto:[email protected]]
>> Sent: Tuesday, July 21, 2009 8:06 AM
>> To: [email protected]
>> Subject: bz2 Splits.
>>
>> I loaded 5 files of bzip2-compressed data into a table in Hive. Three
>> are small test files containing 10,000 records each. Two were large,
>> ~8 GB compressed. When I run a query against the table I see three
>> tasks that complete almost immediately and two tasks that run for a
>> very long time. It appears to me that Hive/Hadoop is not splitting the
>> input of the *.bz2 files. I have seen some old mails about this, but
>> could not find any resolution. I compressed the files using the Apache
>> bz2 jar, and the files are named *.bz2. I am using Hadoop 0.19.1
>> r745977.
>>
>> --
>> http://nandz.blogspot.com
>> http://foodieforlife.blogspot.com
>>