I tried the following and ran into an error message:
create table compressed_raw(line string) partitioned by(dt string)
row format delimited fields terminated by '\t' lines terminated by '\n'
stored as sequencefile;
hive> load data local inpath
'/tmp/weblogs/20090602000000-172.16.1.40-access.log.gz' into table
compressed_raw partition(dt='2009-06-01');
Copying data from file:/tmp/weblogs/20090602000000-172.16.1.40-access.log.gz
Loading data to table compressed_raw partition {dt=2009-06-01}
Failed with exception Cannot load text files into a table stored as
SequenceFile.
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.MoveTask
I guess this is what the following thread is talking about --
http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/200903.mbox/%[email protected]%3e
To sum up the discussion there: do I have to first import into a textfile
table, set hive.exec.compress.output to true, and then insert into a
sequencefile table? If that's the case, I don't understand why I have to
explicitly set hive.exec.compress.output. Shouldn't the fact that the target
is a sequencefile table achieve the desired result on its own?
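For reference, here is how I read the two-step workflow described in that
thread, adapted to my table above (raw_staging is just an illustrative name,
and I haven't verified this end to end):

create table raw_staging(line string) partitioned by(dt string)
row format delimited fields terminated by '\t' lines terminated by '\n'
stored as textfile;

load data local inpath
'/tmp/weblogs/20090602000000-172.16.1.40-access.log.gz' into table
raw_staging partition(dt='2009-06-01');

-- without this, the insert below would write uncompressed sequencefiles
set hive.exec.compress.output=true;

from raw_staging
insert overwrite table compressed_raw partition(dt='2009-06-01')
select line where dt='2009-06-01';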
I'm on hadoop-0.18.3 & hive-0.3.0
PS: More details on the Wiki around compressed storage would be really
appreciated.
Saurabh.
On Fri, Jul 24, 2009 at 10:02 PM, Neal Richter <[email protected]> wrote:
> gz files work fine. We're attaching daily directories of gzipped logs
> in S3 as hive table partitions.
>
> It's best to have your log rotator do hourly rotation so you get lots
> of gz files for better mapping. Or, if you really only have one gz file
> per partition, you could use zcat, split, and gzip to divide it into
> smaller chunks.
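>
> For example, assuming a single big access.log.gz (the names and the
> million-line chunk size are just placeholders):
>
>   zcat access.log.gz | split -l 1000000 - access-part-
>   gzip access-part-*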
>
> On Fri, Jul 24, 2009 at 9:48 AM, <[email protected]> wrote:
> > Have not checked gzip yet, but Hive is happy with .bz2 files. The
> > documentation on this is spotty; it seems that any Hadoop-supported
> > compression will work. The issue with .gz files is that they are not
> > splittable: one map will process an entire file, so if your .gz files
> > are large and you have more map capacity than files, you will not be
> > able to make use of it.
> >
> > On Jul 24, 2009 10:09am, Saurabh Nanda <[email protected]> wrote:
> >> Please excuse my ignorance, but can I import gzip-compressed files
> >> directly as Hive tables? I have a separate gzip file for each day's
> >> weblog data. Right now I am gunzipping them and then importing into
> >> a raw table. Can I import the gzipped files directly into Hive?
> >>
> >>
> >> Saurabh.
> >>
> >> On Wed, Jul 22, 2009 at 1:07 AM, Ashish Thusoo <[email protected]>
> >> wrote:
> >>
> >> I don't think these are splittable. Compression on sequencefiles is
> >> splittable across sequencefile blocks.
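> >>
> >> To get block-compressed sequencefile output from an insert, the
> >> settings would be along these lines (mapred.output.compression.type
> >> is the hadoop knob that picks record vs. block compression):
> >>
> >>   set hive.exec.compress.output=true;
> >>   set mapred.output.compression.type=BLOCK;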
> >>
> >> Ashish
> >>
> >> -----Original Message-----
> >> From: Bill Craig [mailto:[email protected]]
> >> Sent: Tuesday, July 21, 2009 8:06 AM
> >> To: [email protected]
> >> Subject: bz2 Splits.
> >>
> >> I loaded 5 files of bzip2-compressed data into a table in Hive.
> >> Three are small test files containing 10,000 records each; two were
> >> large, ~8GB compressed. When I run a query against the table, I see
> >> three tasks that complete almost immediately and two tasks that run
> >> for a very long time. It appears to me that Hive/Hadoop is not
> >> splitting the *.bz2 input. I have seen some old mails about this,
> >> but could not find any resolution for the problem. I compressed the
> >> files using the Apache bz2 jar, and the files are named *.bz2. I am
> >> using Hadoop 0.19.1 r745977.
>
--
http://nandz.blogspot.com
http://foodieforlife.blogspot.com