Hi Saurabh,

If you want to load data (in compressed/uncompressed text format) into a
table, you have to define the table as "stored as textfile" instead of
"stored as sequencefile".
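For example (a sketch reusing the table name and path from your message;
only the storage clause changes, and it is untested here):

  -- same DDL, but stored as textfile so plain or gzipped text can be loaded
  create table compressed_raw(line string) partitioned by(dt string)
  row format delimited fields terminated by '\t' lines terminated by '\n'
  stored as textfile;

  -- the load just copies the .gz file into the partition directory;
  -- Hadoop's gzip codec decompresses it at query time
  load data local inpath
  '/tmp/weblogs/20090602000000-172.16.1.40-access.log.gz'
  into table compressed_raw partition(dt='2009-06-01');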
Can you try again and let us know?

Zheng

On Sat, Jul 25, 2009 at 3:05 AM, Saurabh Nanda <[email protected]> wrote:
> I tried the following and ran into an error message:
>
> create table compressed_raw(line string) partitioned by(dt string)
> row format delimited fields terminated by '\t' lines terminated by '\n'
> stored as sequencefile;
>
> hive> load data local inpath
> '/tmp/weblogs/20090602000000-172.16.1.40-access.log.gz' into table
> compressed_raw partition(dt='2009-06-01');
> Copying data from file:/tmp/weblogs/20090602000000-172.16.1.40-access.log.gz
> Loading data to table compressed_raw partition {dt=2009-06-01}
> Failed with exception Cannot load text files into a table stored as
> SequenceFile.
> FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.MoveTask
>
> I guess this is what the following thread is talking about --
> http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/200903.mbox/%[email protected]%3e
>
> To sum up the discussion there, do I have to first import into a textfile
> table, set hive.exec.compress.output to true, and then insert into a
> sequencefile table? If that's the case, I don't understand why I have to
> explicitly set hive.exec.compress.output. Shouldn't the fact that the
> target is a sequencefile table achieve the desired result?
>
> I'm on hadoop-0.18.3 & hive-0.3.0.
>
> PS: More details on the Wiki around compressed storage would be really
> appreciated.
>
> Saurabh.
>
> On Fri, Jul 24, 2009 at 10:02 PM, Neal Richter <[email protected]> wrote:
>>
>> gz files work fine. We're attaching daily directories of gzipped logs
>> in S3 as Hive table partitions.
>>
>> It's best to have your log rotator do hourly rotation to create lots of
>> gz files for better mapping, or you could use zcat, split, and gzip to
>> divide the data into smaller chunks if you really have only one gz file
>> per partition.
>>
>> On Fri, Jul 24, 2009 at 9:48 AM, <[email protected]> wrote:
>> > Haven't checked gzip out yet, but Hive is happy with .bz2 files. The
>> > documentation on this is spotty. It seems that any Hadoop-supported
>> > compression will work. The issue with .gz files is that they are not
>> > splittable: one map will process an entire file, so if your .gz files
>> > are large and you have more map capacity than files, you will not be
>> > able to make use of it.
>> >
>> > On Jul 24, 2009 10:09am, Saurabh Nanda <[email protected]> wrote:
>> >> Please excuse my ignorance, but can I import gzip-compressed files
>> >> directly as Hive tables? I have separate gzip files for each day's
>> >> weblog data. Right now I am gunzipping them and then importing into
>> >> a raw table. Can I import the gzipped files directly into Hive?
>> >>
>> >> Saurabh.
>> >>
>> >> On Wed, Jul 22, 2009 at 1:07 AM, Ashish Thusoo
>> >> <[email protected]> wrote:
>> >>
>> >> I don't think these are splittable. Compression on sequencefiles is
>> >> splittable across sequencefile blocks.
>> >>
>> >> Ashish
>> >>
>> >> -----Original Message-----
>> >> From: Bill Craig [mailto:[email protected]]
>> >> Sent: Tuesday, July 21, 2009 8:06 AM
>> >> To: [email protected]
>> >> Subject: bz2 Splits.
>> >>
>> >> I loaded 5 files of bzip2-compressed data into a table in Hive. Three
>> >> are small test files containing 10,000 records. Two were large, ~8GB
>> >> compressed.
>> >> When I run a query against the table, I see three tasks that complete
>> >> almost immediately and two tasks that run for a very long time. It
>> >> appears to me that Hive/Hadoop is not splitting the input of the
>> >> *.bz2 files. I have seen some old mails about this, but could not
>> >> find any resolution for the problem. I compressed the files using
>> >> the Apache bz2 jar; the files are named *.bz2. I am using Hadoop
>> >> 0.19.1 r745977.
>> >>
>> >> --
>> >> http://nandz.blogspot.com
>> >> http://foodieforlife.blogspot.com
>
> --
> http://nandz.blogspot.com
> http://foodieforlife.blogspot.com
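On the two-step flow you summarized, it would look roughly like this (a
sketch only, not tested on hadoop-0.18.3/hive-0.3.0; compressed_seq is a
placeholder name):

  -- target table: same data, stored as a sequencefile
  create table compressed_seq(line string) partitioned by(dt string)
  stored as sequencefile;

  -- compression is a property of the job that writes the data, not of the
  -- table definition, which is why it has to be set explicitly
  set hive.exec.compress.output=true;
  set mapred.output.compression.type=BLOCK;  -- block-level, so splits work

  -- rewrite the text partition into the sequencefile table
  insert overwrite table compressed_seq partition(dt='2009-06-01')
  select line from compressed_raw where dt='2009-06-01';

As Ashish noted above, compression inside a sequencefile is splittable
across its blocks, so this also works around the non-splittable .gz input
for later queries.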
--
Yours,
Zheng