So these are the commands I ran, first with small.gz (which worked fine) and then with small.bz2 (which didn't work):
drop table small_table;
CREATE TABLE small_table(id1 string, id2 string, id3 string) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/root/data/small.gz' OVERWRITE INTO TABLE
small_table;
select * from small_table limit 1;
For gz files I do see the following lines in hive_debug:
10/02/18 01:59:23 DEBUG ipc.RPC: Call: getBlockLocations 1
10/02/18 01:59:23 DEBUG util.NativeCodeLoader: Trying to load the
custom-built native-hadoop library...
10/02/18 01:59:23 DEBUG util.NativeCodeLoader: Failed to load native-hadoop
with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
10/02/18 01:59:23 DEBUG util.NativeCodeLoader:
java.library.path=/usr/java/jdk1.6.0_14/jre/lib/amd64/server:/usr/java/jdk1.6.0_14/jre/lib/amd64:/usr/java/jdk1.6.0_14/jre/../lib/amd64:/usr/java/packages/lib/amd64:/lib:/usr/lib
10/02/18 01:59:23 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
10/02/18 01:59:23 DEBUG fs.FSInputChecker: DFSClient readChunk got seqno 0
offsetInBlock 0 lastPacketInBlock true packetLen 88
aid1 bid2 cid3
But for bzip2 files there are none:
10/02/18 01:57:18 DEBUG ipc.RPC: Call: getBlockLocations 2
10/02/18 01:57:18 DEBUG fs.FSInputChecker: DFSClient readChunk got seqno 0
offsetInBlock 0 lastPacketInBlock true packetLen 85
10/02/18 01:57:18 WARN lazy.LazyStruct: Missing fields! Expected 3 fields
but only got 1! Ignoring similar problems.
BZh91AY&SYǧ �"y...@>< TP?*�"��SFL�c����ѶѶ�$��
�w��U�)�=8O�
NULL NULL
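
One thing I still need to double-check (not sure yet whether this is actually the cause) is whether the BZip2 codec is listed in io.compression.codecs in core-site.xml on this cluster. My understanding is it should look something like this:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
  <description>Compression codecs available to Hadoop (and hence Hive) for input files</description>
</property>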
Let me know if you still need the debug files. Attached are the small.gz and
small.bz2 files.
Thanks, much appreciated,
-Prasen
On Thu, Feb 18, 2010 at 11:52 AM, Zheng Shao <[email protected]> wrote:
> There is no special setting for bz2.
>
> Can you get the debug log?
>
> Zheng
>
> On Wed, Feb 17, 2010 at 9:02 PM, prasenjit mukherjee
> <[email protected]> wrote:
> > So I tried the same with .gz files and it worked. I am using the following
> > Hadoop version: Hadoop 0.20.1+169.56 with Cloudera's ami-2359bf4a. I thought
> > that Hadoop 0.20 does support bz2 compression, so the same should work with
> > Hive as well.
> >
> > An interesting note is that Pig works fine on the same bz2 data. Is there
> > any tweaking/config setup I need to do for Hive to take bz2 files as input?
> >
> > On Thu, Feb 18, 2010 at 8:31 AM, prasenjit mukherjee
> > <[email protected]> wrote:
> >>
> >> I have a similar issue with bz2 files. I have the Hadoop directories:
> >>
> >> /ip/data/: containing unzipped text files (foo1.txt, foo2.txt)
> >> /ip/datacompressed/: containing the same files bzipped (foo1.bz2, foo2.bz2)
> >>
> >> CREATE EXTERNAL TABLE tx_log(id1 string, id2 string, id3 string)
> >> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002'
> >> LOCATION '/ip/datacompressed/';
> >> SELECT * FROM tx_log limit 1;
> >>
> >> The command works fine with LOCATION '/ip/data/' but doesn't work with
> >> LOCATION '/ip/datacompressed/'.
> >>
> >> Any pointers? I thought (like Pig) Hive automatically detects the .bz2
> >> extension and applies the appropriate decompression. Am I wrong?
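> >>
> >> In case it helps, this is roughly how I understand the extension-based
> >> codec lookup to work on the Hadoop side (a quick standalone sketch I put
> >> together against the 0.20 CompressionCodecFactory API, so please correct
> >> me if I have any of it wrong):
> >>
> >> import org.apache.hadoop.conf.Configuration;
> >> import org.apache.hadoop.fs.Path;
> >> import org.apache.hadoop.io.compress.CompressionCodec;
> >> import org.apache.hadoop.io.compress.CompressionCodecFactory;
> >>
> >> public class CodecCheck {
> >>   public static void main(String[] args) {
> >>     // The factory registers whatever codecs are listed in
> >>     // io.compression.codecs (plus the built-in defaults).
> >>     Configuration conf = new Configuration();
> >>     CompressionCodecFactory factory = new CompressionCodecFactory(conf);
> >>     // Ask which codec, if any, is mapped to the .bz2 extension.
> >>     CompressionCodec codec =
> >>         factory.getCodec(new Path("/ip/datacompressed/foo1.bz2"));
> >>     System.out.println(codec == null
> >>         ? "no codec found" : codec.getClass().getName());
> >>   }
> >> }
> >>
> >> If that prints "no codec found" on my cluster, I'd take it as a sign the
> >> bz2 codec isn't registered there, rather than a Hive-specific problem.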
> >>
> >> -Prasen
> >>
> >>
> >> On Thu, Feb 18, 2010 at 3:04 AM, Zheng Shao <[email protected]> wrote:
> >>>
> >>> I just corrected the wiki page. It will also be a good idea to support
> >>> case-insensitive boolean values in the code.
> >>>
> >>> Zheng
> >>>
> >>> On Wed, Feb 17, 2010 at 9:27 AM, Brent Miller <[email protected]> wrote:
> >>> > Thanks Adam, that works for me as well.
> >>> > It seems that the property hive.exec.compress.output is case sensitive,
> >>> > and when it is set to TRUE (as it is on the compressed storage page on
> >>> > the wiki) it is ignored by Hive.
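> >>> >
> >>> > (If I remember the CLI behavior correctly, running SET with just the
> >>> > property name makes Hive print the value it actually picked up, e.g.:)
> >>> >
> >>> > hive> SET hive.exec.compress.output;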
> >>> >
> >>> > -Brent
> >>> >
> >>> > On Tue, Feb 16, 2010 at 4:24 PM, Adam O'Donnell <[email protected]>
> >>> > wrote:
> >>> >>
> >>> >> Adding these to my hive-site.xml file worked fine:
> >>> >>
> >>> >> <property>
> >>> >>   <name>hive.exec.compress.output</name>
> >>> >>   <value>true</value>
> >>> >>   <description>Compress output</description>
> >>> >> </property>
> >>> >>
> >>> >> <property>
> >>> >>   <name>mapred.output.compression.type</name>
> >>> >>   <value>BLOCK</value>
> >>> >>   <description>Block compression</description>
> >>> >> </property>
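> >>> >>
> >>> >> (If you need a specific codec rather than the cluster default, I believe
> >>> >> you can also name it explicitly, though I haven't checked whether this
> >>> >> part is strictly required:)
> >>> >>
> >>> >> <property>
> >>> >>   <name>mapred.output.compression.codec</name>
> >>> >>   <value>org.apache.hadoop.io.compress.GzipCodec</value>
> >>> >>   <description>Codec to use for compressed job output</description>
> >>> >> </property>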
> >>> >>
> >>> >>
> >>> >> On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller
> >>> >> <[email protected]>
> >>> >> wrote:
> >>> >> > Hello, I've seen issues similar to this one come up once or twice
> >>> >> > before, but I haven't ever seen a solution to the problem that I'm
> >>> >> > having. I was following the Compressed Storage page on the Hive Wiki
> >>> >> > (http://wiki.apache.org/hadoop/CompressedStorage) and realized that
> >>> >> > the sequence files that are created in the warehouse directory are
> >>> >> > actually uncompressed and larger than the originals.
> >>> >> > For example, I have a table 'test1' whose input data looks something
> >>> >> > like:
> >>> >> > 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
> >>> >> > 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
> >>> >> > 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
> >>> >> > ...
> >>> >> > And after creating a second table 'test1_comp' that was created with
> >>> >> > the STORED AS SEQUENCEFILE directive and the compression options SET
> >>> >> > as described in the wiki, I can look at the resultant sequence files
> >>> >> > and see that they're just plain (uncompressed) text:
> >>> >> > SEQ "org.apache.hadoop.io.BytesWritable
> >>> >> > org.apache.hadoop.io.Text+�c�!Y�M ��
> >>> >> > Z^��= 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43=
> >>> >> > 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43=
> >>> >> > 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341=
> >>> >> > 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141=
> >>> >> > ...
> >>> >> > I've tried messing around with different org.apache.hadoop.io.compress.*
> >>> >> > options, but the sequence files always come out uncompressed. Has anybody
> >>> >> > ever seen this or know a way to keep the data compressed? Since the input
> >>> >> > text is so uniform, we get huge space savings from compression and would
> >>> >> > like to store the data this way if possible. I'm using Hadoop 0.20.1 and
> >>> >> > Hive that I checked out from SVN about a week ago.
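> >>> >> > For reference, the settings I've been using are roughly these (taken
> >>> >> > from the wiki plus my own codec experiments, so I may well have gotten
> >>> >> > something wrong):
> >>> >> > SET hive.exec.compress.output=TRUE;
> >>> >> > SET mapred.output.compression.type=BLOCK;
> >>> >> > SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;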
> >>> >> > Thanks,
> >>> >> > Brent
> >>> >>
> >>> >>
> >>> >>
> >>> >> --
> >>> >> Adam J. O'Donnell, Ph.D.
> >>> >> Immunet Corporation
> >>> >> Cell: +1 (267) 251-0070
> >>> >
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Yours,
> >>> Zheng
> >>
> >
> >
>
>
>
> --
> Yours,
> Zheng
>
small.bz2
Description: BZip2 compressed data
small.gz
Description: GNU Zip compressed data
