Just remember that we need to have the BZip2Codec class listed in the following Hadoop configuration property. Can you check?

io.compression.codecs

Zheng
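For reference, that property is normally set in core-site.xml. A minimal sketch, assuming the stock Hadoop 0.20 codec class names; if io.compression.codecs is already set on the cluster, BZip2Codec should be appended to the existing list rather than replacing it:

<property>
  <name>io.compression.codecs</name>
  <!-- BZip2Codec must appear in this list for .bz2 input to be decoded -->
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
  <description>Comma-separated list of compression codec classes that can be used for compression/decompression.</description>
</property>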
On Wed, Feb 17, 2010 at 11:21 PM, prasenjit mukherjee <[email protected]> wrote:
> So this is the command I ran, first with small.gz (which worked fine)
> and then with small.bz2 (which didn't work):
>
> drop table small_table;
> CREATE TABLE small_table(id1 string, id2 string, id3 string) ROW FORMAT
> DELIMITED FIELDS TERMINATED BY ',';
> LOAD DATA LOCAL INPATH '/root/data/small.gz' OVERWRITE INTO TABLE small_table;
> select * from small_table limit 1;
>
> For gz files I do see the following lines in hive_debug:
> 10/02/18 01:59:23 DEBUG ipc.RPC: Call: getBlockLocations 1
> 10/02/18 01:59:23 DEBUG util.NativeCodeLoader: Trying to load the custom-built native-hadoop library...
> 10/02/18 01:59:23 DEBUG util.NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
> 10/02/18 01:59:23 DEBUG util.NativeCodeLoader: java.library.path=/usr/java/jdk1.6.0_14/jre/lib/amd64/server:/usr/java/jdk1.6.0_14/jre/lib/amd64:/usr/java/jdk1.6.0_14/jre/../lib/amd64:/usr/java/packages/lib/amd64:/lib:/usr/lib
> 10/02/18 01:59:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 10/02/18 01:59:23 DEBUG fs.FSInputChecker: DFSClient readChunk got seqno 0 offsetInBlock 0 lastPacketInBlock true packetLen 88
> aid1 bid2 cid3
>
> But for bzip files there is none:
> 10/02/18 01:57:18 DEBUG ipc.RPC: Call: getBlockLocations 2
> 10/02/18 01:57:18 DEBUG fs.FSInputChecker: DFSClient readChunk got seqno 0 offsetInBlock 0 lastPacketInBlock true packetLen 85
> 10/02/18 01:57:18 WARN lazy.LazyStruct: Missing fields! Expected 3 fields but only got 1! Ignoring similar problems.
> BZh91AY&SY... (the raw bzip2 bytes echoed as a single garbled field)
> NULL NULL
>
> Let me know if you still need the debug files. Attached are the small.gz and small.bz2 files.
>
> Thanks and appreciate,
> -Prasen
>
> On Thu, Feb 18, 2010 at 11:52 AM, Zheng Shao <[email protected]> wrote:
>>
>> There is no special setting for bz2.
>>
>> Can you get the debug log?
>>
>> Zheng
>>
>> On Wed, Feb 17, 2010 at 9:02 PM, prasenjit mukherjee <[email protected]> wrote:
>> > So I tried the same with .gz files and it worked. I am using the following
>> > Hadoop version: Hadoop 0.20.1+169.56 with Cloudera's ami-2359bf4a. I thought
>> > that Hadoop 0.20 does support bz2 compression, hence the same should work
>> > with Hive as well.
>> >
>> > The interesting note is that Pig works fine on the same bz2 data. Is there
>> > any tweaking/config setup I need to do for Hive to take bz2 files as input?
>> >
>> > On Thu, Feb 18, 2010 at 8:31 AM, prasenjit mukherjee <[email protected]> wrote:
>> >>
>> >> I have a similar issue with bz2 files. I have the following Hadoop directories:
>> >>
>> >> /ip/data/ : containing unzipped text files (foo1.txt, foo2.txt)
>> >> /ip/datacompressed/ : containing the same files bzipped (foo1.bz2, foo2.bz2)
>> >>
>> >> CREATE EXTERNAL TABLE tx_log(id1 string, id2 string, id3 string)
>> >> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002'
>> >> LOCATION '/ip/datacompressed/';
>> >> SELECT * FROM tx_log limit 1;
>> >>
>> >> The command works fine with LOCATION '/ip/data/' but doesn't work with
>> >> LOCATION '/ip/datacompressed/'.
>> >>
>> >> Any pointers? I thought that (like Pig) Hive automatically detects the .bz2
>> >> extension and applies the appropriate decompression. Am I wrong?
>> >>
>> >> -Prasen
>> >>
>> >> On Thu, Feb 18, 2010 at 3:04 AM, Zheng Shao <[email protected]> wrote:
>> >>>
>> >>> I just corrected the wiki page. It would also be a good idea to support
>> >>> case-insensitive boolean values in the code.
>> >>>
>> >>> Zheng
>> >>>
>> >>> On Wed, Feb 17, 2010 at 9:27 AM, Brent Miller <[email protected]> wrote:
>> >>> > Thanks Adam, that works for me as well.
>> >>> > It seems that the hive.exec.compress.output property is case sensitive,
>> >>> > and when it is set to TRUE (as it is on the compressed storage page on
>> >>> > the wiki) it is ignored by Hive.
>> >>> >
>> >>> > -Brent
>> >>> >
>> >>> > On Tue, Feb 16, 2010 at 4:24 PM, Adam O'Donnell <[email protected]> wrote:
>> >>> >>
>> >>> >> Adding these to my hive-site.xml file worked fine:
>> >>> >>
>> >>> >> <property>
>> >>> >>   <name>hive.exec.compress.output</name>
>> >>> >>   <value>true</value>
>> >>> >>   <description>Compress output</description>
>> >>> >> </property>
>> >>> >>
>> >>> >> <property>
>> >>> >>   <name>mapred.output.compression.type</name>
>> >>> >>   <value>BLOCK</value>
>> >>> >>   <description>Block compression</description>
>> >>> >> </property>
>> >>> >>
>> >>> >> On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller <[email protected]> wrote:
>> >>> >> > Hello, I've seen issues similar to this one come up once or twice before,
>> >>> >> > but I haven't ever seen a solution to the problem that I'm having. I was
>> >>> >> > following the Compressed Storage page on the Hive wiki
>> >>> >> > (http://wiki.apache.org/hadoop/CompressedStorage) and realized that the
>> >>> >> > sequence files that are created in the warehouse directory are actually
>> >>> >> > uncompressed and larger than the originals.
>> >>> >> >
>> >>> >> > For example, I have a table 'test1' whose input data looks something like:
>> >>> >> >
>> >>> >> > 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
>> >>> >> > 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
>> >>> >> > 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
>> >>> >> > ...
>> >>> >> >
>> >>> >> > After creating a second table 'test1_comp' with the STORED AS SEQUENCEFILE
>> >>> >> > directive and the compression options SET as described in the wiki, I can
>> >>> >> > look at the resultant sequence files and see that they're just plain
>> >>> >> > (uncompressed) text:
>> >>> >> >
>> >>> >> > SEQ "org.apache.hadoop.io.BytesWritable org.apache.hadoop.io.Text (binary header bytes)
>> >>> >> > 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
>> >>> >> > 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
>> >>> >> > 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
>> >>> >> > 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141
>> >>> >> > ...
>> >>> >> >
>> >>> >> > I've tried messing around with different org.apache.hadoop.io.compress.*
>> >>> >> > options, but the sequence files always come out uncompressed. Has anybody
>> >>> >> > ever seen this or know a way to keep the data compressed? Since the input
>> >>> >> > text is so uniform, we get huge space savings from compression and would
>> >>> >> > like to store the data this way if possible.
>> >>> >> >
>> >>> >> > I'm using Hadoop 0.20.1 and Hive that I checked out from SVN about a week ago.
>> >>> >> >
>> >>> >> > Thanks,
>> >>> >> > Brent
>> >>> >>
>> >>> >> --
>> >>> >> Adam J. O'Donnell, Ph.D.
>> >>> >> Immunet Corporation
>> >>> >> Cell: +1 (267) 251-0070
>> >>>
>> >>> --
>> >>> Yours,
>> >>> Zheng
>>
>> --
>> Yours,
>> Zheng

--
Yours,
Zheng
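On the compressed SEQUENCEFILE question quoted above, a minimal hive-site.xml sketch of the settings usually involved. The property names are the standard Hadoop/Hive ones, but treat the values as illustrative; in particular, mapred.output.compression.codec is an assumption about what was missing in Brent's setup and is not confirmed anywhere in this thread:

<property>
  <name>hive.exec.compress.output</name>
  <!-- must be lowercase "true"; an uppercase TRUE is silently ignored, as Brent observed -->
  <value>true</value>
</property>

<property>
  <name>mapred.output.compression.type</name>
  <!-- RECORD or BLOCK; BLOCK usually compresses uniform rows far better -->
  <value>BLOCK</value>
</property>

<property>
  <name>mapred.output.compression.codec</name>
  <!-- the codec used to compress the SequenceFile output -->
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>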
