It does indeed work; sorry for wasting your time. Is mapping these in parallel part of what Hadoop offers?

Josh

On Dec 2, 2008, at 8:35 PM, Zheng Shao wrote:

Hi Josh,

Please file a jira. I tried this on our deployment and it worked.

create table tmp_zshao_t4 (a string, b string) stored as textfile;
load data inpath '/user/zshao/patch.txt.bz2' overwrite into table tmp_zshao_t4;

Please let us know the svn revision if you did “svn co” from Apache, or the contents of the svn.version file in the .gz file if you downloaded from mirror.facebook.com.

Zheng
From: Joydeep Sen Sarma [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 02, 2008 7:57 PM
To: [email protected]
Subject: RE: Compression

Please file a Jira in that case; this should work. There is a check to verify that the file type is consistent with the table storage format, but I believe this should only kick in for SequenceFiles.

From: Josh Ferguson [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 02, 2008 7:55 PM
To: [email protected]
Subject: Re: Compression

I'm pretty sure that when I tried to load a .bz2 file into a TEXTFILE-type table using LOAD DATA LOCAL, it complained at me. I'll have to try it again, but I'm pretty sure it wasn't working.

Josh

On Dec 2, 2008, at 10:30 AM, Joydeep Sen Sarma wrote:

Yes – from the JIRAs – bz2 is splittable in hadoop-0.19.

Hive doesn’t have to do anything to support this (although we haven’t tested it). Please mark your tables as ‘stored as textfile’ (not sure if that’s the default). As long as the file has a .bz2 extension and Hadoop has a codec that matches that extension, Hive will just rely on Hadoop to open these files. We punt to Hadoop to decide when/how to split files – so it should ‘just work’.
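To make the extension matching concrete, here is a minimal, untested Java sketch of the lookup Hadoop does. CompressionCodecFactory and getCodec are real Hadoop APIs; the class name CodecLookup and the file name are made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookup {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // The factory reads the configured codec list (io.compression.codecs)
    // and builds a map from file extension to codec.
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    // ".bz2" should resolve to BZip2Codec if it is registered.
    CompressionCodec codec = factory.getCodec(new Path("patch.txt.bz2"));
    System.out.println(codec == null ? "no matching codec"
                                     : codec.getClass().getName());
  }
}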

But we never tested it :) – so please let us know if it actually worked out.

I am curious: is there a native codec for bz2 in Hadoop? (Java codecs are too slow.)

From: Josh Ferguson [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 02, 2008 1:38 AM
To: [email protected]
Subject: Re: Compression

I'm not sure; from their wiki:

Compressed Input
Compressed files are difficult to process in parallel, since they cannot, in general, be split into fragments and independently decompressed. However, if the compression is block-oriented (e.g. bz2), the splitting and parallel processing are easy to do.

Pig has inbuilt support for processing .bz2 files in parallel (.gz support is coming soon). If the input file name extension is .bz2, Pig decompresses the file on the fly and passes the decompressed input stream to your load function. For example,

A = LOAD 'input.bz2' USING myLoad();
Multiple instances of myLoad() (as dictated by the degree of parallelism) will be created and each will be given a fragment of the *decompressed* version of input.bz2 to process.


On Dec 2, 2008, at 1:32 AM, Zheng Shao wrote:


Can you give a little more details?
For example, you tried a single .bz2 file as input, and the Pig job had 2 or more mappers?

I didn't know bz2 was splittable.

Zheng
On Tue, Dec 2, 2008 at 1:18 AM, Josh Ferguson <[EMAIL PROTECTED]> wrote:

It is splittable because of how the compression uses blocks; Pig does this out of the box.

Josh


On Dec 2, 2008, at 1:14 AM, Zheng Shao wrote:
It shouldn't be a problem for Hive to support it (by defining your own input/output file format that does the decompression on the fly), but we won't be able to parallelize the execution as we do with uncompressed text files and sequence files, since bz2 compression is not splittable.
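For what it's worth, a reader that decompresses on the fly would mostly be a thin wrapper over Hadoop's codec API. Here is a rough, untested sketch; the class name ReadCompressed is made up for illustration.

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class ReadCompressed {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]); // e.g. a .bz2 file on HDFS
    FileSystem fs = path.getFileSystem(conf);
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
    // Wrap the raw stream with the codec's decompressor; if no codec
    // matches the extension, read the file as-is.
    InputStream in = (codec == null) ? fs.open(path)
                                     : codec.createInputStream(fs.open(path));
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
  }
}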




--
Yours,
Zheng


