Please file a Jira in that case; this should work. There is a check to verify 
that the file type is consistent with the table storage format - but I believe 
this should only kick in for SequenceFiles.

________________________________
From: Josh Ferguson [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 02, 2008 7:55 PM
To: [email protected]
Subject: Re: Compression

I'm pretty sure that when I tried to load a .bz2 file into a TEXTFILE-type 
table using LOAD DATA LOCAL, it complained at me. I'll have to try it out 
again, but I'm pretty sure it wasn't working.

Josh

On Dec 2, 2008, at 10:30 AM, Joydeep Sen Sarma wrote:


Yes - from the JIRAs - bz2 is splittable in Hadoop 0.19.

Hive doesn't have to do anything to support this (although we haven't tested 
it). Please mark your tables as 'stored as textfile' (not sure if that's the 
default). As long as the file has a .bz2 extension and Hadoop has the codec that 
matches that extension, Hive will just rely on Hadoop to open these files. We 
defer to Hadoop to decide when/how to split files - so it should 'just work'.

But we never tested it :) - so please let us know if it actually worked out.
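Something along these lines should be all that's required (the table name, 
column, and path below are just placeholders, and this assumes Hadoop has a 
bzip2 codec registered for the .bz2 extension):

-- table stored as plain text; Hive does not decompress anything itself
CREATE TABLE raw_logs (log_line STRING) STORED AS TEXTFILE;

-- the file is copied as-is into the table's directory;
-- Hadoop picks the codec by the .bz2 extension at read time
LOAD DATA LOCAL INPATH '/tmp/logs.bz2' INTO TABLE raw_logs;

-- any query should then read through the codec transparently
SELECT COUNT(1) FROM raw_logs;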

I am curious whether there's a native codec for bz2 in Hadoop (Java codecs are 
too slow).

________________________________
From: Josh Ferguson [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 02, 2008 1:38 AM
To: [email protected]<mailto:[EMAIL PROTECTED]>
Subject: Re: Compression

I'm not sure, from their wiki:

Compressed Input

Compressed files are difficult to process in parallel, since they cannot, in 
general, be split into fragments and independently decompressed. However, if 
the compression is block-oriented (e.g. bz2), splitting and parallel 
processing are easy to do.

Pig has inbuilt support for processing .bz2 files in parallel (.gz support is 
coming soon). If the input file name extension is .bz2, Pig decompresses the 
file on the fly and passes the decompressed input stream to your load function. 
For example,

A = LOAD 'input.bz2' USING myLoad();

Multiple instances of myLoad() (as dictated by the degree of parallelism) will 
be created and each will be given a fragment of the *decompressed* version of 
input.bz2 to process.

On Dec 2, 2008, at 1:32 AM, Zheng Shao wrote:



Can you give a little more detail?
For example, you tried a single .bz2 file as input, and the Pig job had 2 or 
more mappers?

I didn't know bz2 was splittable.

Zheng
On Tue, Dec 2, 2008 at 1:18 AM, Josh Ferguson <[EMAIL PROTECTED]> wrote:
It is splittable because of how the compression uses blocks; Pig does this out 
of the box.

Josh


On Dec 2, 2008, at 1:14 AM, Zheng Shao wrote:
It shouldn't be a problem for Hive to support it (by defining your own 
input/output file format that does the decompression on the fly), but we 
won't be able to parallelize the execution as we do with uncompressed text 
files and sequence files, since bz2 compression is not splittable.




--
Yours,
Zheng

