Re: Compression

Josh Ferguson Tue, 02 Dec 2008 01:38:40 -0800

I'm not sure. From their wiki:

Compressed Input

Compressed files are difficult to process in parallel, since they cannot, in general, be split into fragments and independently decompressed. However, if the compression is block-oriented (e.g. bz2), the splitting and parallel processing is easy to do.

Pig has inbuilt support for processing .bz2 files in parallel (.gz support is coming soon). If the input file name extension is .bz2, Pig decompresses the file on the fly and passes the decompressed input stream to your load function. For example,


A = LOAD 'input.bz2' USING myLoad();

Multiple instances of myLoad() (as dictated by the degree of parallelism) will be created and each will be given a fragment of the *decompressed* version of input.bz2 to process.
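To make the "block-oriented" point concrete, here is a rough Python sketch (not Pig's actual code, just an illustration) that looks for bzip2 block headers inside a compressed stream. Each bzip2 block begins with the 48-bit magic number 0x314159265359, and blocks are bit-aligned rather than byte-aligned, so the scan has to try all 8 bit shifts. The helper names find_block_offsets and make_sample, and the sample size, are made up for this example. Each offset found is a place where a reader could, in principle, start decompressing independently.

```python
import bz2
import random

# 48-bit magic number that starts every bzip2 compressed block.
BLOCK_MAGIC = bytes.fromhex("314159265359")


def find_block_offsets(compressed):
    """Return the sorted bit offsets of every block header in `compressed`.

    Hypothetical helper for illustration only. Blocks are bit-aligned, so
    the whole stream is shifted left by 0..7 bits and searched each time.
    """
    n_bytes = len(compressed)
    mask = (1 << (n_bytes * 8)) - 1
    as_int = int.from_bytes(compressed, "big")
    offsets = []
    for shift in range(8):
        # After shifting left by `shift` bits, a header that started at bit
        # offset 8*k + shift is byte-aligned at byte k.
        shifted = ((as_int << shift) & mask).to_bytes(n_bytes, "big")
        pos = shifted.find(BLOCK_MAGIC)
        while pos != -1:
            offsets.append(pos * 8 + shift)
            pos = shifted.find(BLOCK_MAGIC, pos + 1)
    return sorted(offsets)


def make_sample(n_bytes=350_000):
    """Deterministic pseudo-random bytes, large enough to span several blocks."""
    rng = random.Random(0)
    return rng.getrandbits(8 * n_bytes).to_bytes(n_bytes, "big")


sample = make_sample()
# compresslevel=1 selects the smallest block size (about 100 kB of input).
compressed = bz2.compress(sample, compresslevel=1)
offsets = find_block_offsets(compressed)
print(len(offsets), "block headers found at bit offsets", offsets)
```

The first header sits right after the 4-byte "BZh1" stream header, i.e. at bit offset 32. The later ones land wherever the previous block happened to end. A real splittable reader still has to deal with records that straddle block boundaries, which this sketch ignores.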



On Dec 2, 2008, at 1:32 AM, Zheng Shao wrote:

Can you give a few more details?
For example, did you try a single .bz2 file as input, and did the Pig job have 2 or more mappers?
I didn't know bz2 was splittable.

Zheng
On Tue, Dec 2, 2008 at 1:18 AM, Josh Ferguson <[EMAIL PROTECTED]> wrote:
It is splittable because of how the compression uses blocks. Pig does this out of the box.
Josh


On Dec 2, 2008, at 1:14 AM, Zheng Shao wrote:
It shouldn't be a problem for Hive to support it (by defining your own input/output file format that does the decompression on the fly), but we won't be able to parallelize the execution as we do with uncompressed text files and sequence files, since bz2 compression is not splittable.
--
Yours,
Zheng
