I'm pretty sure that when I tried to load a .bz2 file into a TEXTFILE
table using LOAD DATA LOCAL INPATH, it complained at me. I'll have
to try it again, but I'm pretty sure it wasn't working.
Josh
On Dec 2, 2008, at 10:30 AM, Joydeep Sen Sarma wrote:
Yes, according to the JIRAs, bz2 is splittable in hadoop-0.19.
Hive doesn’t have to do anything to support this (although we
haven’t tested it). Please mark your tables as ‘stored as
textfile’ (not sure if that’s the default). As long as the file has a
.bz2 extension and Hadoop has the codec that matches that extension,
Hive will just rely on Hadoop to open these files. We punt to Hadoop
to decide when/how to split files, so it should ‘just work’.
But we never tested it :) so please let us know if it actually
worked out.
I am curious whether there’s a native codec for bz2 in Hadoop (the Java
codecs are too slow).
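
The extension-based codec selection described in this message can be
illustrated with a minimal Python sketch. It is only an analogy, not Hadoop
or Hive code: the OPENERS table and open_by_extension() are hypothetical
names made up for this example, and only the standard bz2 and gzip modules
are assumed.

```python
import bz2
import gzip
import os

# Hypothetical extension-to-opener table, loosely analogous to the way a
# file extension is matched to a registered compression codec.
OPENERS = {".bz2": bz2.open, ".gz": gzip.open}


def open_by_extension(path, mode="rt"):
    """Open path, decompressing on the fly if its extension is recognized.

    Files with an unknown extension are opened as plain text.
    """
    ext = os.path.splitext(path)[1]
    opener = OPENERS.get(ext, open)
    return opener(path, mode)
```

With this, a caller reads 'input.bz2' and 'input.txt' the same way; the
choice of decompressor depends only on the file name, which is roughly why
the table definition itself would not need to mention compression.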
From: Josh Ferguson [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 02, 2008 1:38 AM
To: [email protected]
Subject: Re: Compression
I'm not sure, from their wiki:
Compressed Input
Compressed files are difficult to process in parallel, since they
cannot, in general, be split into fragments and independently
decompressed. However, if the compression is block-oriented (e.g.
bz2), the splitting and parallel processing is easy to do.
Pig has inbuilt support for processing .bz2 files in parallel (.gz
support is coming soon). If the input file name extension is .bz2,
Pig decompresses the file on the fly and passes the decompressed
input stream to your load function. For example,
A = LOAD 'input.bz2' USING myLoad();
Multiple instances of myLoad() (as dictated by the degree of
parallelism) will be created and each will be given a fragment of
the *decompressed* version of input.bz2 to process.
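
The idea that block-oriented compression allows independent decompression
can be sketched in Python. This is a simplification under stated
assumptions: each group of lines is compressed as its own bz2 stream, whereas
a real splitter would have to locate block boundaries inside a single
stream. compress_in_blocks() and process_block() are hypothetical helper
names, not part of Pig or Hadoop.

```python
import bz2


def compress_in_blocks(lines, lines_per_block):
    """Compress each group of lines as its own independent bz2 stream."""
    blocks = []
    for i in range(0, len(lines), lines_per_block):
        chunk = "".join(lines[i:i + lines_per_block]).encode("utf-8")
        blocks.append(bz2.compress(chunk))
    return blocks


def process_block(block):
    """Stand-in for one worker: decompress one block and count its lines."""
    text = bz2.decompress(block).decode("utf-8")
    return len(text.splitlines())


lines = [f"record {n}\n" for n in range(10)]
blocks = compress_in_blocks(lines, 4)

# Each block is handled without looking at any other block, so the calls
# below could run in parallel on different workers.
counts = [process_block(b) for b in blocks]

# The concatenated blocks still decompress as one multi-stream bz2 payload.
whole = bz2.decompress(b"".join(blocks)).decode("utf-8")
```

Because no block depends on its neighbours, the per-block work can be
handed to separate workers, which is the property the wiki text relies on.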
On Dec 2, 2008, at 1:32 AM, Zheng Shao wrote:
Can you give a few more details?
For example, did you try a single .bz2 file as input, and did the Pig job
have 2 or more mappers?
I didn't know bz2 was splittable.
Zheng
On Tue, Dec 2, 2008 at 1:18 AM, Josh Ferguson <[EMAIL PROTECTED]>
wrote:
It is splittable because of how the compression uses blocks, Pig
does this out of the box.
Josh
On Dec 2, 2008, at 1:14 AM, Zheng Shao wrote:
It shouldn't be a problem for Hive to support it (by defining your
own input/output file format that does the decompression on the
fly), but we won't be able to parallelize the execution as we do
with uncompressed text files and sequence files, since bz2
compression is not splittable.
--
Yours,
Zheng