Yes. As Joydeep mentioned, bzip2 splitting support comes from hadoop 0.19.
What version of hadoop are you using?

Zheng

On Tue, Dec 2, 2008 at 8:45 PM, Josh Ferguson <[EMAIL PROTECTED]> wrote:

> It does indeed work; sorry for wasting your time. Is mapping these in
> parallel part of what hadoop offers?
> Josh
>
> On Dec 2, 2008, at 8:35 PM, Zheng Shao wrote:
>
> Hi Josh,
>
> Please file a jira. I tried this on our deployment and it worked.
>
> create table tmp_zshao_t4 (a string, b string) stored as textfile;
> load data inpath '/user/zshao/patch.txt.bz2' overwrite into table
> tmp_zshao_t4;
>
> Please let us know the svn revision if you did "svn co" from apache, or the
> svn.version file in the gz file if you downloaded from mirror.facebook.com.
>
> Zheng
> *From:* Joydeep Sen Sarma [mailto:[EMAIL PROTECTED]]
> *Sent:* Tuesday, December 02, 2008 7:57 PM
> *To:* [email protected]
> *Subject:* RE: Compression
>
> Please file a Jira in that case. This should work. There is a check to
> verify that the file type is consistent with the table storage format – but
> I believe this should only kick in for sequencefiles.
>
> ------------------------------
> *From:* Josh Ferguson [mailto:[EMAIL PROTECTED]]
> *Sent:* Tuesday, December 02, 2008 7:55 PM
> *To:* [email protected]
> *Subject:* Re: Compression
>
> I'm pretty sure that when I tried to load a .bz2 file into a TEXTFILE type
> table using LOAD DATA LOCAL, it complained at me. I'll have to try it out
> again, but I'm pretty sure it wasn't working.
>
> Josh
>
> On Dec 2, 2008, at 10:30 AM, Joydeep Sen Sarma wrote:
>
> Yes – from the jiras – bz2 is splittable in hadoop-0.19.
>
> Hive doesn't have to do anything to support this (although we haven't
> tested it). Please mark your tables as 'stored as textfile' (not sure if
> that's the default). As long as the file has a .bz2 extension and hadoop has
> the codec that matches that extension – hive will just rely on hadoop to
> open these files. We punt to hadoop to decide when/how to split files – so
> it should 'just work'.
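>
> To make that concrete, here is a minimal sketch of the extension lookup
> hadoop does (the class name and sample path are mine; the
> CompressionCodecFactory calls are real hadoop api):
>
> import java.io.BufferedReader;
> import java.io.InputStream;
> import java.io.InputStreamReader;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.compress.CompressionCodec;
> import org.apache.hadoop.io.compress.CompressionCodecFactory;
>
> // Hypothetical demo class: resolve a codec from the file extension and
> // read the first line, decompressing on the fly when a codec matches.
> public class CodecByExtension {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     Path path = new Path(args[0]); // e.g. /user/zshao/patch.txt.bz2
>     CompressionCodecFactory factory = new CompressionCodecFactory(conf);
>     CompressionCodec codec = factory.getCodec(path); // matched by extension only
>     InputStream in = path.getFileSystem(conf).open(path);
>     if (codec != null) {
>       in = codec.createInputStream(in); // decompress on the fly
>     }
>     BufferedReader reader = new BufferedReader(new InputStreamReader(in));
>     System.out.println(reader.readLine());
>     reader.close();
>   }
> }
>
> Hive never looks inside the file – if the extension maps to a codec, hadoop
> does the rest.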
>
> But we never tested it :) - so please let us know if it actually worked
> out.
>
> I am curious: is there a native codec for bz2 in hadoop? (Java codecs are
> too slow.)
>
> ------------------------------
> *From:* Josh Ferguson [mailto:[EMAIL PROTECTED]]
> *Sent:* Tuesday, December 02, 2008 1:38 AM
> *To:* [email protected]
> *Subject:* Re: Compression
>
> I'm not sure, from their wiki:
>
> Compressed Input
>
> Compressed files are difficult to process in parallel, since they cannot,
> in general, be split into fragments and independently decompressed. However,
> if the compression is block-oriented (e.g. bz2), splitting and parallel
> processing are easy to do.
>
> Pig has inbuilt support for processing .bz2 files in parallel (.gz support
> is coming soon). If the input file name extension is .bz2, Pig decompresses
> the file on the fly and passes the decompressed input stream to your load
> function. For example,
>
> A = LOAD 'input.bz2' USING myLoad();
>
> Multiple instances of myLoad() (as dictated by the degree of parallelism)
> will be created and each will be given a fragment of the *decompressed*
> version of input.bz2 to process.
>
> On Dec 2, 2008, at 1:32 AM, Zheng Shao wrote:
>
>
> Can you give a little more detail?
> For example, you tried a single .bz2 file as input, and the pig job has 2 or
> more mappers?
>
> I didn't know bz2 was splittable.
>
> Zheng
> On Tue, Dec 2, 2008 at 1:18 AM, Josh Ferguson <[EMAIL PROTECTED]> wrote:
> It is splittable because of how the compression uses blocks; Pig does this
> out of the box.
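>
> (Roughly: every compressed block in a bz2 stream begins with the same fixed
> 48-bit magic number, so a reader dropped at an arbitrary byte offset can
> scan forward to the next block boundary. A minimal sketch of that scan –
> class name mine, and note the magic is not byte-aligned, hence the
> bit-by-bit loop:)
>
> import java.io.FileInputStream;
> import java.io.InputStream;
>
> // Hypothetical demo: print the bit offsets where bz2 blocks begin.
> // A hit can in principle be a false positive inside compressed payload,
> // so a real splitter would validate by actually decoding from there.
> public class Bz2BlockScan {
>   private static final long BLOCK_MAGIC = 0x314159265359L; // BCD digits of pi
>   private static final long MASK = 0xFFFFFFFFFFFFL;        // low 48 bits
>
>   public static void main(String[] args) throws Exception {
>     InputStream in = new FileInputStream(args[0]);
>     long window = 0, bitsRead = 0;
>     int b;
>     while ((b = in.read()) != -1) {
>       for (int i = 7; i >= 0; i--) { // most significant bit first
>         window = ((window << 1) | ((b >> i) & 1)) & MASK;
>         bitsRead++;
>         if (bitsRead >= 48 && window == BLOCK_MAGIC) {
>           System.out.println("block starts at bit " + (bitsRead - 48));
>         }
>       }
>     }
>     in.close();
>   }
> }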
>
> Josh
>
>
> On Dec 2, 2008, at 1:14 AM, Zheng Shao wrote:
> It shouldn't be a problem for Hive to support it (by defining your own
> input/output file format that does the decompression on the fly), but we
> won't be able to parallelize the execution as we do with uncompressed text
> files and sequence files, since bz2 compression is not splittable.
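>
> (For reference, this is roughly what that looks like in the old mapred api –
> a sketch with a hypothetical class name, mirroring how TextInputFormat
> treats any file with a matching codec as unsplittable, so the whole .bz2
> goes to a single map task:)
>
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.compress.CompressionCodecFactory;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.TextInputFormat;
>
> // Hypothetical subclass: refuse to split any file that has a codec,
> // which is why a compressed input runs with exactly one mapper.
> public class NoSplitTextInputFormat extends TextInputFormat {
>   private CompressionCodecFactory codecs;
>
>   public void configure(JobConf conf) {
>     super.configure(conf);
>     codecs = new CompressionCodecFactory(conf);
>   }
>
>   protected boolean isSplitable(FileSystem fs, Path file) {
>     return codecs.getCodec(file) == null; // compressed => single split
>   }
> }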
>
> --
> Yours,
> Zheng


-- 
Yours,
Zheng
