Yes. As Joydeep mentioned, support for splitting bzip2 files comes from Hadoop 0.19. What version of Hadoop are you using?
Zheng

On Tue, Dec 2, 2008 at 8:45 PM, Josh Ferguson <[EMAIL PROTECTED]> wrote:

> It does indeed work, sorry for wasting your time. Is mapping these in
> parallel part of what Hadoop offers?
>
> Josh
>
> On Dec 2, 2008, at 8:35 PM, Zheng Shao wrote:
>
> Hi Josh,
>
> Please file a JIRA. I tried this on our deployment and it worked.
>
>   create table tmp_zshao_t4 (a string, b string) stored as textfile;
>   load data inpath '/user/zshao/patch.txt.bz2' overwrite into table
>   tmp_zshao_t4;
>
> Please let us know the svn revision if you did "svn co" from Apache, or the
> svn.version file in the gz file if you downloaded from mirror.facebook.com.
>
> Zheng
>
> *From:* Joydeep Sen Sarma [mailto:[EMAIL PROTECTED]]
> *Sent:* Tuesday, December 02, 2008 7:57 PM
> *To:* [email protected]
> *Subject:* RE: Compression
>
> Please file a JIRA in that case; this should work. There is a check to
> verify that the file type is consistent with the table storage format –
> but I believe this should only kick in for SequenceFiles.
>
> ------------------------------
> *From:* Josh Ferguson [mailto:[EMAIL PROTECTED]]
> *Sent:* Tuesday, December 02, 2008 7:55 PM
> *To:* [email protected]
> *Subject:* Re: Compression
>
> I'm pretty sure that when I tried to load a .bz2 file into a TEXTFILE type
> table using LOAD LOCAL DATA it complained at me. I'll have to try it out
> again, but I'm pretty sure it wasn't working.
>
> Josh
>
> On Dec 2, 2008, at 10:30 AM, Joydeep Sen Sarma wrote:
>
> Yes – from the JIRAs – bz2 is splittable in hadoop-0.19.
>
> Hive doesn't have to do anything to support this (although we haven't
> tested it). Please mark your tables as 'stored as textfile' (not sure if
> that's the default). As long as the file has a .bz2 extension and Hadoop
> has the codec that matches that extension, Hive will just rely on Hadoop
> to open these files. We punt to Hadoop to decide when/how to split files –
> so it should 'just work'.
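[Editor's note: the extension-to-codec matching Joydeep describes (Hadoop's codec factory picks a decompressor from the file suffix, so a file named `patch.txt.bz2` is opened transparently) can be sketched as below. This is an illustrative Python sketch, not Hadoop code; `CODECS` and `open_compressed()` are hypothetical names.]

```python
import bz2
import gzip
import os
import tempfile

# Hypothetical sketch of suffix-based codec dispatch: map a file extension
# to an opener that decompresses on the fly, falling back to plain open().
CODECS = {
    ".bz2": bz2.open,
    ".gz": gzip.open,
}

def open_compressed(path):
    """Open a file for reading, decompressing if the extension matches a codec."""
    _, ext = os.path.splitext(path)
    opener = CODECS.get(ext, open)
    return opener(path, "rt")

# Quick round-trip: write a small bz2 file, read it back through the dispatcher.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "patch.txt.bz2")
with bz2.open(path, "wt") as f:
    f.write("a\tb\n")
with open_compressed(path) as f:
    text = f.read()
```

The point of the dispatch table is that the reader never has to know which codec produced the file; the name alone selects it, which is why the thread's advice is simply to keep the `.bz2` extension on the loaded file.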
> But we never tested it :) – so please let us know if it actually worked
> out.
>
> I am curious: is there a native codec for bz2 in Hadoop? (Java codecs are
> too slow.)
>
> ------------------------------
> *From:* Josh Ferguson [mailto:[EMAIL PROTECTED]]
> *Sent:* Tuesday, December 02, 2008 1:38 AM
> *To:* [email protected]
> *Subject:* Re: Compression
>
> I'm not sure; from their wiki:
>
> Compressed Input
>
> Compressed files are difficult to process in parallel, since they cannot,
> in general, be split into fragments and independently decompressed.
> However, if the compression is block-oriented (e.g. bz2), the splitting
> and parallel processing is easy to do.
>
> Pig has inbuilt support for processing .bz2 files in parallel (.gz support
> is coming soon). If the input file name extension is .bz2, Pig decompresses
> the file on the fly and passes the decompressed input stream to your load
> function. For example,
>
>   A = LOAD 'input.bz2' USING myLoad();
>
> Multiple instances of myLoad() (as dictated by the degree of parallelism)
> will be created, and each will be given a fragment of the *decompressed*
> version of input.bz2 to process.
>
> On Dec 2, 2008, at 1:32 AM, Zheng Shao wrote:
>
> Can you give a little more detail?
> For example, you tried a single .bz2 file as input, and the Pig job had 2
> or more mappers?
>
> I didn't know bz2 was splittable.
>
> Zheng
>
> On Tue, Dec 2, 2008 at 1:18 AM, Josh Ferguson <[EMAIL PROTECTED]> wrote:
>
> It is splittable because of how the compression uses blocks; Pig does this
> out of the box.
>
> Josh
>
> On Dec 2, 2008, at 1:14 AM, Zheng Shao wrote:
>
> It shouldn't be a problem for Hive to support it (by defining your own
> input/output file format that does the decompression on the fly), but we
> won't be able to parallelize the execution as we do with uncompressed text
> files and sequence files, since bz2 compression is not splittable.
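[Editor's note: the "block-oriented" property the Pig wiki excerpt relies on can be illustrated with a small Python sketch. As a simplification, concatenated bz2 streams stand in here for bzip2's internal ~900 KB blocks; the real Hadoop splitter resynchronizes on block boundaries inside one stream, but the idea is the same: each compressed unit decompresses without seeing the rest of the file, so separate mappers can each take a fragment.]

```python
import bz2

# Two independently compressed fragments, concatenated into one "file".
part1 = bz2.compress(b"records 1-100\n")
part2 = bz2.compress(b"records 101-200\n")
whole = part1 + part2

# A reader handed only the second fragment still recovers its records,
# which is what makes parallel (split) processing possible.
fragment = bz2.decompress(part2)

# A reader given the whole file sees everything: Python's bz2.decompress
# handles the concatenation of multiple compressed streams.
everything = bz2.decompress(whole)
```

Contrast this with a single gzip deflate stream, where decompressing any byte requires the history of all preceding bytes; that is why the wiki notes `.gz` parallelism was still "coming soon".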
>
> --
> Yours,
> Zheng

--
Yours,
Zheng
