There is some work in this direction in Hadoop land, but it has not been committed yet: https://issues.apache.org/jira/browse/HADOOP-4012
For the short term, we won't be able to split bzip files.

If your bzip files are generated outside of Hadoop, please split the files before compressing them (so that you load many smaller files into Hadoop/Hive). If your bzip files are generated by Hadoop/Hive, please change the output file format to SequenceFile. SequenceFiles are splittable.

Zheng

On Tue, Jul 21, 2009 at 12:37 PM, Ashish Thusoo<[email protected]> wrote:
> I don't think these are splittable. Compression on sequencefiles is
> splittable across sequencefile blocks.
>
> Ashish
>
> -----Original Message-----
> From: Bill Craig [mailto:[email protected]]
> Sent: Tuesday, July 21, 2009 8:06 AM
> To: [email protected]
> Subject: bz2 Splits.
>
> I loaded 5 files of bzip2 compressed data into a table in Hive. Three are
> small test files containing 10,000 records. Two were large, ~8 GB compressed.
> When I run a query against the table I see three tasks that complete almost
> immediately and two tasks that run for a very long time. It appears to me
> that Hive/Hadoop is not splitting the input of the *.bz2 files. I have seen
> some old mails about this, but could not find any resolution for this
> problem. I compressed the files using the Apache bz2 jar; the files are
> named *.bz2. I am using Hadoop 0.19.1 r745977.

--
Yours,
Zheng
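
For the "split before compressing" workaround, here is a minimal sketch in Python, assuming the source data is an uncompressed, line-oriented text file. The function name, the output naming scheme, and the default chunk size are hypothetical and only for illustration; pick a chunk size that gives you pieces of a reasonable size for your cluster.

```python
import bz2
import os
from itertools import islice


def split_and_compress(src_path, out_dir, lines_per_chunk=1_000_000):
    """Split a text file on line boundaries and bzip2-compress each piece.

    Returns the list of .bz2 files written. Each output file is an
    independent bzip2 stream, so each one can be read by its own task.
    """
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.basename(src_path)
    out_paths = []
    with open(src_path, "rb") as src:
        part = 0
        while True:
            # Take at most lines_per_chunk lines from the current position.
            chunk = islice(src, lines_per_chunk)
            first = next(chunk, None)
            if first is None:
                break  # no more input
            # Hypothetical naming scheme: <name>.partNNNNN.bz2
            out_path = os.path.join(out_dir, f"{base}.part{part:05d}.bz2")
            with bz2.open(out_path, "wb") as dst:
                dst.write(first)
                for line in chunk:
                    dst.write(line)
            out_paths.append(out_path)
            part += 1
    return out_paths
```

The resulting files can then be loaded into the table in place of the single large file, so that each small file is read by its own task. Records are never cut in the middle because the split happens on line boundaries.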
