Hi Adam,

Hive actually relies on the underlying Hadoop implementation for compression support, i.e. whether Hive can read bz2-compressed files depends on whether the Hadoop cluster the files are stored in supports the bzip2 compression codec. Support for bzip2 was added in Hadoop 0.19, and it looks like Amazon's EMR is running a variant of Hadoop 0.18.3, which supports gzip but not bzip2.
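For gzip, things should work out of the box: Hadoop advertises its codecs in the io.compression.codecs property of the cluster configuration, and Hive picks the decompression codec from the file extension, so an external table can point directly at a directory of gzip'd logs. A minimal sketch (the table, column, and bucket names here are made up for illustration):

```sql
-- Hypothetical example: an external table over gzip-compressed log files,
-- with subdirectory names mapped to a partition column. Hive chooses the
-- codec from the file extension (.gz here), so no table-level compression
-- setting is needed for reading.
CREATE EXTERNAL TABLE logs (
  ts STRING,
  message STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://my-bucket/logs/';

-- Each partition directory has to be registered before it is queryable:
ALTER TABLE logs ADD PARTITION (dt='2010-01-25')
  LOCATION 's3n://my-bucket/logs/dt=2010-01-25/';
```

The same DDL would cover .bz2 files on a cluster whose Hadoop registers the bzip2 codec, which is why the Hadoop version is the deciding factor here.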
There is a discussion of these issues on the Amazon EMR help forum here:
http://developer.amazonwebservices.com/connect/thread.jspa?messageID=145636

Thanks.

Carl

On Mon, Jan 25, 2010 at 10:38 AM, Adam J. O'Donnell <[email protected]> wrote:

> All:
>
> I have some questions regarding hive that I hope you can help me with. I
> haven't had too much luck with the documentation on these, so any tips would
> be much appreciated.
>
> I initially posted these on the amazon elastic mapreduce message board,
> since some are S3 related, but I have gotten no love there.
>
> - Can you create an external table that covers .bz2 files? For example, if
> I push a bunch of log files to a directory that are .bz2 compressed, can I
> directly select rows from the external table? If not, what is the best way
> of loading the .bz2 into a temporary Hive table such that I can do
> wildcarding?
>
> All of these files are in subdirectories, with the directory names serving
> as partition names.
>
> - Is there some trick to storing compressed Hive tables that isn't clearly
> documented? I tried the recipe in the Hive tutorial but didn't have much
> luck. Anyone here have any success? This is using Hive 0.4.0 in amazon's
> cloud.
>
> - Has anyone tried to compress the tables via .bz2 instead? Is there an
> easy way of stream compressing it when using the s3 interfaces?
>
> - What is more efficient: storing the tables in S3 as s3:// or s3n://?
>
> Thanks for your help!
>
> Adam
