Damn, thanks for the tip, Carl. I forgot that the version of Hadoop Amazon is running is a little old.
Any ideas on getting table compression on output to work? For example, do I have to specify hive.exec.compress.output on every run, or should I put that into my hive-site.xml? I assume that isn't stored in the metadata store, is it? Also, how efficient is the block storage? Is there a knob I can adjust on that? Thanks!

On Jan 25, 2010, at 10:58 AM, Carl Steinbach wrote:

> Hi Adam,
>
> Hive actually relies on the underlying Hadoop implementation for compression
> support, i.e. whether or not Hive can support bz2 compressed files depends on
> whether or not the Hadoop cluster the files are stored in supports the bzip2
> compression codec. Support for bzip2 was added in Hadoop 0.19, and it looks
> like Amazon's EMR is running a variant of Hadoop 0.18.3, which supports gzip
> but not bzip2.
>
> There is a discussion of these issues on the Amazon EMR help forum here:
> http://developer.amazonwebservices.com/connect/thread.jspa?messageID=145636
>
> Thanks.
>
> Carl
>
> On Mon, Jan 25, 2010 at 10:38 AM, Adam J. O'Donnell <[email protected]> wrote:
>
> All:
>
> I have some questions regarding Hive that I hope you can help me with. I
> haven't had too much luck with the documentation on these, so any tips would
> be much appreciated.
>
> I initially posted these on the Amazon Elastic MapReduce message board, since
> some are S3 related, but I have gotten no love there.
>
> - Can you create an external table that covers .bz2 files? For example, if I
> push a bunch of log files to a directory that are .bz2 compressed, can I
> directly select rows from the external table? If not, what is the best way of
> loading the .bz2 into a temporary Hive table such that I can do wildcarding?
>
> All of these files are in subdirectories, with the directory names serving as
> partition names.
>
> - Is there some trick to storing compressed Hive tables that isn't clearly
> documented? I tried the recipe in the Hive tutorial but didn't have much
> luck. Anyone here have any success? This is using Hive 0.4.0 in Amazon's
> cloud.
>
> - Has anyone tried to compress the tables via .bz2 instead? Is there an
> easy way of stream compressing it when using the s3 interfaces?
>
> - What is more efficient: storing the tables in S3 as s3:// or s3n://?
>
> Thanks for your help!
>
> Adam
>
> --
> Adam J. O'Donnell, Ph.D.
> Immunet Corporation
> Cell: +1 (267) 251-0070
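For anyone finding this thread later, the output-compression settings being discussed can be set per-session with SET, or persisted as <property> entries in hive-site.xml so they apply to every run. A minimal sketch, assuming Hive 0.4-era property names (the table names here are made up); BLOCK-level compression only applies when the output is stored as SequenceFiles:

```sql
-- Compress the final job output; the codec itself is a Hadoop-level
-- setting, so it is limited to what the cluster supports (gzip on
-- Hadoop 0.18.x, bzip2 only from 0.19 on).
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
-- For SequenceFile output, compress whole blocks rather than
-- individual records (this is the "block storage" knob):
SET mapred.output.compression.type=BLOCK;
-- Optionally also compress data between intermediate map/reduce stages:
SET hive.exec.compress.intermediate=true;

-- Hypothetical example: rewrite a raw table into a compressed one.
INSERT OVERWRITE TABLE logs_compressed
SELECT * FROM logs_raw;
```

To make these the default instead of repeating the SET commands, each one goes into hive-site.xml as a <property> with matching <name> and <value> elements; they are client-side configuration and, as suspected above, are not recorded in the metastore.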
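On the external-table question: a TEXTFILE table reads its files through Hadoop's codec layer, so compressed files sitting in partition directories can be queried directly as long as the cluster has the codec (gzip on Hadoop 0.18.x; .bz2 would need 0.19+). A sketch under those assumptions, with invented names and S3 paths, and directory names doubling as partition values as described above:

```sql
-- Hypothetical layout: s3n://mybucket/logs/dt=2010-01-24/*.gz
CREATE EXTERNAL TABLE logs (
  ts STRING,
  msg STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3n://mybucket/logs/';

-- Partitions are not discovered automatically; each directory has to be
-- registered with the metastore before its rows are visible:
ALTER TABLE logs ADD PARTITION (dt='2010-01-24')
  LOCATION 's3n://mybucket/logs/dt=2010-01-24/';

SELECT COUNT(1) FROM logs WHERE dt = '2010-01-24';
```

No explicit load step is needed; dropping new compressed files into a registered partition directory makes them queryable on the next SELECT.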
