There is a config, hive.exec.compress.output (please double check), to control whether to compress the final data or not.
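Something like this should do it (just a sketch, and the exact property names can vary between Hive/Hadoop versions, so please verify on your cluster):

    -- turn on compression for the final output written by the query
    SET hive.exec.compress.output=true;
    -- choose the codec; GzipCodec is only an example, any installed CompressionCodec works
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

If hive.exec.compress.output is left at its default (false), the RCFile data is written uncompressed, which could explain why the RC table ends up about the same size as the plain text.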
Maybe you can just try to convert the data directly from Zebra; I sent out some code to do that. Have you tried? Also, I think it's good to first test on some small data before trying it on such a large dataset.

On Friday, June 11, 2010, Viraj Bhat <[email protected]> wrote:

> Hi all,
>
> I have some data in Zebra, around 9 TB, which I first converted to PlainText using the TextOutputFormat in M/R, and it resulted in around 43.07 TB. [[I think I used no compression here.]]
>
> I then converted this data to RC on the hive console as:
>
> CREATE TABLE LARGERC
> ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
> STORED AS RCFile
> LOCATION '/user/viraj/huge' AS
> SELECT * FROM PLAINTEXT;
>
> (PLAINTEXT is the external table which is 43.07 TB in size.)
>
> The overall size of these files was around 41.65 TB. I suspect that some compression was not being applied.
>
> I read the following documentation:
> http://hadoop.apache.org/hive/docs/r0.4.0/api/org/apache/hadoop/hive/ql/io/RCFile.html
> and it says: “The actual compression algorithm used to compress key and/or values can be specified by using the appropriate CompressionCodec”
>
> a) What is the default Codec that is being used?
> b) Any thoughts on how I can reduce the size?
>
> Viraj
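P.S. With the settings above in place, a rerun of the conversion might look roughly like this (LARGERC_GZ is just a placeholder table name, and I have not tried this on your data):

    SET hive.exec.compress.output=true;
    CREATE TABLE LARGERC_GZ
    ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
    STORED AS RCFile
    AS
    SELECT * FROM PLAINTEXT;

You could run it over a small sample of the plain text data first and compare the on-disk sizes before touching the full 43 TB.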
