Hi all, I have around 9 TB of data in Zebra, which I first converted to plain text using the TextOutputFormat in M/R; the result was around 43.07 TB. (I believe no compression was used in this step.)
I then converted this data to RC format from the Hive console:

CREATE TABLE LARGERC
  ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
  STORED AS RCFile
  LOCATION '/user/viraj/huge'
AS SELECT * FROM PLAINTEXT;

(PLAINTEXT is the external table that is 43.07 TB in size.) The resulting files total around 41.65 TB, so I suspect compression is not being applied.

I read the following documentation:
http://hadoop.apache.org/hive/docs/r0.4.0/api/org/apache/hadoop/hive/ql/io/RCFile.html

It says: "The actual compression algorithm used to compress key and/or values can be specified by using the appropriate CompressionCodec"

a) What is the default codec being used?
b) Any thoughts on how I can reduce the size?

Viraj
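P.S. From my reading of the Hive/Hadoop configuration docs, I was going to retry with output compression enabled explicitly before the CTAS. This is only a sketch of what I think applies (the property names are my assumption, and the table/location names below are just placeholders):

```sql
-- Assumption: Hive output compression is off by default and must be enabled.
SET hive.exec.compress.output=true;
-- Assumption: this picks the codec instead of relying on the default
-- (DefaultCodec, i.e. zlib/deflate?).
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- Placeholder table and location names, for illustration only.
CREATE TABLE LARGERC_GZ
  ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
  STORED AS RCFile
  LOCATION '/user/viraj/huge_gz'
AS SELECT * FROM PLAINTEXT;
```

Is that the right way to do it, or is there an RCFile-specific setting I am missing?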
