>> If you want to load data (in compressed/uncompressed text format) into
>> a table, you have to define the table as "stored as textfile" instead
>> of "stored as sequencefile".
I tried both approaches.

Approach #1:
a) gunzipped the log file
b) imported it into a textfile table
c) set hive.exec.compress.output to true
d) inserted into a sequencefile table

This seems to have given me 125 files named 'attempt_*' in the partition's directory, all under 10MB. (How do I find out the total size of a directory? I need to see how much space the compression saved.)

Approach #2:
imported the gzip log files into a textfile table

The files seem to have been copied as-is into the partition's directory. But every query is always split into 8 maps (which is the number of files I imported). This, I guess, won't help me much, because I would be underutilizing the map capacity I have.

Here's something interesting. I ran a SELECT COUNT(1) on all three tables and got different results and wildly different response times:

Gunzipped files imported into textfile table: 8,259,720 (108 sec)
Sequencefile table populated by step 1d above: 8,316,946 (114 sec)
Gzip files imported into textfile table: 8,619,980 (50 sec)

How can a simple row count differ? And, surprisingly, fewer maps resulted in better performance!

Saurabh.
--
http://nandz.blogspot.com
http://foodieforlife.blogspot.com
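
P.S. In case it helps to see what I mean by approach #1, it was roughly the following. The table names, the single column, the partition value and the file path are made up for illustration; only the order of the steps matches what is described above.

```sql
-- Sketch of approach #1. logs_text, logs_seq, the 'line' column, the dt
-- partition and /tmp/access.log are hypothetical placeholders.

-- a) the log file was gunzipped beforehand, outside Hive

-- b) load the plain-text log into a textfile table
CREATE TABLE logs_text (line STRING)
PARTITIONED BY (dt STRING)
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/tmp/access.log'
INTO TABLE logs_text PARTITION (dt='2009-01-01');

-- c) ask Hive to compress the output of the following insert
SET hive.exec.compress.output=true;

-- d) copy the rows into a sequencefile table
CREATE TABLE logs_seq (line STRING)
PARTITIONED BY (dt STRING)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE logs_seq PARTITION (dt='2009-01-01')
SELECT line FROM logs_text WHERE dt='2009-01-01';
```

Approach #2 is just step b) pointed at the .gz files directly, with no insert into a second table.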
