> Can you help put that information into appropriate place on the wiki
> (where you see fit)?
> Thanks for the help.

Will do.

> By the way, I guess we need to debug what went wrong with the
> "count(1)" queries. There is definitely something going wrong.

My bad here. I think I forgot to import some files when running the queries earlier. The counts are exactly the same. However, the timings for the "select count(1)" queries are very different:

#1 Uncompressed logs in textfile tables: 106 sec (file size of 7,686 MB over 8 uncompressed files)
#2 Compressed logs in textfile tables: 60 sec (file size of 736 MB over 8 compressed files)
#3 Compressed logs in sequencefile tables: 101 sec (file size of 4,773 MB over 126 compressed files)

> For the timing, how much mapper slots do you have in your cluster?

I have a 4-node cluster with mapred.reduce.tasks=17. Is that what you mean by mapper slots?

> Approach #3:
> a) import gzip files into textfile table
> b) set hive.exec.compress.output to true
> c) inserted into sequencefile table
> This will create bigger sequencefiles which will help reducing the
> overhead. This is better than Approach #2 because jobs from the
> sequencefile tables will have more mappers.

This is exactly what I did in #3 above. But from those benchmarks, #2 seems to give the best results, both in terms of file size and speed. Is that not what you were expecting?

Saurabh.
-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com
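
For reference, the three steps of Approach #3 quoted above could be sketched in HiveQL roughly as follows. This is only an illustration, not the statements actually run for the benchmark: the table names (logs_text, logs_seq), the single-column schema, and the input path are hypothetical, and the block compression setting is an assumption that the thread does not mention.

```sql
-- a) import the gzip files into a textfile table
--    (hypothetical table name, schema, and path)
CREATE TABLE logs_text (line STRING) STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/path/to/gzipped/logs' INTO TABLE logs_text;

-- b) turn on compressed output for the insert job
SET hive.exec.compress.output=true;
-- assumption: block-level sequencefile compression, not stated in the thread
SET io.seqfile.compression.type=BLOCK;

-- c) insert into a sequencefile table
CREATE TABLE logs_seq (line STRING) STORED AS SEQUENCEFILE;
INSERT OVERWRITE TABLE logs_seq SELECT * FROM logs_text;

-- the query that was timed against each table
SELECT count(1) FROM logs_seq;
```

Running the same count(1) against logs_text and logs_seq would give the kind of comparison reported in #2 and #3.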
