> Can you help put that information into appropriate place on the wiki
> (where you see fit)?
> Thanks for the help.


Will do.


> By the way, I guess we need to debug what went wrong with the
> "count(1)" queries. There is definitely something going wrong.


My bad here. I think I forgot to import some files when I ran the queries
earlier. After re-running, the counts are exactly the same. However, the
timings for the "select count(1)" queries are very different:

#1 Uncompressed logs in textfile tables: 106 sec
   (7,686 MB over 8 uncompressed files)
#2 Compressed logs in textfile tables: 60 sec
   (736 MB over 8 compressed files)
#3 Compressed logs in sequencefile tables: 101 sec
   (4,773 MB over 126 compressed files)
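
To make the comparison easier to read, here is a rough sketch that derives
size ratio, average file size and speedup from the three runs above. It
assumes all three tables hold the same logs, so #1 can serve as the
baseline; the variable and function names are just mine for illustration.

```python
# Figures copied from the three runs above:
# (label, elapsed seconds, on-disk size in MB, number of files)
runs = [
    ("#1 uncompressed textfile", 106, 7686, 8),
    ("#2 compressed textfile", 60, 736, 8),
    ("#3 compressed sequencefile", 101, 4773, 126),
]


def summarize(runs):
    """Compare every run against the first one (assumed to be the baseline)."""
    _, base_seconds, base_mb, _ = runs[0]
    rows = []
    for label, seconds, size_mb, files in runs:
        rows.append({
            "label": label,
            "ratio": size_mb / base_mb,        # on-disk size relative to #1
            "mb_per_file": size_mb / files,    # average file size
            "speedup": base_seconds / seconds, # wall-clock speedup over #1
        })
    return rows


summary = summarize(runs)
for row in summary:
    print(
        f"{row['label']}: {row['ratio']:.0%} of raw size, "
        f"{row['mb_per_file']:.1f} MB/file, {row['speedup']:.2f}x vs #1"
    )
```

This only restates the numbers I measured; it says nothing about why the
sequencefile run is slower.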



> For the timing, how much mapper slots do you have in your cluster?


I have a 4-node cluster with mapred.reduce.tasks=17. Is that what you mean by
mapper slots?


> Approach #3:
> a) import gzip files into textfile table
> b) set hive.exec.compress.output to true
> c) inserted into sequencefile table
> This will create bigger sequencefiles which will help reducing the
> overhead. This is better than Approach #2 because jobs from the
> sequencefile tables will have more mappers.


This is exactly what I did in #3 above. But from those benchmarks, #2 seems
to give the best results in terms of both file size and speed. Is that not
what you were expecting?

Saurabh.
-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com
