Hi All,

We have been using Hadoop (0.20.2+320) and Hive (0.5.0+20) for about a month now to see whether we can migrate our existing MySQL DB to a Hadoop/Hive architecture (Hadoop/Hive rock, BTW! :) ). Unfortunately, we are seeing slow response times on simple tasks such as a count query (e.g. hive> select count(blah_id) from blah;). We currently have 2.5B data points in a single table, and Hive takes approximately 5-6 minutes to count those 2.5B records (15-17 minutes for 6.8B records). The reduce portion is fast (a single reducer, since this is a global count), but the map stage takes the rest of the time (~95%).

We currently have 6 systems (4 x quad core) with approximately 24GB of RAM each. We have tried adding more nodes, increasing the map slots per tasktracker (many different #s), changing the DFS block size (32M, 64M, 128M, 256M), enabling LZO compression, and many, many other configuration variables (io.sort.factor, io.sort.mb) without much success in lowering the time it takes to complete the count (one example round of these overrides is pasted below). I do notice high I/O wait on the nodes, no matter how many tasktrackers I run.

The size of the DB is approximately 200GB, and MySQL takes only a few seconds to do both the 2.5B and 6.7B counts. (I am curious whether running this locally, without any nodes, would give a quicker response time, since the delay appears to be in the map stage...) I have come to believe (and read) that Hadoop/Hive is unfortunately not well suited for this type of work and is instead suited to larger data sets.

I am curious whether anyone has A) ideas for improving performance and/or B) similar experiences to share. I am also curious whether something like HBase would be better suited to this type of data (small dataset, many files). We appreciate any input, suggestions, or ideas!
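In case the exact knobs matter, here is roughly what one round of those configuration overrides looked like. The values shown are just one of the many combinations we tried, and the LZO codec class is the one from the hadoop-lzo build we installed, so please treat this as a sketch rather than our exact config:

<!-- hdfs-site.xml: DFS block size (we tried 32M, 64M, 128M, 256M; only affects data written after the change) -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>

<!-- mapred-site.xml: map slots per tasktracker (tried many values) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>

<!-- mapred-site.xml: sort buffer tuning -->
<property>
  <name>io.sort.mb</name>
  <value>256</value>
</property>
<property>
  <name>io.sort.factor</name>
  <value>25</value>
</property>

<!-- mapred-site.xml: LZO-compress intermediate map output -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>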

Thank you!!
Paul Zimdars
