Hadoop/Hive observations

Paul Zimdars Mon, 13 Sep 2010 22:42:22 -0700

 Hi All,

We have been using Hadoop (0.20.2+320) and Hive (0.5.0+20) for about a month now to see if we could migrate our existing MySQL DB into a Hadoop/Hive architecture (hadoop/hive rock BTW! :) ). Unfortunately, we are experiencing slow response times while doing simple tasks such as a count DB query (e.g. hive> select count(blah_id) from blah;). We currently have 2.5B data points residing in a single table, and Hive takes approximately 5-6 minutes to do a count of these 2.5B records (15-17 minutes for 6.8B records). The reduce portion is fast (single reduce since this is a count * query) but the map stage takes the remainder of the time (~95%).

We currently have 6 (4 x quad core) systems with approximately 24GB of RAM each. We have attempted to add more nodes, increase map tasktrackers (many different #s), change DFS block size (32M, 64M, 128M, 256M), LZO compression, and many, many other configuration variables (io.sort.factor, io.sort.mb) without much success in lowering the time it takes to complete the count (I do notice a high IO wait on the nodes, no matter how many tasktrackers I run).

The size of the DB is approximately 200GB, and with MySQL it takes a few seconds to do both the 2.5B and 6.7B count (I am curious if running this locally without any nodes would result in a quicker response time, since the delay appears to be in the mapping stage...).

I have come to believe (and read) that hadoop/hive is unfortunately not well suited for this type of work and instead is suited for larger data sets. I am curious if anyone has any ideas on A) improving performance and/or B) similar experiences? I am also curious if maybe something like HBase would be better suited for this type of data (small dataset, many files). We appreciate any input, suggestions, or ideas!
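As a rough sanity check on the figures quoted above, here is a minimal back-of-envelope sketch. It assumes the ~200GB figure is the amount of data the 2.5B-row count actually scans, that a run takes about 5.5 minutes, and that each map task reads one full DFS block. The function names are hypothetical helpers, not part of Hadoop or Hive.

```python
# Back-of-envelope estimates from the numbers in this post.
# ASSUMPTIONS (not confirmed by the post): ~200 GB is what the count scans,
# a run takes ~5.5 minutes, and one map task reads exactly one DFS block.

DATA_GB = 200
NODES = 6
SCAN_SECONDS = 5.5 * 60


def scan_throughput_mb_s(data_gb, seconds, nodes):
    """Return (aggregate, per-node) scan rate in MB/s."""
    aggregate = data_gb * 1024 / seconds
    return aggregate, aggregate / nodes


def map_task_count(data_gb, block_mb):
    """Number of map tasks if each one reads a single full block."""
    return -(-data_gb * 1024 // block_mb)  # ceiling division


if __name__ == "__main__":
    agg, per_node = scan_throughput_mb_s(DATA_GB, SCAN_SECONDS, NODES)
    print(f"aggregate scan rate: {agg:.0f} MB/s, per node: {per_node:.0f} MB/s")
    for block_mb in (32, 64, 128, 256):
        print(f"block size {block_mb} MB: {map_task_count(DATA_GB, block_mb)} map tasks")
```

If the per-node rate comes out close to what the local disks can sustain for sequential reads, the high IO wait would suggest the map stage is disk-bound rather than limited by the number of tasktrackers. This is only an estimate under the assumptions stated above.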


Thank you!!
Paul Zimdars
