Hi All,
We have been using Hadoop (0.20.2+320) and Hive (0.5.0+20) for about a
month now to see if we could migrate our existing MySQL DB into a
Hadoop/Hive architecture (Hadoop/Hive rock, BTW! :) ). Unfortunately, we
are seeing slow response times on simple tasks such as a
count query (e.g. hive> select count(blah_id) from blah;). We
currently have 2.5B data points in a single table, and Hive takes
approximately 5-6 minutes to count these 2.5B records
(15-17 minutes for 6.8B records). The reduce phase is fast (a single
reducer, since this is a plain count query), but the map phase takes the
remainder of the time (~95%). We currently have 6 (4 x quad-core)
systems with approximately 24GB of RAM each. We have tried adding
more nodes, increasing the number of map tasktrackers (many different
values), changing the DFS block size (32MB, 64MB, 128MB, 256MB), enabling
LZO compression, and tuning many other configuration variables
(io.sort.factor, io.sort.mb), without much success in lowering the time
the count takes. (I do notice high IO wait on the nodes, no matter how
many tasktrackers I run.) The
size of the DB is approximately 200GB, and with MySQL it takes a few
seconds to run either count. (I am curious whether running this
locally, without any nodes, would give a quicker response time, since
the delay appears to be in the map stage.) I have come to believe
(and have read) that Hadoop/Hive is unfortunately not well suited to this
type of work and is instead suited to larger data sets. Does
anyone have A) ideas for improving performance and/or B) similar
experiences? I am also curious whether something like HBase would be
better suited to this type of data (small dataset, many files). We
appreciate any input, suggestions, or ideas!
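
As a rough sanity check on the timings above, the cluster-wide scan rate can be worked out with a few lines of Python. This is only back-of-the-envelope arithmetic: the midpoints of the quoted time ranges (5.5 and 16 minutes) and the even split of work across the 6 nodes are assumptions, not measurements.

```python
# Back-of-the-envelope scan rates from the timings quoted above.
# Assumptions (not measured): the midpoint of each quoted time range,
# and work spread evenly across the 6 nodes.

NODES = 6

# label -> (record count, assumed midpoint of the quoted runtime in minutes)
runs = {
    "2.5B records": (2.5e9, 5.5),   # quoted as 5-6 minutes
    "6.8B records": (6.8e9, 16.0),  # quoted as 15-17 minutes
}


def scan_rate(records, minutes):
    """Return records scanned per second across the whole cluster."""
    return records / (minutes * 60.0)


rates = {name: scan_rate(r, m) for name, (r, m) in runs.items()}

for name, rate in rates.items():
    print(f"{name}: {rate / 1e6:.1f}M records/s cluster-wide, "
          f"{rate / NODES / 1e6:.2f}M records/s per node")
```

Under these assumptions both runs come out at roughly 7M records/s, so the runtime appears to grow about linearly with the row count. That would be consistent with the map stage being bound by how fast the table can be scanned.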
Thank you!!
Paul Zimdars