Hi list,
I'm going to take Hive into production to analyze our web logs, which
amount to hundreds of gigabytes per day. Previously we did this job with
Apache Hadoop, running our own raw MapReduce code. It worked, but it cost
us a lot of productivity: we kept writing code with similar logic over and
over, and things got even worse whenever the format of our logs changed.
For example, when we wanted to insert one more field into each line of the
log, the previous work became useless and we had to redo it. Hence we are
thinking about using Hive as a persistence layer, so that we can store and
retrieve the schemas of our data easily.
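To make the idea concrete, here is roughly what we have in mind (the table
and column names below are made up just for illustration):

    -- hypothetical web log table; names are illustrative only
    CREATE TABLE weblogs (
      ip      STRING,
      request STRING,
      status  INT,
      bytes   BIGINT
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- when the log format gains a field, evolve the schema in place
    -- instead of rewriting every job
    ALTER TABLE weblogs ADD COLUMNS (referrer STRING);

That way a format change becomes a schema change rather than a rewrite of
our MapReduce code.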
But we found that sometimes Hive cannot express certain kinds of complex
analysis, because of the limited expressive power of SQL. We have to write
our own UDFs, and even then there are some problems Hive still cannot get
through.
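For the cases a UDF does cover, we would plug it in from HiveQL roughly
like this (the jar path, class name and function name are hypothetical):

    -- assumes a hand-written Java UDF packaged as our-udfs.jar
    ADD JAR /path/to/our-udfs.jar;
    CREATE TEMPORARY FUNCTION parse_ua AS 'com.example.hive.udf.ParseUserAgent';

    SELECT parse_ua(request), COUNT(1)
    FROM weblogs
    WHERE dt = '2010-01-01'
    GROUP BY parse_ua(request);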
Thus we also need to write raw MapReduce code, which brings us up against
another problem: since one part is a set of SQL scripts and the other is
pieces of Java or hybrid code, how do we coordinate Hive and raw MapReduce
code, and how do we schedule them? How does Facebook use Hive? And what is
your solution when you come across similar problems?
In the end, we are considering using Hive as our data warehouse.
Any suggestions?
Thanks in advance!
Min
--
My research interests are distributed systems, parallel computing and
bytecode-based virtual machines.
http://coderplay.javaeye.com