Hi Joydeep,

Thanks for your reply, which helps me a lot. If I go with your first suggestion, how should I schedule my MapReduce code and my Hive SQL code? Because our jobs are batch processing, we often come across a scenario like this: first we run a raw MapReduce job, then a second Hive job takes its output as input, and finally a third MapReduce job handles the remaining work. What is your solution then?
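For concreteness, this is roughly how I picture driving such a three-step flow from a single Java main today, with the Hive step shelled out to the CLI in the middle. Every class name, path, and script name below is made up for illustration:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // A hand-rolled driver for the MR -> Hive -> MR pipeline.
    public class PipelineDriver {
      public static void main(String[] args) throws Exception {
        // Step 1: raw MapReduce job over the day's logs.
        JobConf step1 = new JobConf(PipelineDriver.class);
        step1.setJobName("step1-raw-mr");
        step1.setOutputKeyClass(Text.class);
        step1.setOutputValueClass(Text.class);
        // step1.setMapperClass(...) / step1.setReducerClass(...) go here.
        FileInputFormat.setInputPaths(step1, new Path("/logs/raw/2009-02-23"));
        FileOutputFormat.setOutputPath(step1, new Path("/tmp/step1-out"));
        JobClient.runJob(step1); // blocks until step 1 finishes

        // Step 2: Hive job that reads step 1's output (exposed as an
        // external table) and writes its own result table.
        Process hive = new ProcessBuilder("hive", "-f", "step2.hql").start();
        if (hive.waitFor() != 0) {
          throw new RuntimeException("Hive step failed");
        }

        // Step 3: final raw MapReduce job over the Hive output.
        JobConf step3 = new JobConf(PipelineDriver.class);
        step3.setJobName("step3-raw-mr");
        FileInputFormat.setInputPaths(step3,
            new Path("/user/hive/warehouse/step2_out"));
        FileOutputFormat.setOutputPath(step3, new Path("/logs/final"));
        JobClient.runJob(step3);
      }
    }

Hand-maintaining one such driver per flow is exactly the kind of duplicated plumbing I would like to get rid of, so I am curious how you schedule these steps at Facebook.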
BTW, does Hive run only as a Thrift service at Facebook? (At the end of this mail I sketched how I imagine talking to such a server from Java.)

On Mon, Feb 23, 2009 at 12:23 PM, Joydeep Sen Sarma <[email protected]> wrote:

> Hi Min,
>
> One possibility is to have your data sets stored in Hive, but for your
> map-reduce programs, use the Hive Java APIs (to find input files for a
> table, to extract rows from a table, etc.). That way at least the metadata
> about all data is standardized in Hive. If you want to go down this route,
> we can write up an example use of these APIs (which admittedly are not
> well documented).
>
> The other option is for Hive to allow application code to take over after
> a small SQL fragment (selects and a where clause, perhaps with some UDFs).
> As a crude example:
>
>     <some-harness> -hiveinput "select a.x, a.y+a.z from a where
>     a.country='HK'" -mapper <urmapper> -reducer <urreducer> and so on.
>
> Before we released Hive in open source, we had something like this
> available inside Facebook for the prototype version of Hive, and I think
> it would be fairly easy to resurrect it. For some reason we haven't had
> much reason to use it internally. Do you think this would be useful, and
> is it what you are looking for?
>
> BTW, the example you mention (secondary sort) is supported by Hive
> (distribute by … sort by …), where the partitioning and sorting keys are
> different. There may be some inefficiencies in the Hive implementation
> compared to hand-written code (we might have to duplicate data in keys
> and values for this). Also, we haven't allowed aggregate functions on
> this stream yet (this is something we want to do in the future).
>
> Joydeep
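To make sure I understand the first option: is the idea something like the sketch below? This is only my guess at the metadata API, since as you say it is not well documented, and the database and table names are made up.

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Table;

    // My guess at option 1: ask the metastore where a table's data lives,
    // then point a raw MapReduce job at that HDFS location.
    public class TableLocator {
      public static void main(String[] args) throws Exception {
        HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
        Table t = client.getTable("default", "web_logs"); // names are illustrative
        String inputDir = t.getSd().getLocation();        // HDFS path of the table's data
        System.out.println("MR input path: " + inputDir);
        client.close();
      }
    }

If that is roughly right, a written-up example of these APIs would be very welcome.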
> ------------------------------
>
> From: Min Zhou [mailto:[email protected]]
> Sent: Sunday, February 22, 2009 8:00 PM
> To: [email protected]
> Subject: Re: How to simplify our development flow under the means of
> using Hive?
>
> Hi Prasad,
>
> That is just streaming, a trick to complement the expressive power of
> Hive SQL, and sometimes even this trick is useless. For example, if I
> want to do jobs like secondary sort, will that approach work?
> My major intention is to find out how to schedule those two things, Hive
> and raw MapReduce.
>
> On Mon, Feb 23, 2009 at 11:47 AM, Prasad Chakka <[email protected]>
> wrote:
>
> You can use custom mapper and reducer scripts via the TRANSFORM/MAP/REDUCE
> facilities. Check the wiki on how to use them. Or do you want something
> different?
>
> ------------------------------
>
> From: Min Zhou <[email protected]>
> Reply-To: <[email protected]>
> Date: Sun, 22 Feb 2009 19:42:50 -0800
> To: <[email protected]>
> Subject: How to simplify our development flow under the means of using
> Hive?
>
> Hi list,
>
> I'm going to take Hive into production to analyze our web logs, which
> are hundreds of gigabytes per day. Previously we did this job with Apache
> Hadoop, running our raw MapReduce code. It worked, but it also directly
> hurt our productivity: we suffered from writing code with similar logic
> again and again, and it got even worse whenever the format of our logs
> changed. For example, if we want to insert one more field into each line
> of the log, the previous work becomes useless and we have to redo it.
> Hence we are thinking about using Hive as a persistence layer, to store
> and retrieve the schemas of the data easily. But we found that Hive
> sometimes cannot do certain kinds of complex analysis, because of the
> limited expressive power of SQL. We have to write our own UDFs, and even
> then there are difficulties Hive still cannot get through, so we also
> need to write raw MapReduce code, which brings us up against another
> problem: one part is a set of SQL scripts, the other is pieces of Java or
> hybrid code. How do we coordinate Hive and raw MapReduce code, and how do
> we schedule them? How does Facebook use Hive, and what is your solution
> when you come across similar problems?
>
> In the end, we are considering using Hive as our data warehouse. Any
> suggestions?
>
> Thanks in advance!
> Min
>
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> http://coderplay.javaeye.com
>
> Regards,
> Min
>
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> http://coderplay.javaeye.com

--
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.

http://coderplay.javaeye.com
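P.S. Regarding my Thrift question above: here is how I imagine submitting a Hive step from Java through the JDBC driver against a running HiveServer, including the distribute by / sort by form you describe for secondary sort. The host, table, and column names are all made up, and I have not tried this against a real server yet:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // My guess at driving a Hive step over the Thrift/JDBC interface
    // instead of shelling out to the CLI; all names are illustrative.
    public class HiveJdbcStep {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
            "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();
        // Secondary sort as you describe it: partition rows on one key
        // (country) but sort within each partition on another (ts).
        ResultSet rs = stmt.executeQuery(
            "SELECT country, ts, url FROM web_logs "
            + "DISTRIBUTE BY country SORT BY country, ts");
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getString(2));
        }
        conn.close();
      }
    }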
