Theoretically, the streaming facility is as powerful as raw Java M/R programs: less efficient, but easier to use (arguably). It does support secondary sort (DISTRIBUTE BY ... SORT BY ...). Though I do agree with you that not everything can be easily expressed in SQL, that is exactly why the TRANSFORM/streaming facility is included in Hive.
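Something along these lines, as a sketch only (the table web_logs, its
columns, and my_reducer.py are made-up names, not from your setup):

  ADD FILE my_reducer.py;

  FROM (
    SELECT user_id, ts, url
    FROM web_logs
    DISTRIBUTE BY user_id  -- every row for a user reaches the same reducer
    SORT BY user_id, ts    -- and arrives there ordered by timestamp
  ) t
  INSERT OVERWRITE TABLE sessions
  SELECT TRANSFORM (user_id, ts, url)
         USING 'python my_reducer.py'
         AS (user_id, session_id, url);

The script just reads tab-separated rows on stdin and writes tab-separated
rows to stdout, the same contract as a Hadoop streaming reducer.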
Regarding the scheduling part, my guess is that it is beyond the scope of Hive; can you just use a shell script? (A small sketch follows below the quoted thread.)

My 2c,
Qing

On Mon, Feb 23, 2009 at 11:59 AM, Min Zhou <[email protected]> wrote:
> Hi Prasad,
>
> This is just streaming, a sort of technique to complement the abilities
> of Hive SQL. Sometimes this trick is also useless. For example, if I
> want to do jobs like secondary sort, can that way be okay?
> My main question is how to schedule those two things, Hive and raw
> MapReduce.
>
> On Mon, Feb 23, 2009 at 11:47 AM, Prasad Chakka <[email protected]> wrote:
>
>> You can use custom mapper and reducer scripts using the
>> TRANSFORM/MAP/REDUCE facilities. Check the wiki on how to use them.
>> Or do you want something different?
>>
>> ------------------------------
>> From: Min Zhou <[email protected]>
>> Reply-To: <[email protected]>
>> Date: Sun, 22 Feb 2009 19:42:50 -0800
>> To: <[email protected]>
>> Subject: How to simplify our development flow under the means of using
>> Hive?
>>
>> Hi list,
>>
>> I'm going to take Hive into production to analyze our web logs, which
>> are hundreds of gigabytes per day. Previously, we did this job with
>> Apache Hadoop, running our raw MapReduce code. It did work, but it also
>> directly decreased our productivity. We were suffering from writing
>> code with similar logic again and again. It gets worse whenever the
>> format of our logs changes. For example, when we want to insert one
>> more field into each line of the log, the previous work becomes useless
>> and we have to redo it. Hence we are thinking about using Hive as a
>> persistence layer, to store and retrieve the schemas of the data
>> easily. But we found that sometimes Hive could not do certain kinds of
>> complex analysis, because of the limited expressiveness of SQL. We have
>> to write our own UDFs, and even then there are some difficulties Hive
>> still cannot get through. Thus we also need to write raw MapReduce
>> code, which brings us up against another problem. Since one is a set of
>> SQL scripts and the other is pieces of Java or hybrid code, how do we
>> coordinate Hive and raw MapReduce code, and how do we schedule them?
>> How does Facebook use Hive? And what is your solution when you come
>> across similar problems?
>>
>> In the end, we are considering using Hive as our data warehouse.
>> Any suggestions?
>>
>> Thanks in advance!
>> Min
>>
>> --
>> My research interests are distributed systems, parallel computing and
>> bytecode-based virtual machines.
>>
>> http://coderplay.javaeye.com
>>
>
> Regards,
>
> Min
> --
> My research interests are distributed systems, parallel computing and
> bytecode-based virtual machines.
>
> http://coderplay.javaeye.com
>
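P.S. A minimal sketch of the shell-script scheduling I had in mind; the
jar, class, table names, and paths below are all made up:

  #!/bin/sh
  set -e   # stop the chain as soon as any step fails

  # step 1: the SQL-friendly part runs in Hive and lands in a staging dir
  hive -e "INSERT OVERWRITE DIRECTORY '/tmp/clicks_staged'
           SELECT user_id, ts, url FROM web_logs WHERE dt = '2009-02-23'"

  # step 2: the part SQL cannot express runs as a raw MapReduce job
  hadoop jar session_etl.jar com.example.SessionJob \
      /tmp/clicks_staged /tmp/sessions_out

  # step 3: pull the result back into Hive for further querying
  hive -e "LOAD DATA INPATH '/tmp/sessions_out' INTO TABLE sessions"

Cron (or whatever scheduler you already run) can then drive the script;
nothing Hive-specific is needed.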
