Hi Joydeep,

What drives your batch-processing jobs? The arrival of new data, a crontab entry, or a shell script?
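For reference, the pipeline Joydeep describes below can be driven end to end by a single shell script, which a crontab entry can then invoke on whatever schedule the data arrives. This is only a sketch: all paths, jar and class names, and the table layout are hypothetical placeholders, not anything from Hive itself.

#!/bin/bash
# Drive the three-step pipeline described below: raw MR job -> Hive
# query over an external table -> raw MR job on the Hive output.
# A crontab entry can invoke this script for scheduling.
set -e

RAW_OUT=/user/min/pipeline/step1_out    # output of the first raw MR job
HIVE_OUT=/user/min/pipeline/step2_out   # output of the Hive step

# Step 1: the first raw map-reduce job writes to an HDFS directory.
hadoop jar step1.jar com.example.Step1 /user/min/logs/latest "$RAW_OUT"

# Step 2: overlay an external table on that directory, run the Hive
# query, and write the result rows to another HDFS directory.
cat > /tmp/step2.q <<EOF
CREATE EXTERNAL TABLE tmp_step1 (uid STRING, url STRING, hits INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '$RAW_OUT';

INSERT OVERWRITE DIRECTORY '$HIVE_OUT'
SELECT uid, SUM(hits) FROM tmp_step1 GROUP BY uid;

-- dropping an external table removes only the metadata, not the data
DROP TABLE tmp_step1;
EOF
hive -f /tmp/step2.q

# Step 3: the last raw map-reduce job reads the Hive output directory.
hadoop jar step3.jar com.example.Step3 "$HIVE_OUT" /user/min/pipeline/final

Because CREATE EXTERNAL TABLE only overlays metadata on files that already exist, dropping the temporary table at the end leaves the step-1 output untouched.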
On Tue, Feb 24, 2009 at 9:56 AM, Joydeep Sen Sarma <[email protected]> wrote:

> The scenario below is pretty simple actually.
>
> You can make the first job write to an HDFS directory. Then create a
> temporary table (using CREATE EXTERNAL TABLE) over this directory. Run the
> Hive SQL over it and write the results to another HDFS directory (using
> INSERT OVERWRITE DIRECTORY). And then run your last map-reduce job on that
> directory.
>
> So this doesn't require the use of any Java APIs.
>
> That said – we can write up a simple demo program and build environment for
> writing map-reduce on Hive tables. The javadocs may need a lot of work
> though...
>
> ------------------------------
> From: Min Zhou [mailto:[email protected]]
> Sent: Monday, February 23, 2009 5:43 PM
> To: [email protected]
> Subject: Re: How to simplify our development flow under the means of using Hive?
>
> Hi Joydeep,
>
> Thanks for your reply, which helped me a lot. If I choose your first
> suggestion, how can I schedule my map-reduce code and my Hive SQL code?
> Because our jobs are batch-processing, we often come across a scenario
> like this: first we run a raw map-reduce job, then a second, Hive-based
> job takes its output as input, and finally a third map-reduce job handles
> the remaining work. What is your solution then?
>
> BTW, is Hive run only as a Thrift service at Facebook?
>
> On Mon, Feb 23, 2009 at 12:23 PM, Joydeep Sen Sarma <[email protected]> wrote:
>
> Hi Min,
>
> One possibility is to have your data sets stored in Hive – but to have
> your map-reduce programs use the Hive Java APIs (to find input files for a
> table, to extract rows from a table, etc.). That way at least the metadata
> about all data is standardized in Hive. If you want to go down this route –
> we can write up an example use of these APIs (which admittedly are not
> well documented).
>
> The other option is for Hive to allow application code to take over after
> a small SQL fragment (selects and where clauses – perhaps with some UDFs).
> As a crude example:
>
>     <some-harness> -hiveinput "select a.x, a.y+a.z from a where a.country='HK'" -mapper <urmapper> -reducer <urreducer>
>
> and so on.
>
> Before we released Hive in open source – we had something like this
> available inside Facebook for the prototype version of Hive – and I think
> it would be fairly easy to resurrect it. For some reason we haven't had
> much cause to use it internally. Do you think this would be useful, and is
> it what you are looking for?
>
> BTW – the example you mention (secondary sort) is supported by Hive
> (DISTRIBUTE BY ... SORT BY ...), where the partitioning and sorting keys
> are different. There may be some inefficiencies in the Hive implementation
> compared to hand-written code (we might have to duplicate data in keys and
> values for this). Also, we haven't allowed aggregate functions on this
> stream yet (this is something we want to do in the future).
>
> Joydeep
>
> ------------------------------
> From: Min Zhou [mailto:[email protected]]
> Sent: Sunday, February 22, 2009 8:00 PM
> To: [email protected]
> Subject: Re: How to simplify our development flow under the means of using Hive?
>
> Hi Prasad,
>
> This is just streaming, a sort of technique to complement the expressive
> power of Hive SQL. Sometimes that trick is also useless. For example, if I
> want to do jobs like a secondary sort, will that approach work?
> My major intention is that I want to know how to schedule those two
> things, Hive and raw map-reduce.
>
> On Mon, Feb 23, 2009 at 11:47 AM, Prasad Chakka <[email protected]> wrote:
>
> You can use custom mapper and reducer scripts via the TRANSFORM/MAP/REDUCE
> facilities. Check the wiki on how to use them. Or do you want something
> different?
>
> ------------------------------
> From: Min Zhou <[email protected]>
> Reply-To: <[email protected]>
> Date: Sun, 22 Feb 2009 19:42:50 -0800
> To: <[email protected]>
> Subject: How to simplify our development flow under the means of using Hive?
>
> Hi list,
>
> I'm going to take Hive into production to analyze our web logs, which are
> hundreds of gigabytes per day. Previously we did this job with Apache
> Hadoop, running our raw map-reduce code. It worked, but it also directly
> decreased our productivity: we were suffering from writing code with
> similar logic again and again. It got even worse whenever the format of
> our logs changed. For example, when we want to insert one more field into
> each line of the log, the previous work becomes useless and we have to
> redo it. Hence we are thinking about using Hive as a persistence layer, to
> store and retrieve the schemas of the data easily. But we found that
> sometimes Hive cannot do certain kinds of complex analysis because of the
> limited expressive power of SQL. We can write our own UDFs, but even then
> there are some difficulties Hive cannot get past. Thus we also need to
> write raw map-reduce code, which brings us up against another problem:
> since one is a set of SQL scripts and the other is pieces of Java or
> hybrid code, how do we coordinate Hive and raw map-reduce code, and how do
> we schedule them? How does Facebook use Hive? And what is your solution
> when you come across similar problems?
>
> In the end, we are considering using Hive as our data warehouse. Any
> suggestions?
>
> Thanks in advance!
> Min
>
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> http://coderplay.javaeye.com
>
> Regards,
> Min

Thanks,
Min

--
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.

http://coderplay.javaeye.com
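Putting Prasad's TRANSFORM/MAP/REDUCE suggestion together with the DISTRIBUTE BY ... SORT BY ... form Joydeep mentions gives a streaming-style secondary sort. A minimal sketch follows; the web_logs and sessions tables, their columns, and my_reducer.py are hypothetical placeholders.

#!/bin/bash
# Secondary sort with a custom reduce script: rows are partitioned
# across reducers by uid (DISTRIBUTE BY) but sorted within each reducer
# by (uid, ts) (SORT BY), so the sort key differs from the partition
# key. Assumes the target table `sessions` already exists.
cat > /tmp/secondary_sort.q <<'EOF'
ADD FILE my_reducer.py;

FROM (
  SELECT uid, ts, url
  FROM web_logs
  DISTRIBUTE BY uid
  SORT BY uid, ts
) logs
INSERT OVERWRITE TABLE sessions
REDUCE logs.uid, logs.ts, logs.url
  USING 'python my_reducer.py'
  AS uid, session_id;
EOF
hive -f /tmp/secondary_sort.q

Each reducer then sees all rows for a given uid in ts order, which is the same contract a hand-written secondary-sort job provides, modulo the key/value duplication Joydeep notes.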
