Hi Joydeep,

What drives your batch-processing jobs? The arrival of new data, a crontab entry, or a shell script?
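For reference, the pipeline Joydeep describes below can be driven end to end by a single shell script, which a crontab entry can then invoke on whatever schedule the data arrives. This is only a sketch: all paths, jar and class names, and the table layout are hypothetical placeholders, not anything from Hive itself.

#!/bin/bash
# Drive the three-step pipeline described below: raw MR job -> Hive
# query over an external table -> raw MR job on the Hive output.
# A crontab entry can invoke this script for scheduling.
set -e

RAW_OUT=/user/min/pipeline/step1_out    # output of the first raw MR job
HIVE_OUT=/user/min/pipeline/step2_out   # output of the Hive step

# Step 1: the first raw map-reduce job writes to an HDFS directory.
hadoop jar step1.jar com.example.Step1 /user/min/logs/latest "$RAW_OUT"

# Step 2: overlay an external table on that directory, run the Hive
# query, and write the result rows to another HDFS directory.
cat > /tmp/step2.q <<EOF
CREATE EXTERNAL TABLE tmp_step1 (uid STRING, url STRING, hits INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '$RAW_OUT';

INSERT OVERWRITE DIRECTORY '$HIVE_OUT'
SELECT uid, SUM(hits) FROM tmp_step1 GROUP BY uid;

-- dropping an external table removes only the metadata, not the data
DROP TABLE tmp_step1;
EOF
hive -f /tmp/step2.q

# Step 3: the last raw map-reduce job reads the Hive output directory.
hadoop jar step3.jar com.example.Step3 "$HIVE_OUT" /user/min/pipeline/final

Because CREATE EXTERNAL TABLE only overlays metadata on files that already exist, dropping the temporary table at the end leaves the step-1 output untouched.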
On Tue, Feb 24, 2009 at 9:56 AM, Joydeep Sen Sarma <[email protected]> wrote:

> The scenario below is pretty simple actually.
>
> You can make the first job write to an HDFS directory. Then create a
> temporary table (using CREATE EXTERNAL TABLE) over this directory. Run the
> Hive SQL over it and write the results to another HDFS directory (using
> INSERT OVERWRITE DIRECTORY). And then run your last map-reduce job on that
> directory.
>
> So this doesn't require the use of any Java APIs.
>
> That said – we can write up a simple demo program and build environment for
> writing map-reduce on Hive tables. The javadocs may need a lot of work
> though...
>
> ------------------------------
> From: Min Zhou [mailto:[email protected]]
> Sent: Monday, February 23, 2009 5:43 PM
> To: [email protected]
> Subject: Re: How to simplify our development flow under the means of using Hive?
>
> Hi Joydeep,
>
> Thanks for your reply, which helped me a lot. If I choose your first
> suggestion, how can I schedule my map-reduce code and my Hive SQL code?
> Because our jobs are batch-processing, we often come across a scenario
> like this: first we run a raw map-reduce job, then a second, Hive-based
> job takes its output as input, and finally a third map-reduce job handles
> the remaining work. What is your solution then?
>
> BTW, is Hive run only as a Thrift service at Facebook?
>
> On Mon, Feb 23, 2009 at 12:23 PM, Joydeep Sen Sarma <[email protected]> wrote:
>
> Hi Min,
>
> One possibility is to have your data sets stored in Hive – but to have
> your map-reduce programs use the Hive Java APIs (to find input files for a
> table, to extract rows from a table, etc.). That way at least the metadata
> about all data is standardized in Hive. If you want to go down this route –
> we can write up an example use of these APIs (which admittedly are not
> well documented).
>
> The other option is for Hive to allow application code to take over after
> a small SQL fragment (selects and where clauses – perhaps with some UDFs).
> As a crude example:
>
>     <some-harness> -hiveinput "select a.x, a.y+a.z from a where a.country='HK'" -mapper <urmapper> -reducer <urreducer>
>
> and so on.
>
> Before we released Hive in open source – we had something like this
> available inside Facebook for the prototype version of Hive – and I think
> it would be fairly easy to resurrect it. For some reason we haven't had
> much cause to use it internally. Do you think this would be useful, and is
> it what you are looking for?
>
> BTW – the example you mention (secondary sort) is supported by Hive
> (DISTRIBUTE BY ... SORT BY ...), where the partitioning and sorting keys
> are different. There may be some inefficiencies in the Hive implementation
> compared to hand-written code (we might have to duplicate data in keys and
> values for this). Also, we haven't allowed aggregate functions on this
> stream yet (this is something we want to do in the future).
>
> Joydeep
>
> ------------------------------
> From: Min Zhou [mailto:[email protected]]
> Sent: Sunday, February 22, 2009 8:00 PM
> To: [email protected]
> Subject: Re: How to simplify our development flow under the means of using Hive?
>
> Hi Prasad,
>
> This is just streaming, a sort of technique to complement the expressive
> power of Hive SQL. Sometimes that trick is also useless. For example, if I
> want to do jobs like a secondary sort, will that approach work?
> My major intention is that I want to know how to schedule those two
> things, Hive and raw map-reduce.
>
> On Mon, Feb 23, 2009 at 11:47 AM, Prasad Chakka <[email protected]> wrote:
>
> You can use custom mapper and reducer scripts via the TRANSFORM/MAP/REDUCE
> facilities. Check the wiki on how to use them. Or do you want something
> different?
>
> ------------------------------
> From: Min Zhou <[email protected]>
> Reply-To: <[email protected]>
> Date: Sun, 22 Feb 2009 19:42:50 -0800
> To: <[email protected]>
> Subject: How to simplify our development flow under the means of using Hive?
>
> Hi list,
>
> I'm going to take Hive into production to analyze our web logs, which are
> hundreds of gigabytes per day. Previously we did this job with Apache
> Hadoop, running our raw map-reduce code. It worked, but it also directly
> decreased our productivity: we were suffering from writing code with
> similar logic again and again. It got even worse whenever the format of
> our logs changed. For example, when we want to insert one more field into
> each line of the log, the previous work becomes useless and we have to
> redo it. Hence we are thinking about using Hive as a persistence layer, to
> store and retrieve the schemas of the data easily. But we found that
> sometimes Hive cannot do certain kinds of complex analysis because of the
> limited expressive power of SQL. We can write our own UDFs, but even then
> there are some difficulties Hive cannot get past. Thus we also need to
> write raw map-reduce code, which brings us up against another problem:
> since one is a set of SQL scripts and the other is pieces of Java or
> hybrid code, how do we coordinate Hive and raw map-reduce code, and how do
> we schedule them? How does Facebook use Hive? And what is your solution
> when you come across similar problems?
>
> In the end, we are considering using Hive as our data warehouse. Any
> suggestions?
>
> Thanks in advance!
> Min
>
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> http://coderplay.javaeye.com
>
> Regards,
> Min

Thanks,
Min

--
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.

http://coderplay.javaeye.com
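Putting Prasad's TRANSFORM/MAP/REDUCE suggestion together with the DISTRIBUTE BY ... SORT BY ... form Joydeep mentions gives a streaming-style secondary sort. A minimal sketch follows; the web_logs and sessions tables, their columns, and my_reducer.py are hypothetical placeholders.

#!/bin/bash
# Secondary sort with a custom reduce script: rows are partitioned
# across reducers by uid (DISTRIBUTE BY) but sorted within each reducer
# by (uid, ts) (SORT BY), so the sort key differs from the partition
# key. Assumes the target table `sessions` already exists.
cat > /tmp/secondary_sort.q <<'EOF'
ADD FILE my_reducer.py;

FROM (
  SELECT uid, ts, url
  FROM web_logs
  DISTRIBUTE BY uid
  SORT BY uid, ts
) logs
INSERT OVERWRITE TABLE sessions
REDUCE logs.uid, logs.ts, logs.url
  USING 'python my_reducer.py'
  AS uid, session_id;
EOF
hive -f /tmp/secondary_sort.q

Each reducer then sees all rows for a given uid in ts order, which is the same contract a hand-written secondary-sort job provides, modulo the key/value duplication Joydeep notes.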
