Re: Hive and thrift session help

Edward Capriolo Tue, 08 Sep 2009 16:37:45 -0700

On Tue, Sep 8, 2009 at 7:02 PM, Edward Capriolo<[email protected]> wrote:
> On Tue, Sep 8, 2009 at 6:37 PM, Vijay<[email protected]> wrote:
>> I get that HWI does manage sessions but it does that leveraging the internal
>> functionality of the "server." One usage pattern I'd like is some kind of a
>> "job" API. What I mean by that is an API that lets us simply submit a query,
>> get some kind of "job id," and leave. After that we use other APIs to query
>> the job status, kill it, get the output once it is done, etc. If we have a
>> simple API like this and the semantics to support this within hive, then the
>> UI can be completely decoupled and be as stateless as it can (using vanilla
>> apache+php as an example, we can't really do threads or stay resident after
>> submitting a job). Does something like this exist either within hive or at
>> the hadoop level? It seems to me that maybe this is something that needs to be
>> built first.
>>
>> Thanks,
>> Vijay
>>
>> On Tue, Sep 8, 2009 at 2:52 PM, Edward Capriolo <[email protected]>
>> wrote:
>>>
>>> On Tue, Sep 8, 2009 at 5:15 PM, Royce
>>> Rollins<[email protected]> wrote:
>>> > OK I see. I just looked at the code in HWISessionManager.java.  So it
>>> > looks
>>> > like either I will have to write my own ruby HWISessionManager that
>>> > manages
>>> > sessions through thrift or expose the existing HWISessionManager via some
>>> > web
>>> > service interface.  Has anyone done this?
>>> >
>>> > Royce
>>> >
>>> >
>>> > On 9/8/09 1:47 PM, "Edward Capriolo" <[email protected]> wrote:
>>> >
>>> >> On Tue, Sep 8, 2009 at 4:38 PM, Vijay<[email protected]> wrote:
>>> >>> Sorry to inject into this thread but I have the same problem (only I'm
>>> >>> trying to use the thrift PHP libraries from apache-php scripts). The
>>> >>> problem
>>> >>> with this approach is that the http request cannot run indefinitely as
>>> >>> the
>>> >>> server is executing a query. Are there any solutions for this?
>>> >>>
>>> >>> Thanks,
>>> >>> Vijay
>>> >>>
>>> >>> On Tue, Sep 8, 2009 at 1:35 PM, Royce Rollins
>>> >>> <[email protected]>
>>> >>> wrote:
>>> >>>>
>>> >>>> Raghu,
>>> >>>> Thanks for the quick response.
>>> >>>> Yes.  My application is web based so instead of having to build some
>>> >>>> kind
>>> >>>> of
>>> >>>> session model myself for queries that might take a while,  I'd like
>>> >>>> to use
>>> >>>> a session model in the hive service.
>>> >>>>
>>> >>>> Royce
>>> >>>>
>>> >>>>
>>> >>>> On 9/8/09 1:32 PM, "Raghu Murthy" <[email protected]> wrote:
>>> >>>>
>>> >>>>> Our model so far has been to create a new connection to the hive
>>> >>>>> thrift
>>> >>>>> server per session. Is there anything specific you are looking for
>>> >>>>> in
>>> >>>>> sessions?
>>> >>>>>
>>> >>>>>
>>> >>>>> On 9/8/09 1:06 PM, "Royce Rollins" <[email protected]>
>>> >>>>> wrote:
>>> >>>>>
>>> >>>>>> I'm currently working on an application that connects to hive via
>>> >>>>>> the
>>> >>>>>> thrift
>>> >>>>>> ruby libraries.
>>> >>>>>>
>>> >>>>>> Does hive support creation of sessions using those libraries?  If
>>> >>>>>> so,
>>> >>>>>> how?
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Royce
>>> >>>>>
>>> >>>>
>>> >>>
>>> >>>
>>> >>
>>> >> Royce,
>>> >>
>>> >> The Hive Web Interface deals with this by having a threaded object
>>> >> (HWISessionManager) in the Web application scope. I am not sure if PHP
>>> >> has any equivalent to threading and Application Scope.
>>> >>
>>> >> Edward
>>> >
>>> >
>>>
>>> Someone correct me if I am wrong.
>>>
>>> Royce,
>>>
>>> You may be able to get at this another way. From my understanding, the
>>> internal hive web interface used at facebook would spawn ` bin/hive -e
>>> 'INSERT INTO X select * FROM`. All results were written to a hive
>>> table.
>>>
>>> Doing it this way gives you no way to interact with the query and
>>> 'stream' the result, so you can't really use 'fetchOne()' or
>>> 'fetchAll()', but you could start a query and set flags on completion.
>>>
>>> As for web interface, we just had some talks, and one of the things I
>>> was looking to do was create some type of web service style bindings.
>>> (We would also like to have HWI talk to Thrift and have thrift be the
>>> code path for everything). However, if we do make some web service
>>> style bindings they would really be independent of the back end. Do
>>> you want to work on this ? I would like to open a Jira and tackle the
>>> issue.
>>>
>>>
>>> The big picture here is that we need a 'state holder'. That is really
>>> what HWI is. You create a session, detach from it, and optionally
>>> check on it later. If an application needs that pattern, how should
>>> it be handled?
>>>
>>> One way to tackle this is
>>>
>>> bin/hive -e "INSERT OVERWRITE DIRECTORY 'hdfs://path/to/file' SELECT * FROM XXX" &
>>>
>>> then have your client 'tail' the hdfs://path/to/file or record the
>>> last position it saw. I guess the big question is dealing with
>>> streaming results. HWI manages the session for you and writes the
>>> results to a local file, (and the new SessionBucket
>>>
>>> What is the usage pattern you need?
>>
>>
>
> Vijay,
>
>> What I mean by that is an API that lets us simply submit a query,
>> get some kind of "job id," and leave.
>
> No. (Again, someone correct me if I am wrong.) As I understand it, if
> you disconnect from the Thrift HiveServer you cannot reconnect.
>
> Assuming we punt on intermediate data (large queries with 10 TB of
> results waiting for client pickup), there are a few ways we (you)
> could handle this.
>
> You could use HWI as a web service, with some URL hacking like
> http://hwi:9999/hwi/create_session.jsp?name=bob
>
> This is not a true XML web service, but you could use it to accomplish
> your goals.
>
>> After that we use other APIs to query
>> the job status, kill it, get the output once it is done, etc
>
> We could write some other XMLRPC style JSP pages that would be a more
> formal web service.
>
> The Hive Thrift Server could maybe support this directly, with
> alternate constructors or objects for detached sessions.
>
> In summary:
> option 1) URL hacking (you have that today, not very clean)
> option 2) web service bindings (you could have that pretty fast,
> cleaner, does not have to touch anything upstream)
> option 3) detached sessions in HiveServer (patched HiveServer, patched
> Hive bindings, clean)
>


It is ironic that you can have multiple 'hive -e' processes running on
the same server without trouble, while multiple sessions inside one JVM
have had subtle issues with thread locals and static variables.
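
To make the process-per-query idea concrete, here is a rough sketch of a
"job id" wrapper around 'hive -e'. Everything in it is hypothetical: the
HiveJobRunner class, its method names, and the assumption that bin/hive is
reachable from the working directory are mine, not anything that exists in
Hive. It only shows submit / status / kill, not result pickup.

```python
import subprocess
import sys
import uuid


class HiveJobRunner:
    """Hypothetical state holder: one OS process per query, keyed by job id."""

    def __init__(self, build_command=None):
        # Assumption: bin/hive is reachable from the current directory.
        # build_command can be swapped out, e.g. for testing without Hive.
        self._build_command = build_command or (lambda q: ["bin/hive", "-e", q])
        self._jobs = {}

    def submit(self, query):
        """Start the query in its own process and return a job id at once."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = subprocess.Popen(
            self._build_command(query),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return job_id

    def status(self, job_id):
        """Report RUNNING, DONE (exit code 0) or FAILED (anything else)."""
        code = self._jobs[job_id].poll()
        if code is None:
            return "RUNNING"
        return "DONE" if code == 0 else "FAILED"

    def kill(self, job_id):
        """Terminate the process if it is still running, then reap it."""
        proc = self._jobs[job_id]
        if proc.poll() is None:
            proc.terminate()
        proc.wait()
```

Note that the runner itself has to live in a long-running process, which is
the 'state holder' problem again: a plain apache+php request could call
something like this but could not host it. A killed job shows up as FAILED
in this sketch.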

Both stateful applications (HWI, HiveServer) struggle a bit as the API
was designed around the CLI. It would be interesting if the CLI could
even connect to a HiveServer or run a local HiveServer.

I opened up this issue: Create a Hive CLI that connects to hive ThriftServer
https://issues.apache.org/jira/browse/HIVE-818
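
On the earlier idea of having the client 'tail' the result file and record
the last position it saw, a minimal sketch might look like the following.
It assumes the results are readable as an ordinary local file (reading
straight from HDFS would need an HDFS client, which is not shown), and
read_new_rows is a made-up helper name.

```python
def read_new_rows(path, offset):
    """Return (complete_rows, new_offset) for data appended since offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read()
    # Consume only up to the last newline, so a row that is still being
    # written is left in place and picked up by the next poll.
    end = chunk.rfind(b"\n") + 1
    rows = chunk[:end].decode("utf-8").splitlines()
    return rows, offset + end
```

A stateless client would store the returned offset (for example alongside
its job id) and pass it back on the next request.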
