On Tue, Sep 8, 2009 at 7:02 PM, Edward Capriolo<[email protected]> wrote:
> On Tue, Sep 8, 2009 at 6:37 PM, Vijay<[email protected]> wrote:
>> I get that HWI does manage sessions but it does that leveraging the internal
>> functionality of the "server." One usage pattern I'd like is some kind of a
>> "job" API. What I mean by that is an API that lets us simply submit a query,
>> get some kind of "job id," and leave. After that we use other APIs to query
>> the job status, kill it, get the output once it is done, etc. If we have a
>> simple API like this and the semantics to support this within hive, then the
>> UI can be completely decoupled and be as stateless as it can (using vanilla
>> apache+php as an example, we can't really do threads or stay resident after
>> submitting a job). Does something like this exist either within hive or at
>> the hadoop level? It seems to me maybe this is something that needs to be
>> built first.
>>
>> Thanks,
>> Vijay
>>
>> On Tue, Sep 8, 2009 at 2:52 PM, Edward Capriolo <[email protected]>
>> wrote:
>>>
>>> On Tue, Sep 8, 2009 at 5:15 PM, Royce
>>> Rollins<[email protected]> wrote:
>>> > OK I see. I just looked at the code in HWISessionManager.java. So it
>>> > looks like either I will have to write my own ruby HWISessionManager
>>> > that manages sessions through thrift or expose the existing
>>> > HWISessionManager via some web service interface. Has anyone done this?
>>> >
>>> > Royce
>>> >
>>> >
>>> > On 9/8/09 1:47 PM, "Edward Capriolo" <[email protected]> wrote:
>>> >
>>> >> On Tue, Sep 8, 2009 at 4:38 PM, Vijay<[email protected]> wrote:
>>> >>> Sorry to inject into this thread but I have the same problem (only I'm
>>> >>> trying to use the thrift PHP libraries from apache-php scripts). The
>>> >>> problem with this approach is that the http request cannot run
>>> >>> indefinitely as the server is executing a query. Are there any
>>> >>> solutions for this?
>>> >>>
>>> >>> Thanks,
>>> >>> Vijay
>>> >>>
>>> >>> On Tue, Sep 8, 2009 at 1:35 PM, Royce Rollins
>>> >>> <[email protected]> wrote:
>>> >>>>
>>> >>>> Raghu,
>>> >>>> Thanks for the quick response.
>>> >>>> Yes. My application is web based so instead of having to build some
>>> >>>> kind of session model myself for queries that might take a while, I'd
>>> >>>> like to use a session model in the hive service.
>>> >>>>
>>> >>>> Royce
>>> >>>>
>>> >>>>
>>> >>>> On 9/8/09 1:32 PM, "Raghu Murthy" <[email protected]> wrote:
>>> >>>>
>>> >>>>> Our model so far has been to create a new connection to the hive
>>> >>>>> thrift server per session. Is there anything specific you are looking
>>> >>>>> for in sessions?
>>> >>>>>
>>> >>>>>
>>> >>>>> On 9/8/09 1:06 PM, "Royce Rollins" <[email protected]>
>>> >>>>> wrote:
>>> >>>>>
>>> >>>>>> I'm currently working on an application that connects to hive via
>>> >>>>>> the thrift ruby libraries.
>>> >>>>>>
>>> >>>>>> Does hive support creation of sessions using those libraries? If
>>> >>>>>> so, how?
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Royce
>>> >>>>>
>>> >>>>
>>> >>>
>>> >>>
>>> >>
>>> >> Royce,
>>> >>
>>> >> The Hive Web Interface deals with this by having a threaded object
>>> >> (HWISessionManager) in the Web application scope. I am not sure if PHP
>>> >> has any equivalent to threading and Application Scope.
>>> >>
>>> >> Edward
>>> >
>>> >
>>>
>>> Someone correct me if I am wrong.
>>>
>>> Royce,
>>>
>>> You may be able to get at this another way. From my understanding, the
>>> internal hive web interface used at facebook would spawn ` bin/hive -e
>>> 'INSERT INTO X select * FROM`. All results were written to a hive
>>> table.
>>>
>>> Doing it this way gives you no way to interact with the query and
>>> 'stream' the result, so you can't really use 'fetchOne()' or
>>> 'fetchAll()' but you could start a query and set flags on completion.
>>>
>>> As for web interface, we just had some talks, and one of the things I
>>> was looking to do was create some type of web service style bindings.
>>> (We would also like to have HWI talk to Thrift and have thrift be the
>>> code path for everything). However, if we do make some web server
>>> style bindings they would really be independent of the back end. Do
>>> you want to work on this ? I would like to open a Jira and tackle the
>>> issue.
>>>
>>>
>>> The big picture here is that we need a 'state holder'. That is really
>>> what HWI is. You create a session, detach from it, and optionally
>>> check on it later. If an application needs that pattern how to handle
>>> it?
>>>
>>> One way to tackle this is
>>>
>>> INSERT INTO file 'hdfs://path/to/file' select * FROM XXX' &
>>>
>>> then have your client 'tail' the hdfs://path/to/file or record the
>>> last position it saw. I guess the big question is dealing with
>>> streaming results. HWI manages the session for you and writes the
>>> results to a local file, (and the new SessionBucket
>>>
>>> What is the usage pattern you need?
>>
>>
>
> Vijay,
>
>> What I mean by that is an API that lets us simply submit a query,
>> get some kind of "job id," and leave.
>
> No. (again someone correct me if I am wrong) As I understand it, if you
> disconnect from the Thrift HiveServer you can not reconnect.
>
> Assuming we punt on intermediate data (large queries with 10 TB of
> results waiting for client pickup). There are a few ways we (you)
> could handle this.
>
> You could use HWI as a web service. With some URL hacking like
> http://hwi:9999/hwi/create_session.jsp?name=bob
>
> This is not a true XML web service, but you could use it to accomplish
> your goals.
>
>> After that we use other APIs to query
>> the job status, kill it, get the output once it is done, etc
>
> We could write some other XMLRPC style JSP pages that would be a more
> formal web service.
>
> Hive Thrift Server could support this directly maybe with alternate
> constructors or objects for detached sessions.
>
> In summary
> option 1) URL hacking (you have that today, not very clean)
> option 2) web service bindings ( you could have that pretty fast, more
> clean does not have to touch anything upstream)
> option 3) detached sessions HiveServer ( patched HiveServer patched
> Hive Bindings, clean,)
>
It is ironic that you can run multiple 'hive -e' processes on the same
server, while running everything in one JVM has had subtle issues with
thread locals and static variables. Both stateful applications (HWI,
HiveServer) struggle a bit because the API was designed around the CLI.
It would be interesting if the CLI could connect to a HiveServer, or
run a local HiveServer.

I opened up this issue:
Create a Hive CLI that connects to hive ThriftServer
https://issues.apache.org/jira/browse/HIVE-818
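
For what it is worth, here is a rough sketch of the submit-and-leave pattern Vijay asked about, built on the "spawn hive -e" approach from earlier in the thread rather than on HiveServer. Nothing like this exists in Hive as far as I know. The function names (submit, status, output), the job directory layout, and the file names are all made up for illustration, and it assumes a POSIX shell is available. Job state lives on disk, so the caller does not need to stay resident between requests.

```python
import os
import shlex
import subprocess
import uuid


def _job_file(job_root, job_id, name):
    # Hypothetical layout: <job_root>/<job_id>/<name>
    return os.path.abspath(os.path.join(job_root, job_id, name))


def submit(command, job_root):
    """Start `command` detached and return a job id without waiting."""
    job_id = uuid.uuid4().hex
    os.makedirs(os.path.join(job_root, job_id))
    q = shlex.quote
    tmp = _job_file(job_root, job_id, "exit_code.tmp")
    # The wrapper shell captures output and records the exit code when the
    # command finishes. Writing to a temp file and renaming it means a
    # reader never sees a half-written exit code.
    script = "{cmd} > {out} 2> {err}; echo $? > {tmp}; mv {tmp} {done}".format(
        cmd=shlex.join(command),
        out=q(_job_file(job_root, job_id, "out.txt")),
        err=q(_job_file(job_root, job_id, "err.txt")),
        tmp=q(tmp),
        done=q(_job_file(job_root, job_id, "exit_code")),
    )
    # start_new_session detaches the job from the caller's session.
    subprocess.Popen(["sh", "-c", script],
                     stdin=subprocess.DEVNULL, start_new_session=True)
    return job_id


def status(job_id, job_root):
    """Return 'running', 'done', or 'failed' by looking only at the disk."""
    path = _job_file(job_root, job_id, "exit_code")
    if not os.path.exists(path):
        return "running"
    with open(path) as f:
        return "done" if f.read().strip() == "0" else "failed"


def output(job_id, job_root):
    """Return whatever the job has written to stdout so far."""
    with open(_job_file(job_root, job_id, "out.txt")) as f:
        return f.read()
```

A web page would call something like submit(["bin/hive", "-e", query], "/some/job/dir"), hand the job id back to the browser, and check status() on later requests. Kill is left out of the sketch, and it does nothing about the intermediate-data problem (large results waiting for pickup) mentioned above.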
