Oh cool, I will try that out.

Thanks,
Ashutosh
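The workaround discussed in the thread below — pulling the command string from the session and parsing it to decide whether the query is an insert — could be sketched roughly like this. This is a hypothetical helper, not part of Hive, and the regexes are illustrative rather than anything close to a real HiveQL parser (which is part of why the approach feels hacky):

```java
import java.util.regex.Pattern;

// Hypothetical helper: decide from the raw command string whether a
// query is an INSERT or a CTAS, so a PostExecute hook body can bail
// out early for everything else.
public class InsertQueryDetector {

    private static final Pattern INSERT = Pattern.compile(
            "^\\s*INSERT\\s+(OVERWRITE|INTO)\\b.*",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private static final Pattern CTAS = Pattern.compile(
            "^\\s*CREATE\\s+TABLE\\b.*\\bAS\\s+SELECT\\b.*",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns true if the command looks like an INSERT or a CTAS. */
    public static boolean isInsertLike(String cmd) {
        return INSERT.matcher(cmd).matches() || CTAS.matcher(cmd).matches();
    }
}
```

Anything beyond these simple shapes (multi-table inserts, leading comments, and so on) would slip through, which is why waiting for HIVE-1225 is the cleaner path.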
On Wed, May 26, 2010 at 18:46, Ashish Thusoo <[email protected]> wrote:
> Actually, if you want to do that, then I believe you can check in the post
> execute hook that you have a valid write entity of the type table or
> partition. You should have that only in the case of an INSERT or a CTAS.
>
> Ashish
>
> -----Original Message-----
> From: Ashutosh Chauhan [mailto:[email protected]]
> Sent: Wednesday, May 26, 2010 6:36 PM
> To: [email protected]
> Subject: Re: job level output committer in storage handler
>
> Thanks, everyone, for the replies. I think HIVE-1225 is really what I want.
> At this point I can implement PostExecute, since I need to call the hook
> only at the end of the query and not at the end of each job or task of the
> query. If I register it through hive-site.xml, then I guess it will get
> executed for every query, which is where the complication starts: I want to
> execute this hook only for insert queries, not for all queries. One
> workaround is to get the command string from the session and then parse it
> to find out whether it actually is an insert query, and only then execute
> the remainder of the code. But that looks hacky, so I look forward to
> HIVE-1225.
>
> Thanks,
> Ashutosh
>
> On Wed, May 26, 2010 at 10:35, John Sichi <[email protected]> wrote:
>> I think we'll need to extend the StorageHandler interface so that it can
>> participate in the commit semantics (separate from the handler-independent
>> hooks Ashish mentioned). That was the intention of this followup JIRA issue
>> I logged as part of the HBase integration work:
>>
>> https://issues.apache.org/jira/browse/HIVE-1225
>>
>> To add this one, we need to determine what information needs to be passed
>> along to the storage handler now (and how to make it easy to pass along
>> more information as needed without having to change the interface in the
>> future).
>>
>> JVS
>>
>> ________________________________________
>> From: Ning Zhang [[email protected]]
>> Sent: Wednesday, May 26, 2010 10:22 AM
>> To: [email protected]
>> Subject: Re: job level output committer in storage handler
>>
>> Hi Ashutosh,
>>
>> Hive doesn't use OutputCommitter explicitly because it handles commit and
>> abort by itself.
>>
>> If you are looking for a task-level committer, where you want to do
>> something after a task finishes successfully, you can take a look at
>> FileSinkOperator.closeOp(). It renames the temp file to the final file
>> name, which implements the commit semantics.
>>
>> If you are looking for a job-level committer, where you want to do
>> something after the job (including all tasks) finishes successfully, you
>> can take a look at the MoveTask implementation. A MoveTask is generated as
>> a follow-up task after an MR job for each INSERT OVERWRITE statement. It
>> moves the directory that contains the results from all finished tasks to
>> its destination path (e.g., a directory specified in the insert statement
>> or inferred from the table's storage location property). The MoveTask
>> implements the commit semantics of the whole job.
>>
>> Ning
>>
>> On May 26, 2010, at 9:16 AM, Ashutosh Chauhan wrote:
>>
>>> Hi Kortni,
>>>
>>> Thanks for your suggestion, but we can't use it in our setup. We are
>>> not spinning up Hive jobs in a separate process that we can monitor;
>>> rather, I want to get a handle on when the job finishes in my storage
>>> handler / SerDe.
>>>
>>> Ashutosh
>>>
>>> On Tue, May 25, 2010 at 12:25, Kortni Smith <[email protected]> wrote:
>>>> Hi Ashutosh,
>>>>
>>>> I'm not sure how to accomplish that on the Hive side of things, but in
>>>> case it helps, I am writing because it sounds like you want to know
>>>> when your job is done so you can update something externally, and my
>>>> company will also be implementing this in the near future.
>>>> Our plan is to have the process that kicks off our Hive jobs in the
>>>> cloud monitor each job's status periodically using Amazon's EMR Java
>>>> library, and when their state changes to complete, update our external
>>>> systems accordingly.
>>>>
>>>>
>>>> Kortni Smith | Software Developer
>>>> AbeBooks.com  Passion for books.
>>>>
>>>> [email protected]
>>>> phone: 250.412.3272 | fax: 250.475.6014
>>>>
>>>> Suite 500 - 655 Tyee Rd. Victoria, BC, Canada V9A 6X5
>>>>
>>>> www.abebooks.com | www.abebooks.co.uk | www.abebooks.de
>>>> www.abebooks.fr | www.abebooks.it | www.iberlibro.com
>>>>
>>>> -----Original Message-----
>>>> From: Ashutosh Chauhan [mailto:[email protected]]
>>>> Sent: Tuesday, May 25, 2010 12:13 PM
>>>> To: [email protected]
>>>> Subject: job level output committer in storage handler
>>>>
>>>> Hi,
>>>>
>>>> I am implementing my own SerDe and storage handler. Is there any
>>>> method in one of these interfaces (or any other) which gives me a
>>>> handle to do some operation after all the records have been written
>>>> by all reducers? Something very similar to a job-level output
>>>> committer. I want to update some state in an external system once I
>>>> know the job has completed successfully. Ideally, I would do this kind
>>>> of thing in a job-level output committer, but since Hive is on the old
>>>> MR API, I don't have access to that. There is Hive's
>>>> RecordWriter#close(); I tried that, but it looks like it's a task-level
>>>> handle, so every reducer will try to update the state of my external
>>>> system, which is not what I want.
>>>> Any pointers on how to achieve this will be much appreciated. If it's
>>>> unclear what I am asking for, let me know and I will provide more
>>>> details.
>>>>
>>>> Thanks,
>>>> Ashutosh
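The every-reducer problem from the original question — RecordWriter#close() firing once per task — is sometimes worked around with a run-once guard on shared storage: each task's close() tries to atomically create a marker file, and only the winner performs the external-state update. This is a generic pattern, not a Hive API; a minimal local-filesystem sketch (on HDFS you would use the Hadoop FileSystem API instead of java.nio, and the marker name here is made up):

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative run-once guard: exactly one caller "claims" the
// job-level side effect by atomically creating a marker file.
public class RunOnceGuard {

    /**
     * Returns true for exactly one caller per marker path. Files.createFile
     * is an atomic create-if-absent, so concurrent tasks cannot both win.
     */
    public static boolean tryClaim(Path marker) throws IOException {
        try {
            Files.createFile(marker);   // atomic create-if-absent
            return true;                // this caller does the external update
        } catch (FileAlreadyExistsException e) {
            return false;               // another task already claimed it
        }
    }
}
```

Task retries make this fragile (a task can win the claim and then fail before finishing the update), so the external update would need to be idempotent regardless — which is why a real job-level hook, as discussed above, is the better answer.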

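Ning's description of the rename-based commit in FileSinkOperator.closeOp() — write to a temp name, rename to the final name only on success — can be illustrated outside Hive with java.nio.file. A minimal sketch; the paths and data are invented for the example:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustration of rename-as-commit: until the move happens, readers
// never see the final path, so a failed writer leaves at most a temp
// file to clean up.
public class RenameCommit {

    /**
     * Writes data under a temporary name, then "commits" by renaming it
     * to the final path. The move is the commit point.
     */
    public static void writeCommitted(Path finalPath, byte[] data) throws IOException {
        Path tmp = finalPath.resolveSibling(finalPath.getFileName() + ".tmp");
        Files.write(tmp, data);                       // task output goes to the temp name
        Files.move(tmp, finalPath,                    // commit: atomic where the FS supports it
                   StandardCopyOption.ATOMIC_MOVE);
    }
}
```

MoveTask applies the same idea at directory granularity: the per-job results directory is moved to the destination path only after all tasks have finished, which is what gives the whole job its commit semantics.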