Thanks all for your review and comments. I will raise a PR against https://issues.apache.org/jira/browse/BEAM-2357
Regards,
Seshadri
408 601 7548

-----Original Message-----
From: Ismaël Mejía [mailto:[email protected]]
Sent: Wednesday, May 24, 2017 2:57 AM
To: [email protected]
Subject: Re: [New Proposal] Hive connector using native api

One quick thing I forgot to mention is that maybe it is a good idea for the guys working on the Beam SQL implementation to take a look at their needs for this IO, considering that it could be quite useful for testing the SQL (given the structured nature of HCatalog).

On Wed, May 24, 2017 at 11:54 AM, Ismaël Mejía <[email protected]> wrote:
> Hello,
>
> I created a new JIRA for this native implementation of the IO, so feel
> free to PR the 'native' implementation using this ticket.
> https://issues.apache.org/jira/browse/BEAM-2357
>
> We will discuss all the small details in the PR.
>
> The old JIRA (BEAM-1158) will still be there, just to add the read
> example for HCatalog using HIFIO.
>
> Regards,
> Ismaël
>
>
> On Wed, May 24, 2017 at 8:03 AM, Jean-Baptiste Onofré <[email protected]> wrote:
>> Hi,
>>
>> It looks good. I just saw some issues:
>>
>> - the javadoc is not correct in HiveIO (it says write() for read ;))
>> - the estimated size is global to the table (it doesn't consider the
>>   filter). It's not a big deal, but it should be documented.
>> - you don't use the desired bundle size provided by the runner for
>>   the split. You are using the Hive split count, which is fine, just
>>   explain it in the main javadoc maybe.
>> - the reader should set current to null when nothing is read
>> - getCurrent() should throw NoSuchElementException in case current
>>   is null
>> - in the writer, the flush should happen at the end of the batch as
>>   you did, but also when the bundle is finished
>>
>> Thanks !
>> Great work
>>
>> Regards
>> JB
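A rough sketch of what the last three review points could look like in code, assuming a BoundedSource.BoundedReader on the read side and a simple batching DoFn on the write side; the class names, the record iterator, and the flush target are illustrative placeholders and are not taken from the draft implementation:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import org.apache.beam.sdk.io.BoundedSource;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.hive.hcatalog.data.HCatRecord;

// Read side: keep "current" null until a record has been read, and make
// getCurrent() fail with NoSuchElementException instead of returning null.
class HiveRecordReader extends BoundedSource.BoundedReader<HCatRecord> {
  private final BoundedSource<HCatRecord> source;
  private final Iterator<HCatRecord> records; // the split's records, however they are obtained
  private HCatRecord current;                 // stays null until something is read

  HiveRecordReader(BoundedSource<HCatRecord> source, Iterator<HCatRecord> records) {
    this.source = source;
    this.records = records;
  }

  @Override public boolean start() throws IOException { return advance(); }

  @Override public boolean advance() throws IOException {
    if (records.hasNext()) {
      current = records.next();
      return true;
    }
    current = null; // nothing was read, reset current
    return false;
  }

  @Override public HCatRecord getCurrent() throws NoSuchElementException {
    if (current == null) {
      throw new NoSuchElementException(); // instead of silently returning null
    }
    return current;
  }

  @Override public void close() throws IOException {}

  @Override public BoundedSource<HCatRecord> getCurrentSource() { return source; }
}

// Write side: flush when the batch reaches the configured size, and again when
// the bundle finishes, so the tail of a bundle is never lost.
class BatchingWriteFn extends DoFn<HCatRecord, Void> {
  private final int batchSize;
  private transient List<HCatRecord> batch;

  BatchingWriteFn(int batchSize) { this.batchSize = batchSize; }

  @StartBundle public void startBundle() { batch = new ArrayList<>(); }

  @ProcessElement public void processElement(ProcessContext c) throws Exception {
    batch.add(c.element());
    if (batch.size() >= batchSize) {
      flush(); // end of a batch
    }
  }

  @FinishBundle public void finishBundle() throws Exception {
    flush(); // also flush whatever is left when the bundle is finished
  }

  private void flush() throws Exception {
    if (batch.isEmpty()) {
      return;
    }
    // hand the batch to HCatWriter.write(batch.iterator()) here (omitted)
    batch.clear();
  }
}

The same flush-on-finishBundle idea applies however the writer ends up being structured in the actual PR.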
>> On 05/24/2017 01:36 AM, Seshadri Raghunathan wrote:
>>>
>>> Hi,
>>>
>>> You can find a draft implementation of the same here:
>>>
>>> HiveIO Source -
>>> https://github.com/seshadri-cr/beam/commit/b74523c13e03dc70038bc1e348ce270fbb3fd99b
>>>
>>> HiveIO Sink -
>>> https://github.com/seshadri-cr/beam/commit/0008f772a989c8cd817a99987a145fbf2f7fc795
>>>
>>> Please let us know your comments and suggestions.
>>>
>>> Regards,
>>> Seshadri
>>> 408 601 7548
>>>
>>> From: Madhusudan Borkar [mailto:[email protected]]
>>> Sent: Tuesday, May 23, 2017 3:12 PM
>>> To: [email protected]; Seshadri Raghunathan <[email protected]>; Rajesh Pandey <[email protected]>
>>> Subject: [New Proposal] Hive connector using native api
>>>
>>> Hi,
>>>
>>> HadoopIO can be used to read from Hive, but it doesn't provide a way to
>>> write to Hive. This new proposal for a Hive connector includes both a
>>> source and a sink, using the native Hive API.
>>>
>>> Apache HCatalog provides a way to read from / write to Hive without using
>>> MapReduce. HCatReader reads data from the cluster using the basic storage
>>> abstraction of tables and rows. HCatWriter writes to the cluster, and a
>>> batching process will be used to write in bulk. Please refer to the
>>> Apache documentation on HCatalog ReaderWriter:
>>> https://cwiki.apache.org/confluence/display/Hive/HCatalog+ReaderWriter
>>>
>>> Solution:
>>>
>>> It will work like this:
>>>
>>> pipeline.apply(HiveIO.read()
>>>     .withMetastoreUri("uri") // mandatory
>>>     .withTable("myTable") // mandatory
>>>     .withDatabase("myDb") // optional, assumes default if none specified
>>>     .withPartition("partition")) // optional, should be specified if the table is partitioned
>>>
>>> pipeline.apply(HiveIO.write()
>>>     .withMetastoreUri("uri") // mandatory
>>>     .withTable("myTable") // mandatory
>>>     .withDatabase("myDb") // optional, assumes default if none specified
>>>     .withPartition("partition") // optional
>>>     .withBatchSize(size)) // optional
>>>
>>> Please let us know your comments and suggestions.
>>>
>>> Madhu Borkar
>>>
>>
>> --
>> Jean-Baptiste Onofré
>> [email protected]
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
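For reference, a minimal sketch of the HCatalog ReaderWriter API the proposal above builds on, following the wiki page linked in the proposal; the metastore URI, database, and table names are placeholders, and error handling is omitted:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.data.transfer.DataTransferFactory;
import org.apache.hive.hcatalog.data.transfer.HCatReader;
import org.apache.hive.hcatalog.data.transfer.HCatWriter;
import org.apache.hive.hcatalog.data.transfer.ReadEntity;
import org.apache.hive.hcatalog.data.transfer.ReaderContext;
import org.apache.hive.hcatalog.data.transfer.WriteEntity;
import org.apache.hive.hcatalog.data.transfer.WriterContext;

public class HCatalogReaderWriterSketch {

  public static void main(String[] args) throws Exception {
    // Placeholders: metastore URI, database, and table names.
    Map<String, String> config = new HashMap<>();
    config.put("hive.metastore.uris", "thrift://localhost:9083");

    // Read: prepare on the "master", then read each split independently.
    ReadEntity readEntity =
        new ReadEntity.Builder().withDatabase("mydb").withTable("mytable").build();
    HCatReader masterReader = DataTransferFactory.getHCatReader(readEntity, config);
    ReaderContext readerContext = masterReader.prepareRead();

    List<HCatRecord> records = new ArrayList<>();
    for (int split = 0; split < readerContext.numSplits(); split++) {
      HCatReader splitReader = DataTransferFactory.getHCatReader(readerContext, split);
      Iterator<HCatRecord> it = splitReader.read();
      while (it.hasNext()) {
        records.add(it.next());
      }
    }

    // Write: prepare on the "master", write batches on the "workers", commit on the "master".
    WriteEntity writeEntity =
        new WriteEntity.Builder().withDatabase("mydb").withTable("mytable_copy").build();
    HCatWriter masterWriter = DataTransferFactory.getHCatWriter(writeEntity, config);
    WriterContext writerContext = masterWriter.prepareWrite();

    HCatWriter workerWriter = DataTransferFactory.getHCatWriter(writerContext);
    workerWriter.write(records.iterator()); // one batch per call
    masterWriter.commit(writerContext);
  }
}

Each HCatalog split could map naturally to one Beam source split, which also lines up with JB's note above about relying on the Hive split count rather than the runner's desired bundle size.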
