Hello,

I created a new JIRA for the native implementation of this IO, so feel
free to open a PR for the 'native' implementation against this ticket:
https://issues.apache.org/jira/browse/BEAM-2357

We will discuss all the small details in the PR.

The old JIRA (BEAM-1158) will stay open just to add the read example
for HCatalog using HIFIO.

Regards,
Ismaël


On Wed, May 24, 2017 at 8:03 AM, Jean-Baptiste Onofré <[email protected]> wrote:
> Hi,
>
> It looks good. I just saw some issues:
>
> - the javadoc is not correct in HiveIO (it says write() for read ;)).
> - the estimated size is global to the table (it doesn't consider the
> filter). It's not a big deal, but it should be documented.
> - you don't use the desired bundle size provided by the runner for the
> split; you use the Hive split count, which is fine, just explain it in
> the main javadoc maybe.
> - the reader should set current to null when nothing is read.
> - getCurrent() should throw NoSuchElementException when current is null.
> - in the writer, the flush should happen at the end of the batch as you
> did, but also when the bundle is finished (a quick sketch of these last
> points is below).
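>
> A minimal sketch of the reader/writer contracts I mean, in plain Java
> (SketchReader, SketchWriter, batchSize, etc. are illustrative names,
> not from your draft):
>
>   import java.util.ArrayList;
>   import java.util.Iterator;
>   import java.util.List;
>   import java.util.NoSuchElementException;
>
>   // Reader side: clear current when nothing is read, and make
>   // getCurrent() fail loudly when there is no current record.
>   class SketchReader<T> {
>     private final Iterator<T> records;
>     private T current;
>
>     SketchReader(Iterator<T> records) {
>       this.records = records;
>     }
>
>     boolean advance() {
>       if (records.hasNext()) {
>         current = records.next();
>         return true;
>       }
>       current = null;  // nothing was read: reset current
>       return false;
>     }
>
>     T getCurrent() {
>       if (current == null) {
>         throw new NoSuchElementException("no current record");
>       }
>       return current;
>     }
>   }
>
>   // Writer side: flush when a batch fills up, and once more when the
>   // bundle finishes so the tail of the last batch is not lost.
>   class SketchWriter<T> {
>     private final List<T> batch = new ArrayList<>();
>     private final int batchSize;
>
>     SketchWriter(int batchSize) {
>       this.batchSize = batchSize;
>     }
>
>     void write(T record) {
>       batch.add(record);
>       if (batch.size() >= batchSize) {
>         flush();
>       }
>     }
>
>     void finishBundle() {
>       flush();  // pick up whatever is left in the last partial batch
>     }
>
>     private void flush() {
>       // hand the buffered records to the underlying writer here
>       batch.clear();
>     }
>   }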
>
> Thanks !
> Great work
>
> Regards
> JB
>
>
> On 05/24/2017 01:36 AM, Seshadri Raghunathan wrote:
>>
>> Hi,
>>
>> You can find a draft implementation here:
>>
>> HiveIO Source -
>> https://github.com/seshadri-cr/beam/commit/b74523c13e03dc70038bc1e348ce270fbb3fd99b
>>
>> HiveIO Sink -
>> https://github.com/seshadri-cr/beam/commit/0008f772a989c8cd817a99987a145fbf2f7fc795
>>
>> Please let us know your comments and suggestions.
>>
>> Regards,
>> Seshadri
>> 408 601 7548
>> From: Madhusudan Borkar [mailto:[email protected]]
>> Sent: Tuesday, May 23, 2017 3:12 PM
>> To: [email protected]; Seshadri Raghunathan <[email protected]>;
>> Rajesh Pandey <[email protected]>
>> Subject: [New Proposal] Hive connector using native api
>>
>> Hi,
>>
>> HadoopIO can be used to read from Hive, but it does not provide a way
>> to write to Hive. This new proposal for a Hive connector includes both
>> a source and a sink, using the Hive native API.
>>
>> Apache HCatalog provides a way to read from and write to Hive without
>> using MapReduce. HCatReader reads data from the cluster using the
>> basic storage abstractions of tables and rows. HCatWriter writes to
>> the cluster, batching records so that writes happen in bulk. Please
>> refer to the Apache documentation on HCatalog ReaderWriter:
>> https://cwiki.apache.org/confluence/display/Hive/HCatalog+ReaderWriter
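>>
>> For reference, the master/slave flow from that page looks roughly like
>> this (a sketch only: "config" and "slaveNum" are placeholders, error
>> handling is omitted, and package names follow recent Hive releases):
>>
>>   import java.util.Iterator;
>>   import java.util.Map;
>>   import org.apache.hive.hcatalog.data.HCatRecord;
>>   import org.apache.hive.hcatalog.data.transfer.DataTransferFactory;
>>   import org.apache.hive.hcatalog.data.transfer.HCatReader;
>>   import org.apache.hive.hcatalog.data.transfer.HCatWriter;
>>   import org.apache.hive.hcatalog.data.transfer.ReadEntity;
>>   import org.apache.hive.hcatalog.data.transfer.ReaderContext;
>>   import org.apache.hive.hcatalog.data.transfer.WriteEntity;
>>   import org.apache.hive.hcatalog.data.transfer.WriterContext;
>>
>>   class HCatalogSketch {
>>     static void roundTrip(Map<String, String> config, int slaveNum)
>>         throws Exception {
>>       // Master: describe the table and get a serializable read context.
>>       ReadEntity readEntity =
>>           new ReadEntity.Builder().withTable("myTable").build();
>>       HCatReader masterReader =
>>           DataTransferFactory.getHCatReader(readEntity, config);
>>       ReaderContext readCtx = masterReader.prepareRead();
>>
>>       // Slave: each worker reads the records of its split.
>>       Iterator<HCatRecord> records =
>>           DataTransferFactory.getHCatReader(readCtx, slaveNum).read();
>>
>>       // Writing mirrors this: prepareWrite() on the master, write()
>>       // on the slaves, then commit() back on the master.
>>       WriteEntity writeEntity =
>>           new WriteEntity.Builder().withTable("myTable").build();
>>       HCatWriter masterWriter =
>>           DataTransferFactory.getHCatWriter(writeEntity, config);
>>       WriterContext writeCtx = masterWriter.prepareWrite();
>>       DataTransferFactory.getHCatWriter(writeCtx).write(records);
>>       masterWriter.commit(writeCtx);
>>     }
>>   }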
>>
>> Solution:
>>
>> The API will look like this:
>>
>>   pipeline.apply(HiveIO.read()
>>       .withMetastoreUri("uri")      // mandatory
>>       .withTable("myTable")         // mandatory
>>       .withDatabase("myDb")         // optional, assumes default if none specified
>>       .withPartition("partition"))  // optional, should be specified if
>>                                     // the table is partitioned
>>
>>   pipeline.apply(HiveIO.write()
>>       .withMetastoreUri("uri")      // mandatory
>>       .withTable("myTable")         // mandatory
>>       .withDatabase("myDb")         // optional, assumes default if none specified
>>       .withPartition("partition")   // optional
>>       .withBatchSize(size))         // optional
>>
>> Please let us know your comments and suggestions.
>>
>> Madhu Borkar
>>
>>
>
> --
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com
