Hi,

You can find a draft implementation of the proposal here:

HiveIO Source - https://github.com/seshadri-cr/beam/commit/b74523c13e03dc70038bc1e348ce270fbb3fd99b
HiveIO Sink - https://github.com/seshadri-cr/beam/commit/0008f772a989c8cd817a99987a145fbf2f7fc795

Please let us know your comments and suggestions.

Regards,
Seshadri

408 601 7548

From: Madhusudan Borkar [mailto:[email protected]] 
Sent: Tuesday, May 23, 2017 3:12 PM
To: [email protected]; Seshadri Raghunathan <[email protected]>; Rajesh 
Pandey <[email protected]>
Subject: [New Proposal] Hive connector using native api

Hi,

HadoopIO can be used to read from Hive, but it does not provide a way to write to Hive. This new proposal for a Hive connector includes both a source and a sink, built on the Hive native API.

Apache HCatalog provides a way to read from and write to Hive without using MapReduce. HCatReader reads data from the cluster using HCatalog's basic storage abstraction of tables and rows. HCatWriter writes to the cluster, and a batching process will be used to write records in bulk. Please refer to the Apache documentation on HCatalog ReaderWriter:
https://cwiki.apache.org/confluence/display/Hive/HCatalog+ReaderWriter
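
For reference, here is a condensed sketch of the HCatReader read path described on that wiki page. Class names are from the org.apache.hive.hcatalog.data.transfer package (older releases use org.apache.hcatalog.data.transfer) and the metastore URI is a placeholder; treat this as an illustration of the pattern, not the exact code we propose:

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.hive.hcatalog.common.HCatException;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.data.transfer.DataTransferFactory;
import org.apache.hive.hcatalog.data.transfer.HCatReader;
import org.apache.hive.hcatalog.data.transfer.ReadEntity;
import org.apache.hive.hcatalog.data.transfer.ReaderContext;

public class HCatReadSketch {
  public static void main(String[] args) throws HCatException {
    // Master side: describe the table and prepare the read.
    ReadEntity entity = new ReadEntity.Builder()
        .withDatabase("myDb")
        .withTable("myTable")
        .build();
    Map<String, String> config = new HashMap<String, String>();
    config.put("hive.metastore.uris", "thrift://metastore-host:9083"); // placeholder URI

    ReaderContext context = DataTransferFactory.getHCatReader(entity, config).prepareRead();

    // Worker side: each split can be read independently, which is what
    // makes this API a natural fit for a parallel Beam source.
    for (int split = 0; split < context.numSplits(); split++) {
      HCatReader reader = DataTransferFactory.getHCatReader(context, split);
      Iterator<HCatRecord> records = reader.read();
      while (records.hasNext()) {
        HCatRecord record = records.next();
        // process the record
      }
    }
  }
}

HCatWriter follows the mirror-image pattern: prepareWrite() on the master produces a WriterContext, workers call write() with an iterator of records, and the master commits.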

 

Solution:

The proposed API will work like this:

pipeline.apply(HiveIO.read()
    .withMetastoreUri("uri")       // mandatory
    .withTable("myTable")          // mandatory
    .withDatabase("myDb")          // optional, assumes "default" if none specified
    .withPartition("partition"));  // optional, should be specified if the table is partitioned

pipeline.apply(HiveIO.write()
    .withMetastoreUri("uri")       // mandatory
    .withTable("myTable")          // mandatory
    .withDatabase("myDb")          // optional, assumes "default" if none specified
    .withPartition("partition")    // optional
    .withBatchSize(size));         // optional
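
For context, here is a rough sketch of how the two transforms might compose end to end. It assumes (as seems natural for an HCatalog-based connector) that the element type is HCatalog's HCatRecord; the table names and metastore URI are placeholders, and the exact types are open to discussion:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hive.hcatalog.data.HCatRecord;

Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.create());

// Read every record of a (possibly partitioned) Hive table ...
PCollection<HCatRecord> rows = pipeline.apply(HiveIO.read()
    .withMetastoreUri("thrift://metastore-host:9083")
    .withDatabase("myDb")
    .withTable("sourceTable"));

// ... and write it to another table, batching records through HCatWriter.
rows.apply(HiveIO.write()
    .withMetastoreUri("thrift://metastore-host:9083")
    .withDatabase("myDb")
    .withTable("targetTable")
    .withBatchSize(1024));

pipeline.run();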

 

Please let us know your comments and suggestions.




Madhu Borkar
