Hi,
You can find a draft implementation of the same here:

HiveIO Source - https://github.com/seshadri-cr/beam/commit/b74523c13e03dc70038bc1e348ce270fbb3fd99b
HiveIO Sink - https://github.com/seshadri-cr/beam/commit/0008f772a989c8cd817a99987a145fbf2f7fc795

Please let us know your comments and suggestions.

Regards,
Seshadri
408 601 7548

From: Madhusudan Borkar [mailto:[email protected]]
Sent: Tuesday, May 23, 2017 3:12 PM
To: [email protected]; Seshadri Raghunathan <[email protected]>; Rajesh Pandey <[email protected]>
Subject: [New Proposal] Hive connector using native api

Hi,

HadoopIO can be used to read from Hive, but it does not provide a way to write to Hive. This new proposal for a Hive connector includes both a source and a sink, built on the Hive native API. Apache HCatalog provides a way to read from and write to Hive without using MapReduce: HCatReader reads data from the cluster using its basic storage abstraction of tables and rows, and HCatWriter writes to the cluster, with a batching process used to write in bulk (sketches of both paths follow at the end of this message). Please refer to the Apache documentation on HCatalog ReaderWriter:
https://cwiki.apache.org/confluence/display/Hive/HCatalog+ReaderWriter

Solution:

It will work like:

pipeline.apply(HiveIO.read()
    .withMetastoreUri("uri")      // mandatory
    .withTable("myTable")         // mandatory
    .withDatabase("myDb")         // optional, assumes default if none specified
    .withPartition("partition"))  // optional, should be specified if the table is partitioned

pipeline.apply(HiveIO.write()
    .withMetastoreUri("uri")      // mandatory
    .withTable("myTable")         // mandatory
    .withDatabase("myDb")         // optional, assumes default if none specified
    .withPartition("partition")   // optional
    .withBatchSize(size))         // optional

Please let us know your comments and suggestions.

Madhu Borkar
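
For reference, the HCatalog ReaderWriter read path mentioned above works roughly as follows. This is a minimal sketch based on the linked wiki page and the Hive HCatalog data-transfer API, not part of the proposal itself; the metastore URI, database, and table names are placeholders.

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.data.transfer.DataTransferFactory;
import org.apache.hive.hcatalog.data.transfer.HCatReader;
import org.apache.hive.hcatalog.data.transfer.ReadEntity;
import org.apache.hive.hcatalog.data.transfer.ReaderContext;

public class HCatReadSketch {
  public static void main(String[] args) throws Exception {
    // Master side: describe the read and obtain a serializable ReaderContext.
    ReadEntity entity = new ReadEntity.Builder()
        .withDatabase("myDb")   // placeholder names
        .withTable("myTable")
        .build();
    Map<String, String> config = new HashMap<>();
    config.put("hive.metastore.uris", "thrift://metastore-host:9083"); // placeholder URI
    HCatReader masterReader = DataTransferFactory.getHCatReader(entity, config);
    ReaderContext context = masterReader.prepareRead();

    // Worker side: each split can be read independently; in the proposed
    // source, a split would naturally map to a Beam bundle.
    for (int split = 0; split < context.numSplits(); split++) {
      Iterator<HCatRecord> records =
          DataTransferFactory.getHCatReader(context, split).read();
      while (records.hasNext()) {
        HCatRecord record = records.next();
        System.out.println(record);   // process the record
      }
    }
  }
}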

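Similarly, here is a minimal sketch of the HCatalog write path that the sink's withBatchSize(size) batching would sit on top of. Again, the names and the single hand-built record are illustrative assumptions, not the proposed implementation.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hive.hcatalog.data.DefaultHCatRecord;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.data.transfer.DataTransferFactory;
import org.apache.hive.hcatalog.data.transfer.HCatWriter;
import org.apache.hive.hcatalog.data.transfer.WriteEntity;
import org.apache.hive.hcatalog.data.transfer.WriterContext;

public class HCatWriteSketch {
  public static void main(String[] args) throws Exception {
    // Master side: describe the target table and obtain a serializable WriterContext.
    WriteEntity entity = new WriteEntity.Builder()
        .withDatabase("myDb")   // placeholder names
        .withTable("myTable")
        .build();
    Map<String, String> config = new HashMap<>();
    config.put("hive.metastore.uris", "thrift://metastore-host:9083"); // placeholder URI
    HCatWriter masterWriter = DataTransferFactory.getHCatWriter(entity, config);
    WriterContext context = masterWriter.prepareWrite();

    // Worker side: buffer records and hand them to HCatWriter in one call;
    // the proposed sink would flush each time the buffer reaches the
    // configured batch size.
    List<HCatRecord> batch = new ArrayList<>();
    batch.add(new DefaultHCatRecord(Arrays.<Object>asList("value1", 42)));
    DataTransferFactory.getHCatWriter(context).write(batch.iterator());

    // Master side: commit once all workers have written successfully.
    masterWriter.commit(context);
  }
}

The serializable ReaderContext / WriterContext is what allows the transfer to be prepared once on one node and then fanned out to distributed workers, which is why this API fits a Beam source/sink without going through MapReduce.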