From the point of view of general source/sink development, this code looks reasonable, except for a few violations of https://beam.apache.org/contribute/ptransform-style-guide/ (mainly around https://beam.apache.org/contribute/ptransform-style-guide/#runtime-errors-and-data-consistency) and other easily fixable things.
Would be good to get some extra input from people experienced with Hive on whether this is the right API or if there are any pitfalls to avoid. Feel free to send a PR! Thanks!

On Tue, May 23, 2017 at 4:36 PM Seshadri Raghunathan <[email protected]> wrote:

> Hi,
>
> You can find a draft implementation of the same here:
>
> HiveIO Source -
> https://github.com/seshadri-cr/beam/commit/b74523c13e03dc70038bc1e348ce270fbb3fd99b
>
> HiveIO Sink -
> https://github.com/seshadri-cr/beam/commit/0008f772a989c8cd817a99987a145fbf2f7fc795
>
> Please let us know your comments and suggestions.
>
> Regards,
> Seshadri
> 408 601 7548
>
> From: Madhusudan Borkar [mailto:[email protected]]
> Sent: Tuesday, May 23, 2017 3:12 PM
> To: [email protected]; Seshadri Raghunathan <[email protected]>;
> Rajesh Pandey <[email protected]>
> Subject: [New Proposal] Hive connector using native api
>
> Hi,
>
> HadoopIO can be used to read from Hive, but it does not provide a way to
> write to Hive. This new proposal for a Hive connector includes both a
> source and a sink, built on the Hive native API.
>
> Apache HCatalog provides a way to read from and write to Hive without
> using MapReduce. HCatReader reads data from the cluster using its basic
> storage abstraction of tables and rows; HCatWriter writes to the cluster,
> and a batching process will be used to write in bulk. Please refer to the
> Apache documentation on HCatalog ReaderWriter:
> https://cwiki.apache.org/confluence/display/Hive/HCatalog+ReaderWriter
>
> Solution:
>
> It will work like:
>
> pipeline.apply(HiveIO.read()
>     .withMetastoreUri("uri")        // mandatory
>     .withTable("myTable")           // mandatory
>     .withDatabase("myDb")           // optional, assumes "default" if none specified
>     .withPartition("partition"));   // optional, should be specified if the table is partitioned
>
> pipeline.apply(HiveIO.write()
>     .withMetastoreUri("uri")        // mandatory
>     .withTable("myTable")           // mandatory
>     .withDatabase("myDb")           // optional, assumes "default" if none specified
>     .withPartition("partition")     // optional
>     .withBatchSize(size));          // optional
>
> Please let us know your comments and suggestions.
>
> Madhu Borkar
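For reference, below is a minimal, single-JVM sketch of the HCatalog ReaderWriter pattern the proposal builds on, following the wiki page linked above. The database/table names, the metastore URI, and the split number are placeholder assumptions, and the master/slave hand-off of ReaderContext and WriterContext is simulated in one process rather than serialized across machines.

// A minimal sketch of the HCatalog ReaderWriter API (org.apache.hive.hcatalog,
// Hive 0.14+ package names). All names below are placeholders.
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.data.transfer.DataTransferFactory;
import org.apache.hive.hcatalog.data.transfer.HCatReader;
import org.apache.hive.hcatalog.data.transfer.HCatWriter;
import org.apache.hive.hcatalog.data.transfer.ReadEntity;
import org.apache.hive.hcatalog.data.transfer.ReaderContext;
import org.apache.hive.hcatalog.data.transfer.WriteEntity;
import org.apache.hive.hcatalog.data.transfer.WriterContext;

public class HCatalogReaderWriterSketch {

  public static void main(String[] args) throws Exception {
    Map<String, String> config = new HashMap<>();
    config.put("hive.metastore.uris", "thrift://localhost:9083"); // placeholder

    // Read side: the "master" prepares a ReaderContext describing the splits...
    ReadEntity readEntity =
        new ReadEntity.Builder().withDatabase("mydb").withTable("mytable").build();
    HCatReader masterReader = DataTransferFactory.getHCatReader(readEntity, config);
    ReaderContext readerContext = masterReader.prepareRead();

    // ...and each "slave" reads one numbered split (split 0 here).
    HCatReader slaveReader = DataTransferFactory.getHCatReader(readerContext, 0);
    Iterator<HCatRecord> records = slaveReader.read();

    // Write side: the "master" prepares a WriterContext...
    WriteEntity writeEntity =
        new WriteEntity.Builder().withDatabase("mydb").withTable("mytable_copy").build();
    HCatWriter masterWriter = DataTransferFactory.getHCatWriter(writeEntity, config);
    WriterContext writerContext = masterWriter.prepareWrite();

    // ...a "slave" writes a batch of records against that context...
    HCatWriter slaveWriter = DataTransferFactory.getHCatWriter(writerContext);
    slaveWriter.write(records);

    // ...and the "master" commits (or aborts) the write as a whole.
    masterWriter.commit(writerContext);
  }
}

The prepareRead()/prepareWrite() contexts are the serializable hand-off points, which is presumably what a Beam HiveIO would ship from the pipeline coordinator to workers: one split per reader in read(), and write-then-commit semantics for the data consistency concerns the style guide section above calls out.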
