Hello,

I created the HiveIO JIRA and followed the initial discussions about
the best approach for HiveIO, so I would first suggest that you read
the previous thread(s) on the mailing list.

https://www.mail-archive.com/[email protected]/msg02313.html

The main conclusion I drew from that thread is that a really valuable
way for Beam to access Hive is to read the records exposed via the
catalog of the data, using HCatalog. This approach is much more
interesting because Beam can benefit from multi-runner execution to
process the data exposed by Hive on all the different runners. This
is not the case if we invoke HiveQL (or SQL) queries over Hive via
MapReduce. Note also that you can already do this today in Beam by
using JdbcIO + the specific Hive JDBC configuration.
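To illustrate that last point, here is a rough, untested sketch of a
read through the Hive JDBC interface with JdbcIO; the HiveServer2
URL, table and query are placeholders you would adapt to your
cluster:

    import java.sql.ResultSet;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class HiveOverJdbcRead {
      public static void main(String[] args) {
        Pipeline pipeline =
            Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Placeholder HiveServer2 URL, database and query.
        PCollection<String> names = pipeline.apply(
            JdbcIO.<String>read()
                .withDataSourceConfiguration(
                    JdbcIO.DataSourceConfiguration.create(
                        "org.apache.hive.jdbc.HiveDriver",
                        "jdbc:hive2://localhost:10000/default"))
                .withQuery("SELECT name FROM my_table")
                .withCoder(StringUtf8Coder.of())
                // Map each JDBC row to a value for the PCollection.
                .withRowMapper((ResultSet rs) -> rs.getString(1)));

        pipeline.run().waitUntilFinish();
      }
    }

Keep in mind this still executes the query on the Hive side, which is
exactly the limitation described above.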

It is probably a good idea to take a look at how the Flink connector
does this, because it is essentially the same idea that we want.
https://github.com/apache/flink/tree/master/flink-connectors/flink-hcatalog
(Note: try not to get confused by the names of the classes in Flink
vs. Hive, because they are really similar.)

Also take a look at HadoopIO: since HCatInputFormat is a Hadoop
InputFormat class, you can make a simpler implementation by reusing
the code that is already there, as in the sketch below.
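As a rough, untested sketch of that reuse idea (the metastore URI,
database and table are placeholders, and the coders/translations for
HCatRecord would still need to be sorted out):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.hadoop.inputformat.HadoopInputFormatIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hive.hcatalog.data.HCatRecord;
    import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

    public class HCatViaHadoopInputFormat {
      public static void main(String[] args) throws Exception {
        Pipeline pipeline =
            Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Point HCatInputFormat at the metastore, database and table
        // (all placeholders here).
        Configuration conf = new Configuration();
        conf.set("hive.metastore.uris", "thrift://metastore-host:9083");
        HCatInputFormat.setInput(conf, "default", "my_table");

        // Tell HadoopInputFormatIO which InputFormat and key/value
        // types to expect.
        conf.setClass("mapreduce.job.inputformat.class",
            HCatInputFormat.class,
            org.apache.hadoop.mapreduce.InputFormat.class);
        conf.setClass("key.class", WritableComparable.class, Object.class);
        conf.setClass("value.class", HCatRecord.class, Object.class);

        // In practice we would probably also need withKeyTranslation /
        // withValueTranslation (or a coder for HCatRecord) so Beam can
        // encode the values.
        PCollection<KV<WritableComparable, HCatRecord>> records =
            pipeline.apply(
                HadoopInputFormatIO.<WritableComparable, HCatRecord>read()
                    .withConfiguration(conf));

        pipeline.run().waitUntilFinish();
      }
    }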

So the idea, at least for the read part, would be to build a
PCollection<HCatRecord> from a Hadoop Configuration + the database
name + the table + optionally a filter, and this PCollection would
then be processed in Beam pipelines. The advantage of this approach
is that once the Beam SQL DSL is ready it will integrate perfectly
with this IO, so we can have SQL reading from Hive/HCatalog and
processing on whatever runner the users want.
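To make that more concrete, here is a purely hypothetical sketch of
what the user-facing read could look like (none of these method names
exist yet, they are only illustration):

    // Hypothetical API, nothing here exists yet; the point is only
    // that a read is defined by the Hadoop/metastore properties +
    // database + table + an optional filter.
    Map<String, String> configProperties = new HashMap<>();
    configProperties.put("hive.metastore.uris",
        "thrift://metastore-host:9083");

    PCollection<HCatRecord> records = pipeline.apply(
        HiveIO.read()                              // hypothetical transform
            .withConfigProperties(configProperties)
            .withDatabase("default")               // could default to "default"
            .withTable("my_table")
            .withFilter("year=2017"));             // optional partition filter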

Finally, if you agree with this approach, I think it probably makes
sense to rename the IO to HCatalogIO, as Flink does.

One extra thing: I have not yet looked at the write part, but I
suppose it should be something similar.

Regards,
Ismael.
