Thanks, Anton, for contributing this! It's great progress that Beam SQL can connect to the Hive Metastore! The HCatalogTableProvider implementation is also a good reference for people who want to implement a table provider for their own metastore services.
Just adding another design discussion that I am aware of: figuring out the better way to manage the AutoService table provider registration approach versus the DDL approach in the JDBC driver code path.

-Rui

On Thu, Feb 14, 2019 at 11:42 AM Anton Kedin <[email protected]> wrote:

> Hi dev@,
>
> A quick update about a new Beam SQL feature.
>
> In short, we have wired up support for plugging table providers in
> through the Beam SQL API, to allow obtaining table schemas from external
> sources.
>
> *What does it even mean?*
>
> Previously, in Java pipelines, you could apply a Beam SQL query to
> existing PCollections. We have a special SqlTransform to do that; it
> converts a SQL query to an equivalent PTransform that is applied to the
> PCollection of Rows.
>
> One major inconvenience in this approach is that to query something, it
> has to be a PCollection. I.e., you have to read the data from a specific
> source and then convert it to rows. This can mean multiple complications,
> like manually converting schemas from the source to Beam, or needing
> completely different logic when changing the source.
>
> The new API allows you to plug in a schema provider that can resolve
> tables and schemas automatically if they already exist somewhere else.
> This way Beam SQL, with the help of the provider, does the table lookup,
> then the IO configuration, and then the schema conversion if needed.
>
> As an example, here's a query [1] that joins 2 existing PCollections
> with a table from Hive using HCatalogTableProvider. The Hive table lookup
> is automatic: the table provider in this case resolves the tables by
> talking to the Hive Metastore and reads the data by configuring and
> applying HCatalogIO, converting the records to Rows under the hood.
>
> *What's the status of this?*
>
> This is a working implementation, but development is still ongoing:
> there are bugs, the API might change, and there are a few more things I
> can see coming related to this after further design discussions:
>
> * refactoring the underlying table/metadata provider code;
> * working out the design for supporting creating / updating tables in
> the metadata provider;
> * creating a DDL syntax for it;
> * creating more providers.
>
> [1]
> https://github.com/apache/beam/blob/116600f32013620e748723b8022a7023fa8e2528/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamSqlHiveSchemaTest.java#L175,L190
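For readers new to the existing path Anton describes, a minimal sketch of applying SqlTransform to a PCollection of Rows. The schema and field names are made up for the example; a single input PCollection is addressed in the query under the built-in PCOLLECTION name.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.extensions.sql.SqlTransform;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.Row;

    public class SqlOnPCollection {
      public static void main(String[] args) {
        Pipeline pipeline =
            Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Example schema; real pipelines derive this from their source.
        Schema schema =
            Schema.builder().addInt64Field("id").addStringField("name").build();

        PCollection<Row> rows =
            pipeline.apply(
                Create.of(
                        Row.withSchema(schema).addValues(1L, "foo").build(),
                        Row.withSchema(schema).addValues(2L, "bar").build())
                    .withRowSchema(schema));

        // The single input PCollection is visible to the query as PCOLLECTION.
        PCollection<Row> filtered =
            rows.apply(SqlTransform.query(
                "SELECT id, name FROM PCOLLECTION WHERE id > 1"));

        pipeline.run().waitUntilFinish();
      }
    }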
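And a hedged sketch of the new provider path, shaped after the join Anton links in [1]: two PCollections joined with a Hive table resolved through HCatalogTableProvider. This is not lifted from the linked test; the HCatalogTableProvider.create(...) factory, the metastore config key, and the table names are assumptions from memory, so check [1] for the exact usage.

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.beam.sdk.extensions.sql.SqlTransform;
    import org.apache.beam.sdk.extensions.sql.meta.provider.hcatalog.HCatalogTableProvider;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionTuple;
    import org.apache.beam.sdk.values.Row;
    import org.apache.beam.sdk.values.TupleTag;

    public class HiveJoinSketch {
      static PCollection<Row> join(
          PCollection<Row> peopleRows, PCollection<Row> orderRows) {
        // Assumed config key; point it at your metastore's Thrift endpoint.
        Map<String, String> hiveConfig = new HashMap<>();
        hiveConfig.put("hive.metastore.uris", "thrift://localhost:9083");

        // Tuple tags name the in-pipeline tables as seen by the query;
        // tables under the "hive" prefix are resolved by the provider.
        return PCollectionTuple.of(new TupleTag<Row>("people"), peopleRows)
            .and(new TupleTag<Row>("orders"), orderRows)
            .apply(
                SqlTransform.query(
                        "SELECT p.name, o.amount, h.region "
                            + "FROM people p "
                            + "JOIN orders o ON o.person_id = p.id "
                            + "JOIN hive.`default`.regions h ON h.id = p.region_id")
                    .withTableProvider(
                        "hive", HCatalogTableProvider.create(hiveConfig)));
      }
    }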
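On the AutoService registration approach mentioned at the top of this reply: providers annotated with @AutoService(TableProvider.class) can be discovered via Java's ServiceLoader (e.g. by the JDBC driver) without explicit registration or DDL. A hypothetical sketch only; the provider class is invented, and the InMemoryMetaTableProvider base class and package locations are assumptions that may differ across Beam versions.

    import com.google.auto.service.AutoService;
    import org.apache.beam.sdk.extensions.sql.BeamSqlTable;
    import org.apache.beam.sdk.extensions.sql.meta.Table;
    import org.apache.beam.sdk.extensions.sql.meta.provider.InMemoryMetaTableProvider;
    import org.apache.beam.sdk.extensions.sql.meta.provider.TableProvider;

    // Hypothetical provider for some external metastore. The annotation
    // generates the ServiceLoader metadata so the provider is picked up
    // automatically on the classpath.
    @AutoService(TableProvider.class)
    public class MyMetastoreTableProvider extends InMemoryMetaTableProvider {

      @Override
      public String getTableType() {
        // Type name DDL would reference, e.g.
        // CREATE EXTERNAL TABLE ... TYPE 'my-metastore'.
        return "my-metastore";
      }

      @Override
      public BeamSqlTable buildBeamSqlTable(Table table) {
        // Resolve the schema and configure the IO from the external
        // metastore here (omitted in this sketch).
        throw new UnsupportedOperationException("sketch only");
      }
    }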
