This is great. Being able to simply query existing schema-fied data is such a huge win.
Kenn

On Thu, Feb 14, 2019 at 12:30 PM Rui Wang <[email protected]> wrote:

> Thanks Anton for contributing it! It's big progress that Beam SQL can
> connect to the Hive metastore! The HCatalogTableProvider implementation is
> also a good reference for people who want to implement a table provider
> for their metastore services.
>
> Just to add another design discussion that I am aware of: figuring out a
> better way to reconcile the AutoService table provider registration
> approach with the DDL approach in the JDBC driver code path.
>
> -Rui
>
> On Thu, Feb 14, 2019 at 11:42 AM Anton Kedin <[email protected]> wrote:
>
>> Hi dev@,
>>
>> A quick update about a new Beam SQL feature.
>>
>> In short, we have wired up support for plugging table providers into
>> the Beam SQL API, allowing table schemas to be obtained from external
>> sources.
>>
>> *What does it even mean?*
>>
>> Previously, in Java pipelines, you could apply a Beam SQL query to
>> existing PCollections. We have a special SqlTransform for that: it
>> converts a SQL query into an equivalent PTransform that is applied to a
>> PCollection of Rows.
>>
>> One major inconvenience of this approach is that to query something, it
>> has to be a PCollection. I.e., you have to read the data from a specific
>> source and then convert it to Rows. This can mean multiple complications,
>> such as manually converting schemas from the source format to Beam, or
>> rewriting the conversion logic whenever the source changes.
>>
>> The new API allows you to plug in a schema provider that can resolve
>> tables and schemas automatically if they already exist somewhere else.
>> This way Beam SQL, with the help of the provider, does the table lookup,
>> then the IO configuration, and then the schema conversion if needed.
>>
>> As an example, here's a query [1] that joins two existing PCollections
>> with a table from Hive using HCatalogTableProvider. The Hive table
>> lookup is automatic: the table provider in this case resolves the tables
>> by talking to the Hive Metastore and reads the data by configuring and
>> applying HCatalogIO, converting the records to Rows under the hood.
>>
>> *What's the status of this?*
>>
>> This is a working implementation, but development is still ongoing:
>> there are bugs, the API might change, and there are a few more things I
>> can see coming out of further design discussions:
>>
>> * refactoring the underlying table/metadata provider code;
>> * working out the design for supporting creating/updating tables in the
>> metadata provider;
>> * creating a DDL syntax for it;
>> * creating more providers.
>>
>> [1]
>> https://github.com/apache/beam/blob/116600f32013620e748723b8022a7023fa8e2528/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamSqlHiveSchemaTest.java#L175,L190
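[Editor's note] The usage described in the thread above can be sketched roughly as follows. This is a hedged illustration, not the exact code from the linked test: the table names (`customers`, `orders`), the SQL text, the `hive` provider name, and the qualified Hive table name `hive.sales_db.regions` are all illustrative assumptions; only `SqlTransform`, `withTableProvider`, `PCollectionTuple`, and `HCatalogTableProvider` come from the Beam SQL API discussed in the thread. It assumes the Beam SQL and HCatalog extension jars are on the classpath.

```java
import java.util.Map;

import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.extensions.sql.meta.provider.hcatalog.HCatalogTableProvider;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TupleTag;

/** Sketch: join two in-pipeline PCollections of Rows with a Hive table. */
class HiveJoinSketch {

  /**
   * @param customers existing PCollection of Rows, queried as table "customers"
   * @param orders existing PCollection of Rows, queried as table "orders"
   * @param hiveConfig HCatalog configuration properties, e.g. the metastore URI
   *     (illustrative; see HCatalogTableProvider for the actual expected keys)
   */
  static PCollection<Row> joinWithHive(
      PCollection<Row> customers,
      PCollection<Row> orders,
      Map<String, String> hiveConfig) {

    // Tag names become the table names visible to the SQL query.
    return PCollectionTuple
        .of(new TupleTag<>("customers"), customers)
        .and(new TupleTag<>("orders"), orders)
        .apply(
            SqlTransform.query(
                    "SELECT c.name, o.amount, h.region "
                        + "FROM customers c "
                        + "JOIN orders o ON c.id = o.customer_id "
                        // Hive table resolved by the registered provider;
                        // the lookup and HCatalogIO setup happen under the hood.
                        + "JOIN hive.sales_db.regions h ON c.region_id = h.id")
                // Register the provider under the "hive" name so tables can be
                // referenced as hive.<database>.<table>.
                .withTableProvider("hive", HCatalogTableProvider.create(hiveConfig)));
  }
}
```

The key point of the design is visible here: the pipeline code never reads the Hive table explicitly or declares its schema; the provider resolves both from the metastore at query planning time.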
