This is great. Being able to simply query existing schema-fied data is
such a huge win.

Kenn

On Thu, Feb 14, 2019 at 12:30 PM Rui Wang <[email protected]> wrote:

> Thanks Anton for contributing it! It's great progress that BeamSQL can
> connect to the Hive metastore! The HCatalogTableProvider implementation is
> also a good reference for people who want to implement a table provider
> for their metastore services.
>
> Just to add another design discussion that I am aware of:
> figuring out a better way to manage the AutoService table provider
> registration approach and the DDL approach in the JDBC driver code path.
>
> -Rui
>
> On Thu, Feb 14, 2019 at 11:42 AM Anton Kedin <[email protected]> wrote:
>
>> Hi dev@,
>>
>> A quick update about a new Beam SQL feature.
>>
>> In short, we have wired up the support for plugging table providers
>> through Beam SQL API to allow obtaining table schemas from external sources.
>>
>> *What does it even mean?*
>>
>> Previously, in Java pipelines, you could apply a Beam SQL query to
>> existing PCollections. We have a special SqlTransform for that: it
>> converts a SQL query into an equivalent PTransform that is applied to a
>> PCollection of Rows.
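>>
>> For instance (a minimal sketch; `PCOLLECTION` is the default table name
>> Beam SQL gives a single input, while the field names here are
>> illustrative):

```java
// Sketch: querying an existing PCollection<Row> with SqlTransform.
// Assumes `rows` is a PCollection<Row> whose schema has `id` and `name`.
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

PCollection<Row> result =
    rows.apply(
        SqlTransform.query(
            "SELECT id, name FROM PCOLLECTION WHERE id > 10"));
```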
>>
>> One major inconvenience of this approach is that anything you query has
>> to be a PCollection, i.e. you have to read the data from a specific
>> source and then convert it to Rows. This can mean multiple complications,
>> such as manually converting schemas from the source format to Beam, or
>> rewriting that logic completely whenever the source changes.
>>
>> The new API allows you to plug in a table provider that can resolve the
>> tables and schemas automatically if they already exist somewhere else.
>> This way Beam SQL, with the help of the provider, does the table lookup,
>> then the IO configuration, and then the schema conversion if needed.
>>
>> As an example, here's a query
>> <https://github.com/apache/beam/blob/116600f32013620e748723b8022a7023fa8e2528/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamSqlHiveSchemaTest.java#L175,L190>[1]
>> that joins two existing PCollections with a table from Hive using
>> HCatalogTableProvider. The Hive table lookup is automatic: the table
>> provider in this case resolves the tables by talking to the Hive
>> Metastore and reads the data by configuring and applying HCatalogIO,
>> converting the records to Rows under the hood.
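>>
>> A rough sketch of what that wiring can look like (method names, the
>> provider id, config keys, and field names here are assumptions for
>> illustration; see the linked test [1] for the real code):

```java
// Sketch: joining a named PCollection with a Hive table resolved through
// HCatalogTableProvider. `rows` is an existing PCollection<Row>, and
// `configProperties` is a Map<String, String> of Hive metastore settings
// (e.g. the metastore URI); both are assumed to be defined elsewhere.
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.extensions.sql.meta.provider.hcatalog.HCatalogTableProvider;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TupleTag;

PCollection<Row> joined =
    PCollectionTuple.of(new TupleTag<>("my_rows"), rows)
        .apply(
            SqlTransform.query(
                    "SELECT r.id, h.f_str FROM my_rows r "
                        + "JOIN hive_table h ON r.id = h.f_int")
                .withDefaultTableProvider(
                    "hcatalog",
                    HCatalogTableProvider.create(configProperties)));
```

Here `my_rows` is the name the query uses for the tagged PCollection, while
`hive_table` is resolved through the provider by talking to the metastore.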
>>
>> *What's the status of this?*
>>
>> This is a working implementation, but development is still ongoing:
>> there are bugs, the API might change, and there are a few more things I
>> can see coming related to this after further design discussions:
>>
>>  * a refactoring of the underlying table/metadata provider code;
>>  * working out the design for supporting creating/updating tables in
>> the metadata provider;
>>  * a DDL syntax for it;
>>  * more providers.
>>
>> [1]
>> https://github.com/apache/beam/blob/116600f32013620e748723b8022a7023fa8e2528/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamSqlHiveSchemaTest.java#L175,L190
>>
>
