There are two new pieces in the proposal: 1. A Spanner source in SQL (contributions are welcome). 2. A CREATE TABLE statement rather than CREATE EXTERNAL TABLE (the difference is whether the table is assumed to already exist).
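For comparison, a hedged sketch of what the existing CREATE EXTERNAL TABLE extension (linked further down in this thread) looks like for a Pub/Sub source; the project, topic, and payload fields here are illustrative, not from the proposal:

```sql
CREATE EXTERNAL TABLE card_transactions (
  event_timestamp TIMESTAMP,                    -- Pub/Sub publish/event time
  attributes MAP<VARCHAR, VARCHAR>,             -- message attributes
  payload ROW<card_number BIGINT, amount DOUBLE>  -- parsed message body
)
TYPE pubsub
LOCATION 'projects/gcp-project/topics/card-transactions'
TBLPROPERTIES '{"timestampAttributeKey": "eventTimeStamp"}'
```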
There is a table property in the statement already that you can reuse to save your options.

-Rui

On Thu, Mar 5, 2020 at 2:30 AM Taher Koitawala <taher...@gmail.com> wrote:

> Also, auto-creation is not there.
>
> On Thu, Mar 5, 2020 at 3:59 PM Taher Koitawala <taher...@gmail.com> wrote:
>
>> The proposal is to add more sources, and also to add event-time or
>> processing-time enhancements on top of them.
>>
>> On Thu, Mar 5, 2020 at 3:50 PM Andrew Pilloud <apill...@google.com> wrote:
>>
>>> I believe we have this functionality already:
>>> https://beam.apache.org/documentation/dsls/sql/extensions/create-external-table/
>>>
>>> Existing GCP tables can also be loaded through the GCP Data Catalog
>>> metastore. What are you proposing that is new?
>>>
>>> Andrew
>>>
>>> On Thu, Mar 5, 2020, 12:29 AM Taher Koitawala <taher...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>> We have been using Apache Beam extensively to process huge amounts of
>>>> data. While Beam is really powerful and can solve a huge number of use
>>>> cases, a Beam job's development and testing time is significantly high.
>>>>
>>>> This gap can be filled with Beam SQL: a complete SQL-based interface
>>>> can reduce development and testing time to a matter of minutes, and it
>>>> also makes Apache Beam more user friendly, letting a wide audience with
>>>> different analytical skill sets interact with it.
>>>>
>>>> The current Beam SQL still needs to be used programmatically, so I
>>>> propose the following additions/improvements.
>>>>
>>>> *Note: While the examples below are GCP-biased, they apply to other
>>>> sources in a generic manner.*
>>>>
>>>> For example: imagine a user who wants to write a stream processing job
>>>> on Google Cloud Dataflow.
>>>> The user wants to process a stream of credit card transactions from
>>>> Google Cloud Pub/Sub (something like Kafka) and enrich each record of
>>>> the stream with data stored in Google Cloud Spanner; after enrichment,
>>>> the user wishes to write the resulting data to Google Cloud BigQuery.
>>>>
>>>> Given below are the queries the user should be able to fire on Beam;
>>>> the rest should be handled automatically by the framework.
>>>>
>>>> // Infer schema from the Spanner table upon table creation
>>>> CREATE TABLE SPANNER_CARD_INFO
>>>> OPTIONS (
>>>>   ProjectId: "gcp-project",
>>>>   InstanceId: "spanner-instance-id",
>>>>   Database: "some-database",
>>>>   Table: "card_info",
>>>>   CloudResource: "SPANNER",
>>>>   CreateTableIfNotExists: "FALSE"
>>>> )
>>>>
>>>> // Apply the schema to each record read from Pub/Sub, then apply SQL
>>>> CREATE TABLE TRANSACTIONS_PUBSUB_TOPIC
>>>> OPTIONS (
>>>>   ProjectId: "gcp-project",
>>>>   Topic: "card-transactions",
>>>>   CloudResource: "PUBSUB",
>>>>   SubscriptionId: "subscriptionId-1",
>>>>   CreateTopicIfNotExists: "FALSE",
>>>>   CreateSubscriptionIfNotExist: "TRUE",
>>>>   RecordType: "JSON", // Possible values: Avro, JSON, TSV, etc.
>>>>   JsonRecordSchema: "{
>>>>     "CardNumber": "INT",
>>>>     "Amount": "DOUBLE",
>>>>     "eventTimeStamp": "EVENT_TIME"
>>>>   }"
>>>> )
>>>>
>>>> // Create the table in BigQuery if it does not exist, and insert
>>>> CREATE TABLE TRANSACTION_HISTORY
>>>> OPTIONS (
>>>>   ProjectId: "gcp-project",
>>>>   CloudResource: "BIGQUERY",
>>>>   dataset: "dataset1",
>>>>   table: "table1",
>>>>   CreateTableIfNotExists: "TRUE",
>>>>   TableSchema: "{
>>>>     "card_number": "INT",
>>>>     "first_name": "STRING",
>>>>     "last_name": "STRING",
>>>>     "phone": "INT",
>>>>     "city": "STRING",
>>>>     "amount": "FLOAT",
>>>>     "eventtimestamp": "INT"
>>>>   }"
>>>> )
>>>>
>>>> // The actual query that should get translated into a Beam DAG
>>>> INSERT INTO TRANSACTION_HISTORY
>>>> SELECT pubsub.card_number, spanner.first_name, spanner.last_name,
>>>>   spanner.phone, spanner.city, pubsub.amount, pubsub.eventTimeStamp
>>>> FROM TRANSACTIONS_PUBSUB_TOPIC pubsub
>>>> JOIN SPANNER_CARD_INFO spanner
>>>>   ON (pubsub.card_number = spanner.card_number);
>>>>
>>>> Also consider that if any of the sources or sinks change, we only
>>>> change the SQL and we are done.
>>>>
>>>> Please let me know your thoughts about this.
>>>>
>>>> Regards,
>>>> Taher Koitawala
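For comparison with the proposal above, a hedged sketch of how the BigQuery sink in the example could already be declared today with the existing CREATE EXTERNAL TABLE extension (identifiers are illustrative; per the thread, auto-creation of the underlying table and a Spanner type are the missing pieces):

```sql
-- Declare the existing BigQuery table table1 in dataset1 as a Beam SQL sink
CREATE EXTERNAL TABLE transaction_history (
  card_number BIGINT,
  first_name VARCHAR,
  last_name VARCHAR,
  phone BIGINT,
  city VARCHAR,
  amount DOUBLE,
  eventtimestamp BIGINT
)
TYPE bigquery
LOCATION 'gcp-project:dataset1.table1'
```

Unlike the proposed CREATE TABLE with CreateTableIfNotExists, this statement only registers the table in the Beam SQL metastore; the BigQuery table itself is assumed to exist.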