I'm not following the "CREATE TABLE" vs "CREATE EXTERNAL TABLE"
distinction. We added the "EXTERNAL" to make it clear that Beam wasn't
storing the table. Most of our current table providers will create the
underlying table as needed.
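For reference, a Pub/Sub table under the existing CREATE EXTERNAL TABLE extension looks roughly like this (a sketch based on the create-external-table docs; the topic path and payload columns are illustrative, and Pub/Sub tables expect the event_timestamp/attributes/payload column shape):

```sql
-- Sketch of the existing extension's syntax; names here are illustrative.
CREATE EXTERNAL TABLE card_transactions (
  event_timestamp TIMESTAMP,
  attributes MAP<VARCHAR, VARCHAR>,
  payload ROW<card_number BIGINT, amount DOUBLE>
)
TYPE pubsub
LOCATION 'projects/gcp-project/topics/card-transactions'
TBLPROPERTIES '{"timestampAttributeKey": "eventTimeStamp"}'
```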

Andrew

On Thu, Mar 5, 2020 at 10:47 AM Rui Wang <ruw...@google.com> wrote:

> There are two pieces of news from the proposal:
> 1. Spanner source in SQL. (Welcome to contribute it)
> 2. A CREATE TABLE statement rather than CREATE EXTERNAL TABLE (the
> difference is whether the table is assumed to already exist)
>
>
> There is already a table-properties clause in the statement that you can
> reuse to save your options.
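>
> To illustrate, the proposed per-table flags could ride along in that
> property bag rather than as new top-level syntax; a sketch (the
> "createIfNotExists" key is hypothetical, not an existing Beam property):
>
> CREATE EXTERNAL TABLE transaction_history (
>   card_number BIGINT,
>   amount DOUBLE
> )
> TYPE bigquery
> LOCATION 'gcp-project:dataset1.table1'
> TBLPROPERTIES '{"createIfNotExists": "TRUE"}'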
>
>
> -Rui
>
> On Thu, Mar 5, 2020 at 2:30 AM Taher Koitawala <taher...@gmail.com> wrote:
>
>> Also, auto-creation is not there
>>
>> On Thu, Mar 5, 2020 at 3:59 PM Taher Koitawala <taher...@gmail.com>
>> wrote:
>>
>>> The proposal is to add more sources and also to add further event-time
>>> and processing-time enhancements on top of them
>>>
>>> On Thu, Mar 5, 2020 at 3:50 PM Andrew Pilloud <apill...@google.com>
>>> wrote:
>>>
>>>> I believe we have this functionality already:
>>>> https://beam.apache.org/documentation/dsls/sql/extensions/create-external-table/
>>>>
>>>> Existing GCP tables can also be loaded through the GCP Data Catalog
>>>> metastore. What are you proposing that is new?
>>>>
>>>> Andrew
>>>>
>>>>
>>>> On Thu, Mar 5, 2020, 12:29 AM Taher Koitawala <taher...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>>          We have been using Apache Beam extensively to process huge
>>>>> amounts of data. While Beam is really powerful and can solve a huge number
>>>>> of use cases, a Beam job's development and testing time is significantly
>>>>> high.
>>>>>
>>>>>    This gap can be filled with Beam SQL, where a complete SQL-based
>>>>> interface can reduce development and testing time to a matter of minutes.
>>>>> It also makes Apache Beam more user friendly, letting a wide audience with
>>>>> different analytical skill sets interact with it.
>>>>>
>>>>> The current Beam SQL still needs to be used programmatically, so I
>>>>> propose the following additions/improvements.
>>>>>
>>>>> *Note: While the examples below are GCP-biased, they apply generically
>>>>> to other sources*
>>>>>
>>>>> For example: imagine a user who wants to write a stream processing job
>>>>> on Google Cloud Dataflow. The user wants to process credit card transaction
>>>>> streams from Google Cloud Pub/Sub (something like Kafka) and enrich each
>>>>> record of the stream with data stored in Google Cloud Spanner; after
>>>>> enrichment, the user wishes to write the resulting data to Google Cloud
>>>>> BigQuery.
>>>>>
>>>>> Given below are the queries the user should be able to fire on Beam;
>>>>> the rest should be handled automatically by the framework.
>>>>>
>>>>> // Infer the schema from the Spanner table upon table creation
>>>>> CREATE TABLE SPANNER_CARD_INFO
>>>>> OPTIONS (
>>>>>   ProjectId: "gcp-project",
>>>>>   InstanceId: "spanner-instance-id",
>>>>>   Database: "some-database",
>>>>>   Table: "card_info",
>>>>>   CloudResource: "SPANNER",
>>>>>   CreateTableIfNotExists: "FALSE"
>>>>> )
>>>>>
>>>>> // Apply the schema to each record read from Pub/Sub, then apply SQL
>>>>> CREATE TABLE TRANSACTIONS_PUBSUB_TOPIC
>>>>> OPTIONS (
>>>>>   ProjectId: "gcp-project",
>>>>>   Topic: "card-transactions",
>>>>>   CloudResource: "PUBSUB",
>>>>>   SubscriptionId: "subscriptionId-1",
>>>>>   CreateTopicIfNotExists: "FALSE",
>>>>>   CreateSubscriptionIfNotExist: "TRUE",
>>>>>   RecordType: "JSON", // Possible values: Avro, JSON, TSV, etc.
>>>>>   JsonRecordSchema: "{
>>>>>     "CardNumber": "INT",
>>>>>     "Amount": "DOUBLE",
>>>>>     "eventTimeStamp": "EVENT_TIME"
>>>>>   }"
>>>>> )
>>>>>
>>>>> // Create the table in BigQuery if it does not exist, then insert
>>>>> CREATE TABLE TRANSACTION_HISTORY
>>>>> OPTIONS (
>>>>>   ProjectId: "gcp-project",
>>>>>   CloudResource: "BIGQUERY",
>>>>>   Dataset: "dataset1",
>>>>>   Table: "table1",
>>>>>   CreateTableIfNotExists: "TRUE",
>>>>>   TableSchema: "{
>>>>>     "card_number": "INT",
>>>>>     "first_name": "STRING",
>>>>>     "last_name": "STRING",
>>>>>     "phone": "INT",
>>>>>     "city": "STRING",
>>>>>     "amount": "FLOAT",
>>>>>     "eventtimestamp": "INT"
>>>>>   }"
>>>>> )
>>>>>
>>>>> // The actual query, which should be expanded into a Beam DAG
>>>>>
>>>>> INSERT INTO TRANSACTION_HISTORY
>>>>> SELECT pubsub.card_number, spanner.first_name, spanner.last_name,
>>>>>        spanner.phone, spanner.city, pubsub.amount, pubsub.eventTimeStamp
>>>>> FROM TRANSACTIONS_PUBSUB_TOPIC pubsub
>>>>> JOIN SPANNER_CARD_INFO spanner
>>>>>   ON (pubsub.card_number = spanner.card_number);
>>>>>
>>>>>
>>>>>
>>>>> Also consider that if any of the sources or sinks change, we only
>>>>> change the SQL and we're done!
>>>>>
>>>>> Please let me know your thoughts about this.
>>>>>
>>>>> Regards,
>>>>> Taher Koitawala
>>>>>
>>>>>
