This is an automated email from the ASF dual-hosted git repository. mergebot-role pushed a commit to branch mergebot in repository https://gitbox.apache.org/repos/asf/beam-site.git
commit 2486614a3f8422fa26e32b9678521fc562a12325
Author: Andrew Pilloud <[email protected]>
AuthorDate: Thu Jun 28 14:00:55 2018 -0700

    Update CREATE TABLE from doc
---
 .../dsls/sql/statements/create-table.md | 336 ++++++++++++++++++++-
 1 file changed, 324 insertions(+), 12 deletions(-)

diff --git a/src/documentation/dsls/sql/statements/create-table.md b/src/documentation/dsls/sql/statements/create-table.md
index 40d9495..67fb74d 100644
--- a/src/documentation/dsls/sql/statements/create-table.md
+++ b/src/documentation/dsls/sql/statements/create-table.md
@@ -20,19 +20,331 @@ limitations under the License.
 
 # CREATE TABLE
 
-Beam is not a storage system but reads and writes from other storage systems.
-You register those systems with a `CREATE TABLE` statement that includes a schema
-as well as a number of extended clauses:
+Beam SQL's `CREATE TABLE` statement registers a virtual table that maps to an
+[external storage system](https://beam.apache.org/documentation/io/built-in/).
+For some storage systems, `CREATE TABLE` does not create a physical table until
+a write occurs. After the physical table exists, you can access the table with
+the `SELECT`, `JOIN`, and `INSERT INTO` statements.
 
- - `TYPE` to indicate what
- - `LOCATION` to specify a URL or otherwise indicate where the data is
- - `TBLPROPERTIES` to configure the endpoint
+The `CREATE TABLE` statement includes a schema and extended clauses.
 
-Once a table is registered, it may be read-only or it may support both read and
-write access.
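As an illustrative sketch of that lifecycle, using the `text` type documented later on this page (the table, column, and path names are invented, and the `INSERT INTO`/`SELECT` statements are only shown schematically):

```
CREATE TABLE users (id INTEGER, username VARCHAR)
TYPE text
LOCATION '/tmp/users'

INSERT INTO users SELECT id, username FROM signups

SELECT username FROM users
```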
+## Syntax
 
-Currently there are a few experimental connectors available, and the reference
-for them is their Javadoc:
+```
+CREATE TABLE [ IF NOT EXISTS ] tableName (tableElement [, tableElement ]*)
+TYPE type
+[LOCATION location]
+[TBLPROPERTIES tblProperties]
 
- - [Kafka]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/extensions/sql/meta/provider/kafka/KafkaTableProvider.html)
- - [Text]({{ site.baseurl }}/documentation/sdks/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/extensions/sql/meta/provider/text/TextTableProvider.html)
+simpleType: TINYINT | SMALLINT | INTEGER | BIGINT | FLOAT | DOUBLE | DECIMAL | BOOLEAN | DATE | TIME | TIMESTAMP | CHAR | VARCHAR
+
+fieldType: simpleType | MAP<simpleType, fieldType> | ARRAY<fieldType> | ROW<tableElement [, tableElement ]*>
+
+tableElement: columnName fieldType [ NOT NULL ]
+```
+
+* `IF NOT EXISTS`: Optional. If the table is already registered, Beam SQL
+  ignores the statement instead of returning an error.
+* `tableName`: The case-sensitive name of the table to create and register,
+  specified as an
+  [Identifier](https://beam.apache.org/documentation/dsls/sql/lexical/#identifiers).
+  The table name does not need to match the name in the underlying data
+  storage system.
+* `tableElement`: `columnName` `fieldType` `[ NOT NULL ]`
+  * `columnName`: The case-sensitive name of the column, specified as a
+    backtick_quoted_expression.
+  * `fieldType`: The field's type, specified as one of the following types:
+    * `simpleType`: `TINYINT`, `SMALLINT`, `INTEGER`, `BIGINT`, `FLOAT`,
+      `DOUBLE`, `DECIMAL`, `BOOLEAN`, `DATE`, `TIME`, `TIMESTAMP`, `CHAR`,
+      `VARCHAR`
+    * `MAP<simpleType, fieldType>`
+    * `ARRAY<fieldType>`
+    * `ROW<tableElement [, tableElement ]*>`
+  * `NOT NULL`: Optional. Indicates that the column is not nullable.
+* `type`: The I/O transform that backs the virtual table, specified as an
+  [Identifier](https://beam.apache.org/documentation/dsls/sql/lexical/#identifiers)
+  with one of the following values:
+  * `bigquery`
+  * `pubsub`
+  * `kafka`
+  * `text`
+* `location`: The I/O-specific location of the underlying table, specified as
+  a [String
+  Literal](https://beam.apache.org/documentation/dsls/sql/lexical/#string-literals).
+  See the I/O-specific sections for `location` format requirements.
+* `tblProperties`: The I/O-specific quoted key-value JSON object with extra
+  configuration, specified as a [String
+  Literal](https://beam.apache.org/documentation/dsls/sql/lexical/#string-literals).
+  See the I/O-specific sections for `tblProperties` format requirements.
+
+## BigQuery
+
+### Syntax
+
+```
+CREATE TABLE [ IF NOT EXISTS ] tableName (tableElement [, tableElement ]*)
+TYPE bigquery
+LOCATION '[PROJECT_ID]:[DATASET].[TABLE]'
+```
+
+* `LOCATION`: Location of the table in the BigQuery CLI format.
+  * `PROJECT_ID`: ID of the Google Cloud Project
+  * `DATASET`: BigQuery Dataset ID
+  * `TABLE`: BigQuery Table ID within the Dataset
+
+### Read Mode
+
+Not supported. BigQueryIO is currently limited to write access only in Beam
+SQL.
+
+### Write Mode
+
+If the table does not exist, Beam creates the table specified in `location`
+when the first record is written. If the table does exist, the specified
+columns must match the existing table.
+
+### Schema
+
+Schema-related errors will cause the pipeline to crash. The Map type is not
+supported.
+Beam SQL types map to [BigQuery Standard SQL
+types](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types)
+as follows:
+
+<table>
+  <tr>
+   <td>Beam SQL Type
+   </td>
+   <td>BigQuery Standard SQL Type
+   </td>
+  </tr>
+  <tr>
+   <td>TINYINT, SMALLINT, INTEGER, BIGINT
+   </td>
+   <td>INT64
+   </td>
+  </tr>
+  <tr>
+   <td>FLOAT, DOUBLE, DECIMAL
+   </td>
+   <td>FLOAT64
+   </td>
+  </tr>
+  <tr>
+   <td>BOOLEAN
+   </td>
+   <td>BOOL
+   </td>
+  </tr>
+  <tr>
+   <td>DATE
+   </td>
+   <td>DATE
+   </td>
+  </tr>
+  <tr>
+   <td>TIME
+   </td>
+   <td>TIME
+   </td>
+  </tr>
+  <tr>
+   <td>TIMESTAMP
+   </td>
+   <td>TIMESTAMP
+   </td>
+  </tr>
+  <tr>
+   <td>CHAR, VARCHAR
+   </td>
+   <td>STRING
+   </td>
+  </tr>
+  <tr>
+   <td>MAP
+   </td>
+   <td>(not supported)
+   </td>
+  </tr>
+  <tr>
+   <td>ARRAY
+   </td>
+   <td>ARRAY
+   </td>
+  </tr>
+  <tr>
+   <td>ROW
+   </td>
+   <td>STRUCT
+   </td>
+  </tr>
+</table>
+
+### Example
+
+```
+CREATE TABLE users (id INTEGER, username VARCHAR)
+TYPE bigquery
+LOCATION 'testing-integration:apache.users'
+```
+
+## Pub/Sub
+
+### Syntax
+
+```
+CREATE TABLE [ IF NOT EXISTS ] tableName
+  (
+    event_timestamp TIMESTAMP,
+    attributes MAP<VARCHAR, VARCHAR>,
+    payload ROW<tableElement [, tableElement ]*>
+  )
+TYPE pubsub
+LOCATION 'projects/[PROJECT]/topics/[TOPIC]'
+TBLPROPERTIES '{"timestampAttributeKey": "key", "deadLetterQueue": "projects/[PROJECT]/topics/[TOPIC]"}'
+```
+
+* `event_timestamp`: The event timestamp associated with the Pub/Sub message
+  by PubsubIO. It can be one of the following:
+  * Message publish time, which is provided by Pub/Sub. This is the default
+    value if no extra configuration is provided.
+  * A timestamp specified in one of the user-provided message attributes.
+    The attribute key is configured by the `timestampAttributeKey` field of
+    the `tblProperties` blob.
+    The value of the attribute should conform to the [requirements of
+    PubsubIO](https://beam.apache.org/documentation/sdks/javadoc/2.4.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.Read.html#withTimestampAttribute-java.lang.String-),
+    which is either millis since the Unix epoch or an
+    [RFC 3339](https://www.ietf.org/rfc/rfc3339.txt) date string.
+* `attributes`: The user-provided attributes map from the Pub/Sub message.
+* `payload`: The schema of the JSON payload of the Pub/Sub message. No other
+  payload formats are currently supported by Beam SQL. If a record can't be
+  unmarshalled, the record is written to the topic specified in the
+  `deadLetterQueue` field of the `tblProperties` blob. If no dead-letter queue
+  is specified in this case, an exception is thrown and the pipeline will
+  crash.
+* `LOCATION`:
+  * `PROJECT`: ID of the Google Cloud Project
+  * `TOPIC`: The Pub/Sub topic name. A subscription is created
+    automatically, but the subscription is not cleaned up automatically.
+    Specifying an existing subscription is not supported.
+* `TBLPROPERTIES`:
+  * `timestampAttributeKey`: Optional. The key which contains the event
+    timestamp associated with the Pub/Sub message. If not specified, the
+    message publish timestamp is used as the event timestamp for
+    windowing/watermarking.
+  * `deadLetterQueue`: The topic into which messages are written if the
+    payload was not parsed. If not specified, an exception is thrown for
+    parsing failures.
+
+### Read Mode
+
+Supported. PubsubIO is currently limited to read access only in Beam SQL.
+
+### Write Mode
+
+Not supported. PubsubIO is currently limited to read access only in Beam SQL.
+
+### Schema
+
+Pub/Sub messages have metadata associated with them, and you can reference this
+metadata in your queries. For each message, Pub/Sub exposes its publish time and
+a map of user-provided attributes in addition to the payload (unstructured in
+the general case). This information must be preserved and accessible from the
+SQL statements.
+Currently, this means that PubsubIO tables require you to declare a special
+set of columns, as shown below.
+
+### Supported Payload
+
+* JSON Objects
+  * Beam only supports querying messages with a payload containing JSON
+    objects. Beam attempts to parse the JSON to match the schema of the
+    `payload` field.
+
+### Example
+
+```
+CREATE TABLE locations (event_timestamp TIMESTAMP, attributes MAP<VARCHAR, VARCHAR>, payload ROW<id INTEGER, location VARCHAR>)
+TYPE pubsub
+LOCATION 'projects/testing-integration/topics/user-location'
+```
+
+## Kafka
+
+KafkaIO is experimental in Beam SQL.
+
+### Syntax
+
+```
+CREATE TABLE [ IF NOT EXISTS ] tableName (tableElement [, tableElement ]*)
+TYPE kafka
+LOCATION 'kafka://localhost:2181/brokers'
+TBLPROPERTIES '{"bootstrap.servers":"localhost:9092", "topics": ["topic1", "topic2"]}'
+```
+
+* `LOCATION`: The Kafka topic URL.
+* `TBLPROPERTIES`:
+  * `bootstrap.servers`: Optional. Allows you to specify the bootstrap
+    server.
+  * `topics`: Optional. Allows you to specify specific topics.
+
+### Read Mode
+
+Read Mode supports reading from a topic.
+
+### Write Mode
+
+Write Mode supports writing to a topic.
+
+### Supported Payload
+
+* CSV
+  * Beam parses the messages, attempting to parse fields according to the
+    types specified in the schema.
+
+### Schema
+
+Only simple types are supported.
+
+## Text
+
+TextIO is experimental in Beam SQL. Read Mode and Write Mode do not currently
+access the same underlying data.
+
+### Syntax
+
+```
+CREATE TABLE [ IF NOT EXISTS ] tableName (tableElement [, tableElement ]*)
+TYPE text
+LOCATION '/home/admin/orders'
+TBLPROPERTIES '{"format": "Excel"}'
+```
+
+* `LOCATION`: The path to the file for Read Mode. The prefix for Write Mode.
+* `TBLPROPERTIES`:
+  * `format`: Optional. Allows you to specify the
+    [CSVFormat](https://commons.apache.org/proper/commons-csv/archives/1.5/apidocs/org/apache/commons/csv/CSVFormat.Predefined.html).
+
+### Read Mode
+
+Read Mode supports reading from a file.
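For example, a Read Mode registration that sets the optional `format` property might look like this (a sketch that reuses the path from the syntax above; `Excel` is one of the predefined `CSVFormat` names):

```
CREATE TABLE orders (id INTEGER, price INTEGER)
TYPE text
LOCATION '/home/admin/orders'
TBLPROPERTIES '{"format": "Excel"}'
```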
+
+### Write Mode
+
+Write Mode supports writing to a set of files. TextIO creates the files on
+write.
+
+### Supported Payload
+
+* CSV
+  * Beam parses the messages, attempting to parse fields according to the
+    types specified in the schema, using `org.apache.commons.csv`.
+
+### Schema
+
+Only simple types are supported.
+
+### Example
+
+```
+CREATE TABLE orders (id INTEGER, price INTEGER)
+TYPE text
+LOCATION '/home/admin/orders'
+```
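For comparison, a Pub/Sub registration that exercises both optional `tblProperties` keys from the Pub/Sub section could be sketched as follows (the attribute key `ts` and the dead-letter topic name are invented for illustration):

```
CREATE TABLE locations (
    event_timestamp TIMESTAMP,
    attributes MAP<VARCHAR, VARCHAR>,
    payload ROW<id INTEGER, location VARCHAR>)
TYPE pubsub
LOCATION 'projects/testing-integration/topics/user-location'
TBLPROPERTIES '{"timestampAttributeKey": "ts", "deadLetterQueue": "projects/testing-integration/topics/user-location-deadletter"}'
```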