rdsr commented on a change in pull request #1172:
URL: https://github.com/apache/iceberg/pull/1172#discussion_r451772596



##########
File path: site/docs/spark.md
##########
@@ -19,42 +19,272 @@
 
 Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog implementations. Spark DSv2 is an evolving API with different levels of support in Spark versions.
 
-| Feature support                              | Spark 2.4 | Spark 3.0 (unreleased) | Notes                                          |
-|----------------------------------------------|-----------|------------------------|------------------------------------------------|
-| [DataFrame reads](#reading-an-iceberg-table) | ✔️        | ✔️                     |                                                |
-| [DataFrame append](#appending-data)          | ✔️        | ✔️                     |                                                |
-| [DataFrame overwrite](#overwriting-data)     | ✔️        | ✔️                     | Overwrite mode replaces partitions dynamically |
-| [Metadata tables](#inspecting-tables)        | ✔️        | ✔️                     |                                                |
-| SQL create table                             |           | ✔️                     |                                                |
-| SQL alter table                              |           | ✔️                     |                                                |
-| SQL drop table                               |           | ✔️                     |                                                |
-| SQL select                                   |           | ✔️                     |                                                |
-| SQL create table as                          |           | ✔️                     |                                                |
-| SQL replace table as                         |           | ✔️                     |                                                |
-| SQL insert into                              |           | ✔️                     |                                                |
-| SQL insert overwrite                         |           | ✔️                     |                                                |
+| Feature support                                  | Spark 3.0 | Spark 2.4 | Notes                                          |
+|--------------------------------------------------|-----------|-----------|------------------------------------------------|
+| [SQL create table](#create-table)                | ✔️        |           |                                                |
+| [SQL create table as](#create-table-as-select)   | ✔️        |           |                                                |
+| [SQL replace table as](#replace-table-as-select) | ✔️        |           |                                                |
+| [SQL alter table](#alter-table)                  | ✔️        |           |                                                |
+| [SQL drop table](#drop-table)                    | ✔️        |           |                                                |
+| [SQL select](#querying-with-sql)                 | ✔️        |           |                                                |
+| [SQL insert into](#insert-into)                  | ✔️        |           |                                                |
+| [SQL insert overwrite](#insert-overwrite)        | ✔️        |           |                                                |
+| [DataFrame reads](#querying-with-dataframes)     | ✔️        | ✔️        |                                                |
+| [DataFrame append](#appending-data)              | ✔️        | ✔️        |                                                |
+| [DataFrame overwrite](#overwriting-data)         | ✔️        | ✔️        | ⚠ Behavior changed in Spark 3.0                |
+| [DataFrame CTAS and RTAS](#creating-tables)      | ✔️        |           |                                                |
+| [Metadata tables](#inspecting-tables)            | ✔️        | ✔️        |                                                |
+
+## Configuring catalogs
+
+Spark 3.0 adds an API to plug in table catalogs that are used to load, create, and manage Iceberg tables. Spark catalogs are configured by setting Spark properties under `spark.sql.catalog`.
+
+This creates an Iceberg catalog named `hive_prod` that loads tables from a Hive metastore:
+
+```plain
+spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
+spark.sql.catalog.hive_prod.type = hive
+spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
+```
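+
+The same properties can also be supplied when launching a session. For example, a sketch using `spark-shell` (the `--conf` flags mirror the properties above; any Spark entry point that accepts `--conf` works the same way):
+
+```sh
+spark-shell --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog \
+            --conf spark.sql.catalog.hive_prod.type=hive \
+            --conf spark.sql.catalog.hive_prod.uri=thrift://metastore-host:port
+```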
+
+Iceberg also supports a directory-based catalog in HDFS that can be configured using `type=hadoop`:
+
+```plain
+spark.sql.catalog.hadoop_prod = org.apache.iceberg.spark.SparkCatalog
+spark.sql.catalog.hadoop_prod.type = hive
+spark.sql.catalog.hadoop_prod.warehouse = hdfs://nn:8020/warehouse/path
+```
+
+!!! Note
+    The Hive-based catalog only loads Iceberg tables. To load non-Iceberg tables in the same Hive metastore, use a [session catalog](#replacing-the-session-catalog).
+
+### Using catalogs
+
+Catalog names are used in SQL queries to identify a table. In the examples above, `hive_prod` and `hadoop_prod` can be used to prefix database and table names that will be loaded from those catalogs.
+
+```sql
+SELECT * FROM hive_prod.db.table -- load db.table from catalog hive_prod
+```
+
+Spark 3 keeps track of a current catalog and namespace, which can be omitted from table names.
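+
+For example, assuming the `hive_prod` catalog above, `USE` can set the current catalog and namespace so that later queries can use unqualified names (a sketch; exact resolution rules depend on the Spark version):
+
+```sql
+USE hive_prod.db
+SELECT * FROM table -- resolves to hive_prod.db.table
+```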

Review comment:
       nit: should this be  "... track of *the* current catalog..."

##########
File path: site/docs/spark.md
##########
@@ -19,42 +19,272 @@
 
 Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog implementations. Spark DSv2 is an evolving API with different levels of support in Spark versions.
 
-| Feature support                              | Spark 2.4 | Spark 3.0 (unreleased) | Notes                                          |
-|----------------------------------------------|-----------|------------------------|------------------------------------------------|
-| [DataFrame reads](#reading-an-iceberg-table) | ✔️        | ✔️                     |                                                |
-| [DataFrame append](#appending-data)          | ✔️        | ✔️                     |                                                |
-| [DataFrame overwrite](#overwriting-data)     | ✔️        | ✔️                     | Overwrite mode replaces partitions dynamically |
-| [Metadata tables](#inspecting-tables)        | ✔️        | ✔️                     |                                                |
-| SQL create table                             |           | ✔️                     |                                                |
-| SQL alter table                              |           | ✔️                     |                                                |
-| SQL drop table                               |           | ✔️                     |                                                |
-| SQL select                                   |           | ✔️                     |                                                |
-| SQL create table as                          |           | ✔️                     |                                                |
-| SQL replace table as                         |           | ✔️                     |                                                |
-| SQL insert into                              |           | ✔️                     |                                                |
-| SQL insert overwrite                         |           | ✔️                     |                                                |
+| Feature support                                  | Spark 3.0 | Spark 2.4 | Notes                                          |
+|--------------------------------------------------|-----------|-----------|------------------------------------------------|
+| [SQL create table](#create-table)                | ✔️        |           |                                                |
+| [SQL create table as](#create-table-as-select)   | ✔️        |           |                                                |
+| [SQL replace table as](#replace-table-as-select) | ✔️        |           |                                                |
+| [SQL alter table](#alter-table)                  | ✔️        |           |                                                |
+| [SQL drop table](#drop-table)                    | ✔️        |           |                                                |
+| [SQL select](#querying-with-sql)                 | ✔️        |           |                                                |
+| [SQL insert into](#insert-into)                  | ✔️        |           |                                                |
+| [SQL insert overwrite](#insert-overwrite)        | ✔️        |           |                                                |
+| [DataFrame reads](#querying-with-dataframes)     | ✔️        | ✔️        |                                                |
+| [DataFrame append](#appending-data)              | ✔️        | ✔️        |                                                |
+| [DataFrame overwrite](#overwriting-data)         | ✔️        | ✔️        | ⚠ Behavior changed in Spark 3.0                |
+| [DataFrame CTAS and RTAS](#creating-tables)      | ✔️        |           |                                                |
+| [Metadata tables](#inspecting-tables)            | ✔️        | ✔️        |                                                |
+
+## Configuring catalogs
+
+Spark 3.0 adds an API to plug in table catalogs that are used to load, create, and manage Iceberg tables. Spark catalogs are configured by setting Spark properties under `spark.sql.catalog`.
+
+This creates an Iceberg catalog named `hive_prod` that loads tables from a Hive metastore:
+
+```plain
+spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
+spark.sql.catalog.hive_prod.type = hive
+spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
+```
+
+Iceberg also supports a directory-based catalog in HDFS that can be configured using `type=hadoop`:
+
+```plain
+spark.sql.catalog.hadoop_prod = org.apache.iceberg.spark.SparkCatalog
+spark.sql.catalog.hadoop_prod.type = hive

Review comment:
       Did you mean `spark.sql.catalog.hadoop_prod.type = hadoop`

##########
File path: site/docs/spark.md
##########
@@ -91,50 +332,188 @@ spark.sql("""select count(1) from table""").show()
 ```
 
 
+## Writing with SQL
+
+Spark 3 supports SQL `INSERT INTO` and `INSERT OVERWRITE`, as well as the new `DataFrameWriterV2` API.
+
+### `INSERT INTO`
+
+To append new data to a table, use `INSERT INTO`.
+
+```sql
+INSERT INTO prod.db.table VALUES (1, 'a'), (2, 'b')
+```
+```sql
+INSERT INTO prod.db.table SELECT ...
+```
+
+### `INSERT OVERWRITE`
+
+To replace data in the table with the result of a query, use `INSERT OVERWRITE`. Overwrites are atomic operations for Iceberg tables.
+
+The partitions that will be replaced by `INSERT OVERWRITE` depend on Spark's partition overwrite mode and the partitioning of the table.
+
+#### Overwrite behavior
+
+Spark's default overwrite mode is **static**, but **dynamic overwrite mode is recommended when writing to Iceberg tables.** Static overwrite mode determines which partitions to overwrite in a table by converting the `PARTITION` clause to a filter, but the `PARTITION` clause can only reference table columns.
+
+Dynamic overwrite mode is configured by setting `spark.sql.sources.partitionOverwriteMode=dynamic`.
+
+To demonstrate the behavior of dynamic and static overwrites, consider a `logs` table defined by the following DDL:
+
+```sql
+CREATE TABLE prod.my_app.logs (
+    uuid string NOT NULL,
+    level string NOT NULL,
+    ts timestamp NOT NULL,
+    message string)
+USING iceberg
+PARTITIONED BY (level, hours(ts))
+```
+
+#### Dynamic overwrite
+
+When Spark's overwrite mode is dynamic, partitions that have rows produced by the `SELECT` query will be replaced.
+
+For example, this query removes duplicate log events from the example `logs` table.
+
+```sql
+INSERT OVERWRITE prod.my_app.logs
+SELECT uuid, first(level), first(ts), first(message)
+FROM prod.my_app.logs
+WHERE cast(ts as date) = '2020-07-01'
+GROUP BY uuid
+```
+
+In dynamic mode, this will replace any partition with rows in the `SELECT` result. Because the date of all rows is restricted 1 July, only hours of that day will be replaced.
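+
+In static mode, by contrast, the partitions to replace must be named in a `PARTITION` clause. A hypothetical static version of the deduplication above, restricted to one `level` value (note `hours(ts)` is a hidden partition and cannot appear in the clause), might look like:
+
+```sql
+INSERT OVERWRITE prod.my_app.logs
+PARTITION (level = 'INFO')
+SELECT uuid, first(level), first(ts), first(message)
+FROM prod.my_app.logs
+WHERE level = 'INFO'
+GROUP BY uuid
+```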

Review comment:
       nit: `restricted to 1st July`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


