rdblue commented on a change in pull request #1891:
URL: https://github.com/apache/iceberg/pull/1891#discussion_r547556329
##########
File path: site/docs/aws.md
##########
@@ -0,0 +1,170 @@
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements. See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License. You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Iceberg AWS Integrations
+
+Iceberg provides integration with different AWS services through the `iceberg-aws` module.
+This section describes how to use Iceberg with AWS.
+
+## Runtime Packages
+
+The `iceberg-aws` module is bundled with the Spark and Flink engine runtimes.
+However, the AWS clients are not bundled, so that you can use the same client version as your application.
+Please note that Iceberg uses the new AWS v2 SDK instead of v1.
+You can choose to use the [AWS SDK bundle](https://mvnrepository.com/artifact/software.amazon.awssdk/bundle),
+or individual AWS client packages (Glue, S3, DynamoDB, KMS) if you would like to have a minimal dependency footprint.
+
+For example, to use AWS features with Spark 3 and AWS clients version 2.15.40, you can start the SQL shell with:
+
+```sh
+spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.0,software.amazon.awssdk:bundle:2.15.40 \
+    --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
+    --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
+    --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my-key-prefix \
+    --conf spark.sql.catalog.my_catalog.gluecatalog.lock.table=myGlueLockTable
+```
+
+## Glue Catalog
+
+Iceberg enables the use of [AWS Glue](https://aws.amazon.com/glue) as the `Catalog` implementation.
+When used, an Iceberg `Namespace` is stored as a [Glue Database](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-databases.html),
+an Iceberg `Table` is stored as a [Glue Table](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html),
+and an Iceberg `Snapshot` is stored as a [Glue TableVersion](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-TableVersion).
+You can start using the Glue catalog by specifying `catalog-impl` as `org.apache.iceberg.aws.glue.GlueCatalog`.
+More details about loading the catalog can be found on the individual engine pages, such as [Spark](../spark/#loading-a-custom-catalog) and [Flink](../flink/#creating-catalogs-and-using-catalogs).
+
+### Glue Catalog ID
+
+You can specify the Glue catalog ID through the `gluecatalog.id` catalog property to point to a Glue catalog in a different AWS account.
+The Glue catalog ID is your numeric AWS account ID.
+If the Glue catalog is in a different region, you should configure your AWS client to point to the correct region;
+see more details in [AWS client configuration](#aws-client-configurations).
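+
+For example, to point the Spark catalog from the earlier example at a Glue catalog owned by another AWS account,
+you could add the `gluecatalog.id` property (a minimal sketch reusing the runtime example above; the account ID `123456789012` is a placeholder):
+
+```sh
+# Sketch only: 123456789012 is a placeholder AWS account ID.
+spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.0,software.amazon.awssdk:bundle:2.15.40 \
+    --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
+    --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
+    --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my-key-prefix \
+    --conf spark.sql.catalog.my_catalog.gluecatalog.id=123456789012 \
+    --conf spark.sql.catalog.my_catalog.gluecatalog.lock.table=myGlueLockTable
+```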
+
+### Skip Archive
+
+By default, Glue stores all the table versions created, and a user can roll back a table to any historical version if needed.
+However, if you are streaming data to Iceberg, this will easily create a lot of Glue table versions.
+Therefore, it is recommended to turn off the archive feature in Glue by setting `gluecatalog.skip-archive` to true.
+For more details, please read [Glue Quotas](https://docs.aws.amazon.com/general/latest/gr/glue.html) and the [UpdateTable API](https://docs.aws.amazon.com/glue/latest/webapi/API_UpdateTable.html).
+
+### DynamoDB for locking Glue tables
+
+Glue does not have a strong guarantee over concurrent updates to a table.
+Although it throws `ConcurrentModificationException` when it detects two processes updating a table at the same time,
+there is no guarantee that one update would not clobber the other update.
+Therefore, a DynamoDB lock is enabled by default for Glue, so that for every commit,
+`GlueCatalog` first obtains a lock using a helper DynamoDB table and then tries to safely modify the Glue table.
+Users must specify a table name through the catalog property `gluecatalog.lock.table` as the helper DynamoDB lock table to use.
+It is recommended to use the same DynamoDB table for operations in the same Glue catalog,
+and to use a different table for a Glue catalog in another account or region.
+If the lock table with the given name does not exist in DynamoDB, a new table is created with billing mode set as [Pay-per-Request](https://aws.amazon.com/blogs/aws/amazon-dynamodb-on-demand-no-capacity-planning-and-pay-per-request-pricing).
+The lock has the following additional properties:
+
+* `gluecatalog.lock.wait-ms`: max time to wait for lock acquisition, defaults to 3 minutes
+* `gluecatalog.lock.expire-ms`: max time a table can be locked by a process, defaults to 20 minutes
+
+If your use case only consists of single-process, low-frequency (e.g. hourly, daily) updates to a table,
+you can also turn off this locking feature by setting `gluecatalog.lock.enabled` to false.
+
+### Warehouse Location
+
+By default, Glue uses `S3FileIO` and only allows a warehouse location in S3.
+To store data in a different local or cloud store, the Glue catalog can be switched to use `HadoopFileIO`
+or any custom FileIO using the mechanism described in the [custom FileIO](../custom-catalog/#custom-file-io-implementation) section.
+
+## S3 FileIO
+
+Iceberg allows users to write data to S3 through `S3FileIO`.
+`GlueCatalog` uses this FileIO by default, and other catalogs can load it using the `io-impl` catalog property.
+
+### Progressive Multipart Upload
+
+`S3FileIO` implements a customized progressive multipart upload algorithm to upload data.
+Data files are uploaded in parts in parallel as soon as each part is ready,
+and each file part is deleted as soon as its upload completes.
+This provides maximized upload speed and minimized local disk usage during uploads.
+Here are the configurations users can tune for this feature (an example follows the list):
+
+* `s3fileio.multipart.num-threads`: number of threads to use for uploading parts to S3 (shared across all output streams), defaults to the available number of processors in the system
+* `s3fileio.multipart.part.size`: the size of a single part for multipart upload requests, defaults to 32MB
+* `s3fileio.multipart.threshold`: the threshold expressed as a factor times the multipart size at which to switch from uploading using a single put object request to uploading using multipart upload, defaults to 1.5
+* `s3fileio.staging.dir`: the directory to hold temporary files, defaults to Java's `java.io.tmpdir` property value
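+
+For example, assuming these options are supplied as catalog properties in the same way as the Glue options above,
+a Spark 3 sketch tuning the upload thread pool and staging directory could look like the following
+(the values are illustrative, not recommendations; the remaining `s3fileio.*` properties follow the same `--conf` pattern):
+
+```sh
+# Sketch only: the thread count and staging directory values are illustrative.
+spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.0,software.amazon.awssdk:bundle:2.15.40 \
+    --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
+    --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
+    --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my-key-prefix \
+    --conf spark.sql.catalog.my_catalog.gluecatalog.lock.table=myGlueLockTable \
+    --conf spark.sql.catalog.my_catalog.s3fileio.multipart.num-threads=16 \
+    --conf spark.sql.catalog.my_catalog.s3fileio.staging.dir=/tmp/iceberg-s3-uploads
+```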

Review comment:
       Great info here, but I think it may be easier to maintain as a table.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use
the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
