rdblue commented on code in PR #8171:
URL: https://github.com/apache/iceberg/pull/8171#discussion_r1282353658


##########
docs/gcs.md:
##########
@@ -0,0 +1,160 @@
+---
+title: "GCS"
+url: gcs
+menu:
+    main:
+        parent: Integrations
+        identifier: gcs_integration
+        weight: 0
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements. See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License. You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Iceberg GCS Integration
+
+Google Cloud Storage (GCS) is a scalable object storage service with high durability, throughput, and availability, designed for large volumes of unstructured data. Apache Iceberg can use GCS as the storage layer for both table data and table metadata, so table operations run directly against files stored in GCS.
+
+## Setting Up Google Cloud Storage (GCS)
+
+### Setting Up a Bucket in GCS
+
+Here's how to create a bucket in GCS:
+
+- **Initialize the Google Cloud CLI**: Run `gcloud init` to set up the Google Cloud CLI and configure the Google Cloud environment on your local machine.
+
+- **Create a Cloud Storage bucket**: Navigate to the Cloud Storage Buckets page in the Google Cloud Console, click "Create bucket", enter your details, and click "Create". A programmatic alternative using the GCS Java client is sketched after this list.
+  
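+If you prefer to create the bucket programmatically, the GCS Java client library can be used instead of the console. The snippet below is a minimal sketch; the project ID and bucket name are placeholders, and it assumes application default credentials are available.
+
+```java
+import com.google.cloud.storage.Bucket;
+import com.google.cloud.storage.BucketInfo;
+import com.google.cloud.storage.Storage;
+import com.google.cloud.storage.StorageOptions;
+
+// Placeholder project and bucket names -- replace with your own values.
+Storage storage = StorageOptions.newBuilder()
+    .setProjectId("my-gcp-project")
+    .build()
+    .getService();
+
+// Creates the bucket with the default location and storage class; bucket names are globally unique.
+Bucket bucket = storage.create(BucketInfo.of("my_bucket"));
+```
+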
+## Configuring Apache Iceberg to Use GCS
+
+Apache Iceberg uses `GCSFileIO` to read and write data in GCS. To configure it:
+
+- **Initialize `GCSFileIO`**: Create a `GCSFileIO` instance by calling its constructor with a supplier of `com.google.cloud.storage.Storage` and a `GCPProperties` instance. The `Storage` client performs the actual reads and writes, while `GCPProperties` holds GCP-specific configuration. A sketch of building the supplier follows the snippet below.
+
+```java
+GCSFileIO gcsFileIO = new GCSFileIO(storageSupplier, gcpProperties);
+```
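+
+The `storageSupplier` above is a serializable supplier that produces the `Storage` client on demand. Below is a minimal sketch of wiring it up, assuming the two-argument constructor shown above, application default credentials, and a placeholder project ID.
+
+```java
+import com.google.cloud.storage.Storage;
+import com.google.cloud.storage.StorageOptions;
+import org.apache.iceberg.gcp.GCPProperties;
+import org.apache.iceberg.gcp.gcs.GCSFileIO;
+import org.apache.iceberg.util.SerializableSupplier;
+
+// The supplier is invoked lazily, so the Storage client is created wherever the FileIO is used.
+SerializableSupplier<Storage> storageSupplier =
+    () -> StorageOptions.newBuilder()
+        .setProjectId("my-gcp-project") // placeholder project ID
+        .build()
+        .getService();
+
+GCPProperties gcpProperties = new GCPProperties();
+GCSFileIO gcsFileIO = new GCSFileIO(storageSupplier, gcpProperties);
+```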
+
+- **Configure `GCSFileIO`**: Once you have the `GCSFileIO` object, configure it by calling its `initialize` method with a map of properties. This map carries GCP-specific settings (the `gcs.`-prefixed properties understood by `GCPProperties`).
+
+```java
+Map<String, String> properties = new HashMap<>();
+properties.put("gcs.project-id", "my-project-id");
+gcsFileIO.initialize(properties);
+```
+
+Here, `gcs.project-id` identifies the GCP project that owns the bucket; replace `"my-project-id"` with your own project ID. Data and metadata locations are not set on `GCSFileIO` itself: full `gs://bucket/path` URIs are passed when creating input and output files (shown next), and the table or warehouse location is configured at the catalog level.
+
+### Example Use of GCSFileIO
+
+Once `GCSFileIO` is initialized and configured, you can interact with the data 
housed on GCS. Below, we will demonstrate how to create and access an 
`InputFile` and an `OutputFile`.
+
+- **Creating an InputFile**: To create an `InputFile` for reading data from 
your GCS bucket, you can use the `newInputFile` method.
+
+```java
+InputFile inputFile = gcsFileIO.newInputFile("gs://my_bucket/data/my_data.parquet");
+```
+
+Replace `"gs://my_bucket/data/my_data.parquet"` with the path of the data you 
want to read.
+
+- **Creating an OutputFile**: To write data to your GCS bucket, create an `OutputFile` using the `newOutputFile` method.
+
+```java
+OutputFile outputFile = gcsFileIO.newOutputFile("gs://my_bucket/data/my_output.parquet");
+```
+
+Again, replace `"gs://my_bucket/data/my_output.parquet"` with the path where 
you'd like to write your data.
+
+These steps will allow you to set up GCS as your storage layer for Apache 
Iceberg and interact with the data stored in GCS using the `GCSFileIO` class.
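+
+To tie the pieces together, here is a minimal sketch of writing and reading raw bytes through these handles, reusing the `gcsFileIO` instance from above. The path is a placeholder, and real tables write Parquet/Avro/ORC files through Iceberg's writers rather than raw bytes.
+
+```java
+import java.nio.charset.StandardCharsets;
+import org.apache.iceberg.io.InputFile;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.io.PositionOutputStream;
+import org.apache.iceberg.io.SeekableInputStream;
+
+// Write a small payload to a placeholder location.
+OutputFile outputFile = gcsFileIO.newOutputFile("gs://my_bucket/data/hello.txt");
+try (PositionOutputStream out = outputFile.createOrOverwrite()) {
+  out.write("hello from GCSFileIO".getBytes(StandardCharsets.UTF_8));
+}
+
+// Read the same object back.
+InputFile inputFile = gcsFileIO.newInputFile("gs://my_bucket/data/hello.txt");
+try (SeekableInputStream in = inputFile.newStream()) {
+  byte[] buffer = new byte[(int) inputFile.getLength()];
+  int read = in.read(buffer); // a single read is sufficient for a tiny payload
+  System.out.println(new String(buffer, 0, read, StandardCharsets.UTF_8));
+}
+```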
+
+## Loading Data into Iceberg Tables
+
+### Add Iceberg to Spark environment
+
+To load data into Iceberg tables using Apache Spark, first add Iceberg to your Spark environment. This can be done with the `--packages` option when starting the Spark shell or Spark SQL:
+
+```bash
+spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.3.1
+spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.3.1
+```
+
+### Configure Spark Catalogs
+
+Catalogs in Iceberg are used to track tables. They can be configured using 
properties under `spark.sql.catalog.(catalog_name)`. Here is an example of how 
to configure a catalog:
+
+```bash
+spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.3.1 \
+    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
+    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
+    --conf spark.sql.catalog.spark_catalog.type=hive \
+    --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
+    --conf spark.sql.catalog.local.type=hadoop \
+    --conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
+    --conf spark.sql.defaultCatalog=local

Review Comment:
   `ResolvingFileIO` should [pick it based on the `gs` scheme](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/io/ResolvingFileIO.java#L51).


