This is an automated email from the ASF dual-hosted git repository.
blue pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/iceberg.git
The following commit(s) were added to refs/heads/master by this push:
new 9af545e Docs: Add docs for dynamic class loading (#1737)
9af545e is described below
commit 9af545ed56343b2fa09966166f8da3e7a24100d7
Author: jackye1995 <[email protected]>
AuthorDate: Fri Nov 6 15:28:22 2020 -0800
Docs: Add docs for dynamic class loading (#1737)
Co-authored-by: Jack Ye <[email protected]>
---
site/docs/configuration.md | 21 +++++++-
site/docs/custom-catalog.md | 117 +++++++++++++++++++++++++++++++++++++++++++-
site/docs/flink.md | 17 +++++++
site/docs/spark.md | 10 ++++
4 files changed, 162 insertions(+), 3 deletions(-)
diff --git a/site/docs/configuration.md b/site/docs/configuration.md
index f0aaa9b..7805db9 100644
--- a/site/docs/configuration.md
+++ b/site/docs/configuration.md
@@ -17,6 +17,24 @@
# Configuration
+## Catalog properties
+
+Iceberg catalogs support catalog properties to configure catalog behaviors. Here is a list of commonly used catalog properties:
+
+| Property     | Default | Description                                            |
+| ------------ | ------- | ------------------------------------------------------ |
+| catalog-impl | null    | a custom `Catalog` implementation to use by an engine  |
+| io-impl      | null    | a custom `FileIO` implementation to use in a catalog   |
+| warehouse    | null    | the root path of the data warehouse                    |
+| uri          | null    | (Hive catalog only) the Hive metastore URI             |
+| clients      | 2       | (Hive catalog only) the Hive client pool size          |
+
+`HadoopCatalog` and `HiveCatalog` can access the properties in their constructors.
+Any other custom catalog can access the properties by implementing `Catalog.initialize(catalogName, catalogProperties)`.
+The properties can be manually constructed or passed in from a compute engine like Spark or Flink.
+Spark uses its session properties as catalog properties; see more details in the [Spark configuration](#spark-configuration) section.
+Flink passes in catalog properties through the `CREATE CATALOG` statement; see more details in the [Flink](../flink/#creating-catalogs-and-using-catalogs) section.
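+
+For example, a minimal sketch of constructing the properties manually and initializing a custom catalog (`CustomCatalog` here is a hypothetical implementation with a no-arg constructor):
+
+```java
+Map<String, String> properties = new HashMap<>();
+properties.put("warehouse", "hdfs://nn:8020/warehouse/path");
+
+Catalog catalog = new CustomCatalog();  // requires a no-arg constructor
+catalog.initialize("my_catalog", properties);
+```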
+
## Table properties
Iceberg tables support table properties to configure table behavior, like the default split size for readers.
@@ -95,7 +113,8 @@ Both catalogs are configured using properties nested under the catalog name:
| Property                                            | Values                        | Description                                                           |
| ---------------------------------------------------- | ----------------------------- | --------------------------------------------------------------------- |
-| spark.sql.catalog._catalog-name_.type                | hive or hadoop                | The underlying Iceberg catalog implementation                          |
+| spark.sql.catalog._catalog-name_.type                | `hive` or `hadoop`            | The underlying Iceberg catalog implementation, `HiveCatalog` or `HadoopCatalog` |
+| spark.sql.catalog._catalog-name_.catalog-impl        |                               | The underlying Iceberg catalog implementation. When set, the value of the `type` property is ignored |
| spark.sql.catalog._catalog-name_.default-namespace   | default                       | The default current namespace for the catalog                          |
| spark.sql.catalog._catalog-name_.uri                 | thrift://host:port            | URI for the Hive Metastore; default from `hive-site.xml` (Hive only)   |
| spark.sql.catalog._catalog-name_.warehouse           | hdfs://nn:8020/warehouse/path | Base path for the warehouse directory (Hadoop only)                    |
diff --git a/site/docs/custom-catalog.md b/site/docs/custom-catalog.md
index aad67d7..0ccf842 100644
--- a/site/docs/custom-catalog.md
+++ b/site/docs/custom-catalog.md
@@ -20,7 +20,9 @@
It's possible to read an Iceberg table either from an HDFS path or from a Hive table. It's also possible to use a custom metastore in place of Hive. The steps to do that are as follows.
- [Custom TableOperations](#custom-table-operations-implementation)
-- [Custom Catalog](#custom-table-implementation)
+- [Custom Catalog](#custom-catalog-implementation)
+- [Custom FileIO](#custom-file-io-implementation)
+- [Custom LocationProvider](#custom-location-provider-implementation)
- [Custom IcebergSource](#custom-icebergsource)
### Custom table operations implementation
@@ -76,7 +78,10 @@ class CustomTableOperations extends BaseMetastoreTableOperations {
}
```
-### Custom table implementation
+A `TableOperations` instance is usually obtained by calling `Catalog.newTableOps(TableIdentifier)`.
+See the next section about implementing and loading a custom catalog.
+
+### Custom catalog implementation
Extend `BaseMetastoreCatalog` to provide default warehouse locations and instantiate `CustomTableOperations`
Example:
@@ -85,6 +90,11 @@ public class CustomCatalog extends BaseMetastoreCatalog {
private Configuration configuration;
+ // must have a no-arg constructor to be dynamically loaded
+ // initialize(String name, Map<String, String> properties) will be called to complete initialization
+ public CustomCatalog() {
+ }
+
public CustomCatalog(Configuration configuration) {
this.configuration = configuration;
}
@@ -127,8 +137,111 @@ public class CustomCatalog extends BaseMetastoreCatalog {
// Example service to rename table
CustomService.renameTable(from.namespace().level(0), from.name(), to.name());
}
+
+ // implement this method to read catalog name and properties during initialization
+ public void initialize(String name, Map<String, String> properties) {
+ }
}
```
+
+Catalog implementations can be dynamically loaded in most compute engines.
+For Spark and Flink, you can specify the `catalog-impl` catalog property to load it.
+Read the [Configuration](../configuration/#catalog-properties) section for more details.
+For MapReduce, implement `org.apache.iceberg.mr.CatalogLoader` and set the Hadoop property `iceberg.mr.catalog.loader.class` to load it.
+If your catalog must read Hadoop configuration to access certain environment properties, make your catalog implement `org.apache.hadoop.conf.Configurable`.
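+
+For example, a minimal sketch (reusing the hypothetical `CustomCatalog` above) of a catalog that implements `Configurable` so that the engine can inject its Hadoop configuration:
+
+```java
+public class CustomCatalog extends BaseMetastoreCatalog implements Configurable {
+
+  private Configuration configuration;
+
+  // other catalog methods omitted
+
+  @Override
+  public void setConf(Configuration conf) {
+    // called by the engine with its Hadoop configuration after instantiation
+    this.configuration = conf;
+  }
+
+  @Override
+  public Configuration getConf() {
+    return configuration;
+  }
+}
+```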
+
+### Custom file IO implementation
+
+Extend `FileIO` and provide an implementation to read and write data files.
+
+Example:
+```java
+public class CustomFileIO implements FileIO {
+
+ // must have a no-arg constructor to be dynamically loaded
+ // initialize(Map<String, String> properties) will be called to complete initialization
+ public CustomFileIO() {
+ }
+
+ @Override
+ public InputFile newInputFile(String s) {
+ // you also need to implement the InputFile interface for a custom input file
+ return new CustomInputFile(s);
+ }
+
+ @Override
+ public OutputFile newOutputFile(String s) {
+ // you also need to implement the OutputFile interface for a custom output file
+ return new CustomOutputFile(s);
+ }
+
+ @Override
+ public void deleteFile(String path) {
+ Path toDelete = new Path(path);
+ FileSystem fs = Util.getFs(toDelete, new Configuration()); // Util.getFs requires a Hadoop Configuration; in practice reuse one injected into this FileIO
+ try {
+ fs.delete(toDelete, false /* not recursive */);
+ } catch (IOException e) {
+ throw new RuntimeIOException(e, "Failed to delete file: %s", path);
+ }
+ }
+
+ // implement this method to read catalog properties during initialization
+ public void initialize(Map<String, String> properties) {
+ }
+}
+```
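+
+As a reference for the `CustomInputFile` used above, here is a minimal sketch of an `InputFile` implementation; delegating to Iceberg's built-in `HadoopInputFile` is only an assumption for illustration:
+
+```java
+public class CustomInputFile implements InputFile {
+
+  private final InputFile delegate;
+
+  public CustomInputFile(String location) {
+    // for illustration, delegate to the Hadoop-based InputFile shipped with Iceberg
+    this.delegate = HadoopInputFile.fromLocation(location, new Configuration());
+  }
+
+  @Override
+  public long getLength() {
+    return delegate.getLength();
+  }
+
+  @Override
+  public SeekableInputStream newStream() {
+    return delegate.newStream();
+  }
+
+  @Override
+  public String location() {
+    return delegate.location();
+  }
+
+  @Override
+  public boolean exists() {
+    return delegate.exists();
+  }
+}
+```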
+
+If you are already implementing your own catalog, you can implement `TableOperations.io()` to use your custom `FileIO`.
+In addition, custom `FileIO` implementations can also be dynamically loaded in `HadoopCatalog` and `HiveCatalog` by specifying the `io-impl` catalog property.
+Read the [Configuration](../configuration/#catalog-properties) section for more details.
+If your `FileIO` must read Hadoop configuration to access certain environment properties, make your `FileIO` implement `org.apache.hadoop.conf.Configurable`.
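+
+For example, a minimal sketch of wiring the custom `FileIO` into the hypothetical `CustomTableOperations` shown earlier:
+
+```java
+class CustomTableOperations extends BaseMetastoreTableOperations {
+
+  // other methods omitted
+
+  @Override
+  public FileIO io() {
+    return new CustomFileIO();
+  }
+}
+```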
+
+### Custom location provider implementation
+
+Extend `LocationProvider` and provide an implementation to determine the file path to write data.
+
+Example:
+```java
+public class CustomLocationProvider implements LocationProvider {
+
+ private String tableLocation;
+
+ // must have a 2-arg constructor like this, or a no-arg constructor
+ public CustomLocationProvider(String tableLocation, Map<String, String> properties) {
+ this.tableLocation = tableLocation;
+ }
+
+ @Override
+ public String newDataLocation(String filename) {
+ // can use any custom method to generate a file path given a file name
+ return String.format("%s/%s/%s", tableLocation, UUID.randomUUID().toString(), filename);
+ }
+
+ @Override
+ public String newDataLocation(PartitionSpec spec, StructLike partitionData, String filename) {
+ // can use any custom method to generate a file path given partition info and a file name
+ return newDataLocation(filename);
+ }
+}
+```
+
+If you are already implementing your own catalog, you can override `TableOperations.locationProvider()` to use your custom default `LocationProvider`.
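+
+For example, a minimal sketch of overriding `locationProvider()` in the hypothetical `CustomTableOperations` shown earlier:
+
+```java
+class CustomTableOperations extends BaseMetastoreTableOperations {
+
+  // other methods omitted
+
+  @Override
+  public LocationProvider locationProvider() {
+    // build the custom provider from the current table metadata
+    return new CustomLocationProvider(current().location(), current().properties());
+  }
+}
+```
+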
+To use a different custom location provider for a specific table, specify the implementation when creating the table with the table property `write.location-provider.impl`.
+
+Example:
+```sql
+CREATE TABLE hive.default.my_table (
+ id bigint,
+ data string,
+ category string)
+USING iceberg
+OPTIONS (
+ 'write.location-provider.impl'='com.my.CustomLocationProvider'
+)
+PARTITIONED BY (category);
+```
+
### Custom IcebergSource
Extend `IcebergSource` and provide implementation to read from `CustomCatalog`
diff --git a/site/docs/flink.md b/site/docs/flink.md
index 1dc34cd..044e707 100644
--- a/site/docs/flink.md
+++ b/site/docs/flink.md
@@ -89,6 +89,8 @@ export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
Flink 1.11 supports creating catalogs by using Flink SQL.
+### Hive catalog
+
This creates an Iceberg catalog named `hive_catalog` that loads tables from a Hive metastore:
```sql
@@ -110,6 +112,8 @@ CREATE CATALOG hive_catalog WITH (
* `warehouse`: The Hive warehouse location; users should specify this path if they neither set `hive-conf-dir` to point at a location containing a `hive-site.xml` configuration file nor add a correct `hive-site.xml` to the classpath.
* `hive-conf-dir`: Path to a directory containing a `hive-site.xml` configuration file, which will be used to provide custom Hive configuration values. The value of `hive.metastore.warehouse.dir` from `<hive-conf-dir>/hive-site.xml` (or the Hive configuration file on the classpath) will be overwritten with the `warehouse` value if both `hive-conf-dir` and `warehouse` are set when creating the Iceberg catalog.
+### Hadoop catalog
+
Iceberg also supports a directory-based catalog in HDFS that can be configured using `'catalog-type'='hadoop'`:
```sql
@@ -125,6 +129,19 @@ CREATE CATALOG hadoop_catalog WITH (
We can execute the SQL command `USE CATALOG hive_catalog` to set the current catalog.
+### Custom catalog
+
+Flink also supports loading a custom Iceberg `Catalog` implementation by specifying the `catalog-impl` property.
+When `catalog-impl` is set, the value of `catalog-type` is ignored. Here is an example:
+
+```sql
+CREATE CATALOG my_catalog WITH (
+ 'type'='iceberg',
+ 'catalog-impl'='com.my.custom.CatalogImpl',
+ 'my-additional-catalog-config'='my-value'
+);
+```
+
## DDL commands
### `CREATE DATABASE`
diff --git a/site/docs/spark.md b/site/docs/spark.md
index 0433d1f..1da881b 100644
--- a/site/docs/spark.md
+++ b/site/docs/spark.md
@@ -89,6 +89,16 @@ Spark's built-in catalog supports existing v1 and v2 tables tracked in a Hive Metastore
This configuration can use the same Hive Metastore for both Iceberg and non-Iceberg tables.
+### Loading a custom catalog
+
+Spark supports loading a custom Iceberg `Catalog` implementation by specifying the `catalog-impl` property.
+When `catalog-impl` is set, the value of `type` is ignored. Here is an example:
+
+```plain
+spark.sql.catalog.custom_prod = org.apache.iceberg.spark.SparkCatalog
+spark.sql.catalog.custom_prod.catalog-impl = com.my.custom.CatalogImpl
+spark.sql.catalog.custom_prod.my-additional-catalog-config = my-value
+```
## DDL commands