This is an automated email from the ASF dual-hosted git repository.
blue pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/iceberg.git
The following commit(s) were added to refs/heads/master by this push:
new 9af545e Docs: Add docs for dynamic class loading (#1737)
9af545e is described below
commit 9af545ed56343b2fa09966166f8da3e7a24100d7
Author: jackye1995 <[email protected]>
AuthorDate: Fri Nov 6 15:28:22 2020 -0800
Docs: Add docs for dynamic class loading (#1737)
Co-authored-by: Jack Ye <[email protected]>
---
site/docs/configuration.md | 21 +++++++-
site/docs/custom-catalog.md | 117 +++++++++++++++++++++++++++++++++++++++++++-
site/docs/flink.md | 17 +++++++
site/docs/spark.md | 10 ++++
4 files changed, 162 insertions(+), 3 deletions(-)
diff --git a/site/docs/configuration.md b/site/docs/configuration.md
index f0aaa9b..7805db9 100644
--- a/site/docs/configuration.md
+++ b/site/docs/configuration.md
@@ -17,6 +17,24 @@
# Configuration
+## Catalog properties
+
+Iceberg catalogs support catalog properties to configure catalog behaviors. Here is a list of commonly used catalog properties:
+
+| Property     | Default | Description                                            |
+| ------------ | ------- | ------------------------------------------------------ |
+| catalog-impl | null    | a custom `Catalog` implementation to use by an engine  |
+| io-impl      | null    | a custom `FileIO` implementation to use in a catalog   |
+| warehouse    | null    | the root path of the data warehouse                    |
+| uri          | null    | (Hive catalog only) the Hive metastore URI             |
+| clients      | 2       | (Hive catalog only) the Hive client pool size          |
+
+`HadoopCatalog` and `HiveCatalog` can access the properties in their constructors.
+Any other custom catalog can access the properties by implementing `Catalog.initialize(catalogName, catalogProperties)`.
+The properties can be manually constructed or passed in from a compute engine like Spark or Flink.
+Spark uses its session properties as catalog properties; see more details in the [Spark configuration](#spark-configuration) section.
+Flink passes in catalog properties through the `CREATE CATALOG` statement; see more details in the [Flink](../flink/#creating-catalogs-and-using-catalogs) section.
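+
+For example, a minimal sketch of constructing the properties manually and initializing a custom catalog (`CustomCatalog` here is a hypothetical implementation with a no-arg constructor):
+
+```java
+Map<String, String> properties = new HashMap<>();
+properties.put("warehouse", "hdfs://nn:8020/warehouse/path");
+
+Catalog catalog = new CustomCatalog();  // requires a no-arg constructor
+catalog.initialize("my_catalog", properties);
+```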
+
## Table properties
Iceberg tables support table properties to configure table behavior, like the default split size for readers.
@@ -95,7 +113,8 @@ Both catalogs are configured using properties nested under the catalog name:
| Property                                            | Values                        | Description                                                           |
| ---------------------------------------------------- | ----------------------------- | --------------------------------------------------------------------- |
-| spark.sql.catalog._catalog-name_.type                | hive or hadoop                | The underlying Iceberg catalog implementation                          |
+| spark.sql.catalog._catalog-name_.type                | `hive` or `hadoop`            | The underlying Iceberg catalog implementation, `HiveCatalog` or `HadoopCatalog` |
+| spark.sql.catalog._catalog-name_.catalog-impl        |                               | The underlying Iceberg catalog implementation. When set, the value of the `type` property is ignored |
| spark.sql.catalog._catalog-name_.default-namespace   | default                       | The default current namespace for the catalog                          |
| spark.sql.catalog._catalog-name_.uri                 | thrift://host:port            | URI for the Hive Metastore; default from `hive-site.xml` (Hive only)   |
| spark.sql.catalog._catalog-name_.warehouse           | hdfs://nn:8020/warehouse/path | Base path for the warehouse directory (Hadoop only)                    |
diff --git a/site/docs/custom-catalog.md b/site/docs/custom-catalog.md
index aad67d7..0ccf842 100644
--- a/site/docs/custom-catalog.md
+++ b/site/docs/custom-catalog.md
@@ -20,7 +20,9 @@
It's possible to read an Iceberg table either from an HDFS path or from a Hive table. It's also possible to use a custom metastore in place of Hive. The steps to do that are as follows.
- [Custom TableOperations](#custom-table-operations-implementation)
-- [Custom Catalog](#custom-table-implementation)
+- [Custom Catalog](#custom-catalog-implementation)
+- [Custom FileIO](#custom-file-io-implementation)
+- [Custom LocationProvider](#custom-location-provider-implementation)
- [Custom IcebergSource](#custom-icebergsource)
### Custom table operations implementation
@@ -76,7 +78,10 @@ class CustomTableOperations extends BaseMetastoreTableOperations {
}
```
-### Custom table implementation
+A `TableOperations` instance is usually obtained by calling `Catalog.newTableOps(TableIdentifier)`.
+See the next section about implementing and loading a custom catalog.
+
+### Custom catalog implementation
Extend `BaseMetastoreCatalog` to provide default warehouse locations and instantiate `CustomTableOperations`
Example:
@@ -85,6 +90,11 @@ public class CustomCatalog extends BaseMetastoreCatalog {
private Configuration configuration;
+ // must have a no-arg constructor to be dynamically loaded
+ // initialize(String name, Map<String, String> properties) will be called to complete initialization
+ public CustomCatalog() {
+ }
+
public CustomCatalog(Configuration configuration) {
this.configuration = configuration;
}
@@ -127,8 +137,111 @@ public class CustomCatalog extends BaseMetastoreCatalog {
// Example service to rename table
CustomService.renameTable(from.namespace().level(0), from.name(), to.name());
}
+
+ // implement this method to read catalog name and properties during initialization
+ public void initialize(String name, Map<String, String> properties) {
+ }
}
```
+
+Catalog implementations can be dynamically loaded in most compute engines.
+For Spark and Flink, you can specify the `catalog-impl` catalog property to load it.
+Read the [Configuration](../configuration/#catalog-properties) section for more details.
+For MapReduce, implement `org.apache.iceberg.mr.CatalogLoader` and set the Hadoop property `iceberg.mr.catalog.loader.class` to load it.
+If your catalog must read Hadoop configuration to access certain environment properties, make your catalog implement `org.apache.hadoop.conf.Configurable`.
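+
+For example, a minimal sketch (reusing the hypothetical `CustomCatalog` above) of a catalog that implements `Configurable` so that the engine can inject its Hadoop configuration:
+
+```java
+public class CustomCatalog extends BaseMetastoreCatalog implements Configurable {
+
+  private Configuration configuration;
+
+  // other catalog methods omitted
+
+  @Override
+  public void setConf(Configuration conf) {
+    // called by the engine with its Hadoop configuration after instantiation
+    this.configuration = conf;
+  }
+
+  @Override
+  public Configuration getConf() {
+    return configuration;
+  }
+}
+```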
+
+### Custom file IO implementation
+
+Extend `FileIO` and provide an implementation to read and write data files.
+
+Example:
+```java
+public class CustomFileIO implements FileIO {
+
+ // must have a no-arg constructor to be dynamically loaded
+ // initialize(Map<String, String> properties) will be called to complete initialization
+ public CustomFileIO() {
+ }
+
+ @Override
+ public InputFile newInputFile(String s) {
+ // you also need to implement the InputFile interface for a custom input file
+ return new CustomInputFile(s);
+ }
+
+ @Override
+ public OutputFile newOutputFile(String s) {
+ // you also need to implement the OutputFile interface for a custom output file
+ return new CustomOutputFile(s);
+ }
+
+ @Override
+ public void deleteFile(String path) {
+ Path toDelete = new Path(path);
+ FileSystem fs = Util.getFs(toDelete, new Configuration()); // Util.getFs requires a Hadoop Configuration; in practice reuse one injected into this FileIO
+ try {
+ fs.delete(toDelete, false /* not recursive */);
+ } catch (IOException e) {
+ throw new RuntimeIOException(e, "Failed to delete file: %s", path);
+ }
+ }
+
+ // implement this method to read catalog properties during initialization
+ public void initialize(Map<String, String> properties) {
+ }
+}
+```
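+
+As a reference for the `CustomInputFile` used above, here is a minimal sketch of an `InputFile` implementation; delegating to Iceberg's built-in `HadoopInputFile` is only an assumption for illustration:
+
+```java
+public class CustomInputFile implements InputFile {
+
+  private final InputFile delegate;
+
+  public CustomInputFile(String location) {
+    // for illustration, delegate to the Hadoop-based InputFile shipped with Iceberg
+    this.delegate = HadoopInputFile.fromLocation(location, new Configuration());
+  }
+
+  @Override
+  public long getLength() {
+    return delegate.getLength();
+  }
+
+  @Override
+  public SeekableInputStream newStream() {
+    return delegate.newStream();
+  }
+
+  @Override
+  public String location() {
+    return delegate.location();
+  }
+
+  @Override
+  public boolean exists() {
+    return delegate.exists();
+  }
+}
+```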
+
+If you are already implementing your own catalog, you can implement `TableOperations.io()` to use your custom `FileIO`.
+In addition, custom `FileIO` implementations can also be dynamically loaded in `HadoopCatalog` and `HiveCatalog` by specifying the `io-impl` catalog property.
+Read the [Configuration](../configuration/#catalog-properties) section for more details.
+If your `FileIO` must read Hadoop configuration to access certain environment properties, make your `FileIO` implement `org.apache.hadoop.conf.Configurable`.
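+
+For example, a minimal sketch of wiring the custom `FileIO` into the hypothetical `CustomTableOperations` shown earlier:
+
+```java
+class CustomTableOperations extends BaseMetastoreTableOperations {
+
+  // other methods omitted
+
+  @Override
+  public FileIO io() {
+    return new CustomFileIO();
+  }
+}
+```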
+
+### Custom location provider implementation
+
+Extend `LocationProvider` and provide an implementation to determine the file path to write data.
+
+Example:
+```java
+public class CustomLocationProvider implements LocationProvider {
+
+ private String tableLocation;
+
+ // must have a 2-arg constructor like this, or a no-arg constructor
+ public CustomLocationProvider(String tableLocation, Map<String, String> properties) {
+ this.tableLocation = tableLocation;
+ }
+
+ @Override
+ public String newDataLocation(String filename) {
+ // can use any custom method to generate a file path given a file name
+ return String.format("%s/%s/%s", tableLocation, UUID.randomUUID().toString(), filename);
+ }
+
+ @Override
+ public String newDataLocation(PartitionSpec spec, StructLike partitionData, String filename) {
+ // can use any custom method to generate a file path given partition info and a file name
+ return newDataLocation(filename);
+ }
+}
+```
+
+If you are already implementing your own catalog, you can override `TableOperations.locationProvider()` to use your custom default `LocationProvider`.
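+
+For example, a minimal sketch of overriding `locationProvider()` in the hypothetical `CustomTableOperations` shown earlier:
+
+```java
+class CustomTableOperations extends BaseMetastoreTableOperations {
+
+  // other methods omitted
+
+  @Override
+  public LocationProvider locationProvider() {
+    // build the custom provider from the current table metadata
+    return new CustomLocationProvider(current().location(), current().properties());
+  }
+}
+```
+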
+To use a different custom location provider for a specific table, specify the implementation when creating the table with the table property `write.location-provider.impl`.
+
+Example:
+```sql
+CREATE TABLE hive.default.my_table (
+ id bigint,
+ data string,
+ category string)
+USING iceberg
+OPTIONS (
+ 'write.location-provider.impl'='com.my.CustomLocationProvider'
+)
+PARTITIONED BY (category);
+```
+
### Custom IcebergSource
Extend `IcebergSource` and provide implementation to read from `CustomCatalog`
diff --git a/site/docs/flink.md b/site/docs/flink.md
index 1dc34cd..044e707 100644
--- a/site/docs/flink.md
+++ b/site/docs/flink.md
@@ -89,6 +89,8 @@ export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
Flink 1.11 supports creating catalogs by using Flink SQL.
+### Hive catalog
+
This creates an Iceberg catalog named `hive_catalog` that loads tables from a Hive metastore:
```sql
@@ -110,6 +112,8 @@ CREATE CATALOG hive_catalog WITH (
* `warehouse`: The Hive warehouse location; users should specify this path if they neither set `hive-conf-dir` to point at a location containing a `hive-site.xml` configuration file nor add a correct `hive-site.xml` to the classpath.
* `hive-conf-dir`: Path to a directory containing a `hive-site.xml` configuration file, which will be used to provide custom Hive configuration values. The value of `hive.metastore.warehouse.dir` from `<hive-conf-dir>/hive-site.xml` (or the Hive configuration file on the classpath) will be overwritten with the `warehouse` value if both `hive-conf-dir` and `warehouse` are set when creating the Iceberg catalog.
+### Hadoop catalog
+
Iceberg also supports a directory-based catalog in HDFS that can be configured using `'catalog-type'='hadoop'`:
```sql
@@ -125,6 +129,19 @@ CREATE CATALOG hadoop_catalog WITH (
We can execute the SQL command `USE CATALOG hive_catalog` to set the current catalog.
+### Custom catalog
+
+Flink also supports loading a custom Iceberg `Catalog` implementation by specifying the `catalog-impl` property.
+When `catalog-impl` is set, the value of `catalog-type` is ignored. Here is an example:
+
+```sql
+CREATE CATALOG my_catalog WITH (
+ 'type'='iceberg',
+ 'catalog-impl'='com.my.custom.CatalogImpl',
+ 'my-additional-catalog-config'='my-value'
+);
+```
+
## DDL commands
### `CREATE DATABASE`
diff --git a/site/docs/spark.md b/site/docs/spark.md
index 0433d1f..1da881b 100644
--- a/site/docs/spark.md
+++ b/site/docs/spark.md
@@ -89,6 +89,16 @@ Spark's built-in catalog supports existing v1 and v2 tables tracked in a Hive Metastore
This configuration can use the same Hive Metastore for both Iceberg and non-Iceberg tables.
+### Loading a custom catalog
+
+Spark supports loading a custom Iceberg `Catalog` implementation by specifying the `catalog-impl` property.
+When `catalog-impl` is set, the value of `type` is ignored. Here is an example:
+
+```plain
+spark.sql.catalog.custom_prod = org.apache.iceberg.spark.SparkCatalog
+spark.sql.catalog.custom_prod.catalog-impl = com.my.custom.CatalogImpl
+spark.sql.catalog.custom_prod.my-additional-catalog-config = my-value
+```
## DDL commands