markhoerth commented on code in PR #10539: URL: https://github.com/apache/gravitino/pull/10539#discussion_r2992471147
########## design/aws-glue-catalog-connector.md: ##########
@@ -0,0 +1,592 @@
<!--
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied.  See the License for the
  specific language governing permissions and limitations
  under the License.
-->

# Design: AWS Glue Data Catalog Support for Apache Gravitino

## 1. Problem Statement and Goals

### 1.1 Problem

**Gravitino currently cannot federate AWS Glue Data Catalog.** This is a significant gap because:

1. **Large user base on AWS**: The majority of cloud-native data lakes run on AWS with Glue Data Catalog as the central metadata service (the default for Athena, Redshift Spectrum, EMR, and Lake Formation). These organizations cannot bring their Glue metadata into Gravitino's unified management layer.
2. **No native integration path**: The only workaround is pointing Gravitino's Hive catalog at Glue's HMS-compatible Thrift endpoint (`metastore.uris = thrift://...`), which is undocumented, region-limited, and cannot leverage Glue-native features (catalog ID, cross-account access, VPC endpoints).
3. **Competitive landscape**: Trino, Spark, and other engines all have first-class Glue support with dedicated configuration. Users expect the same from Gravitino.

### 1.2 Goals

After this feature is implemented:

1.
**Register AWS Glue Data Catalog in Gravitino**:

   ```bash
   # Hive-format tables
   gcli catalog create --name hive_on_glue --provider hive \
     --properties metastore-type=glue,s3-region=us-east-1

   # Iceberg-format tables
   gcli catalog create --name iceberg_on_glue --provider lakehouse-iceberg \
     --properties catalog-backend=glue,warehouse=s3://bucket/iceberg,s3-region=us-east-1
   ```

2. **Standard Gravitino API works against Glue catalogs**:

   ```bash
   gcli schema list --catalog hive_on_glue
   gcli table list --catalog hive_on_glue --schema my_database
   gcli table details --catalog iceberg_on_glue --schema analytics --table events
   ```

3. **Trino and Spark connect transparently** — Trino uses `hive.metastore=glue` / `iceberg.catalog.type=glue`; Spark uses `AWSGlueDataCatalogHiveClientFactory` / `GlueCatalog`. Users query Glue tables through Gravitino without knowing the underlying mechanism.

4. **AWS-native authentication** (reuses existing S3 properties): static credentials, STS AssumeRole, or the default credential chain (environment variables, instance profile).

## 2. Background

### 2.1 AWS Glue Data Catalog

AWS Glue Data Catalog is a managed metadata repository storing:

- **Databases** — logical groupings, equivalent to Gravitino schemas.
- **Tables** — metadata records containing column definitions, storage descriptors, partition keys, and user-defined parameters.

Tables come in two formats:

| Format | How Glue Stores It |
|---|---|
| **Hive** | Full metadata in `StorageDescriptor` (columns, SerDe, InputFormat, OutputFormat, location). The majority of tables in most Glue catalogs (legacy ETL, Athena CTAS, Redshift Spectrum). |
| **Iceberg** | `Parameters["table_type"] = "ICEBERG"` and `Parameters["metadata_location"]` pointing to the Iceberg metadata JSON on S3. `StorageDescriptor.Columns` is typically empty. Growing rapidly. |

A complete Glue integration must handle both table formats.
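As a rough illustration of the distinction in the table above, a backend could classify a Glue table by inspecting its parameters. This is a sketch only — `isIcebergTable` and the plain-`Map` input are assumptions for illustration, not part of the Glue SDK or the proposed code:

```java
import java.util.Map;

public class GlueTableFormat {
    // Hypothetical helper: Iceberg tables in Glue are marked by
    // Parameters["table_type"] = "ICEBERG" plus a metadata_location
    // pointing at the metadata JSON on S3; everything else is
    // treated as a Hive-format table.
    static boolean isIcebergTable(Map<String, String> parameters) {
        return parameters != null
            && "ICEBERG".equalsIgnoreCase(parameters.get("table_type"))
            && parameters.containsKey("metadata_location");
    }

    public static void main(String[] args) {
        Map<String, String> iceberg = Map.of(
            "table_type", "ICEBERG",
            "metadata_location", "s3://bucket/iceberg/metadata/v1.metadata.json");
        Map<String, String> hive = Map.of("EXTERNAL", "TRUE");
        System.out.println(isIcebergTable(iceberg)); // true
        System.out.println(isIcebergTable(hive));    // false
    }
}
```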
### 2.2 How Query Engines Use Glue

Trino and Spark both have native Glue support — they call the AWS Glue SDK directly, not via HMS Thrift:

| Engine | Hive Tables on Glue | Iceberg Tables on Glue |
|---|---|---|
| **Trino** | Hive connector with `hive.metastore=glue` | Iceberg connector with `iceberg.catalog.type=glue` |
| **Spark** | Hive catalog with `AWSGlueDataCatalogHiveClientFactory` | Iceberg catalog with `catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog` |

Both engines use a **one-catalog-to-one-connector** model — a single catalog handles either Hive-format or Iceberg-format tables, not both. This is consistent with Gravitino's existing catalog model.

### 2.3 Gravitino's Current Architecture

Gravitino's catalog plugin system provides:

- **Hive catalog** (`provider=hive`): Connects to HMS via Thrift. Client chain: `HiveCatalogOperations` → `CachedClientPool` → `HiveClientImpl` → `HiveShimV2/V3` → `IMetaStoreClient`.
- **Iceberg catalog** (`provider=lakehouse-iceberg`): Supports pluggable backends (`catalog-backend=hive|jdbc|rest|memory|custom`). Each backend maps to a different Iceberg `Catalog` implementation.
- **Trino/Spark connectors**: Property converters translate Gravitino catalog properties into engine-specific properties.

## 3. Design Alternatives

### Alternative A: New `catalog-glue` Module

Create a standalone `catalogs/catalog-glue/` with its own `GlueCatalogOperations`, type converters, and entity classes. Directly call the AWS Glue SDK for both Hive and Iceberg tables.

**Pros**: Full control over Glue-specific behavior. A single catalog can serve mixed table formats.

**Cons**:

- Duplicates logic already in the Hive catalog (type conversion, partition handling, SerDe parsing) and the Iceberg catalog (schema conversion, metadata loading).
- Trino/Spark integration requires a "Composite Connector" that routes queries based on table type — a significant architectural change.
- Larger implementation surface area and maintenance burden.

### Alternative B: Glue as a Metastore Type (Chosen)

Extend the existing Hive and Iceberg catalogs with Glue as a backend option.

**Pros**:

- Reuses all existing catalog logic, type conversion, property handling, and entity models.
- Trino/Spark integration works almost for free — both engines already have native Glue support.
- Much smaller change set (~15 files modified, 1 new file vs. ~15 new files).
- Consistent with how Trino and Spark model Glue (as a metastore variant, not a separate catalog type).

**Cons**:

- Users must create two Gravitino catalogs to cover both Hive and Iceberg tables from the same Glue Data Catalog.
- Cannot add Glue-only features (e.g., Glue crawlers) without extending the generic interfaces.

**Decision**: Alternative B — the reuse benefits and Trino/Spark alignment outweigh the minor UX cost of two catalogs.

## 4. Detailed Design

### 4.1 Configuration Properties

Gravitino already defines standardized AWS/S3 properties in `S3Properties.java`:

| Existing Property | Used By |
|---|---|
| `s3-access-key-id` / `s3-secret-access-key` | Iceberg, Hive (S3 storage + Glue auth) |
| `s3-region` | Iceberg, Hive (S3 storage + Glue region) |
| `s3-role-arn` / `s3-external-id` | Iceberg, Hive (STS AssumeRole) |
| `s3-endpoint` | Iceberg, Hive (custom S3 endpoint) |

We **reuse `s3-region` as the default AWS region for both Glue and S3** and **reuse `s3-access-key-id` / `s3-secret-access-key` for authentication**. These properties already exist in `S3Properties.java` and are already handled by both the Hive and Iceberg catalogs — no new code is required for credential plumbing.
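The credential-resolution order this reuse implies can be sketched as a pure function over the catalog properties. This is illustrative only — `CredentialSource` and `selectCredentialSource` are hypothetical names; the real mapping lives in the existing `S3Properties` handling:

```java
import java.util.Map;

public class GlueAuthSketch {
    enum CredentialSource { STATIC_KEYS, ASSUME_ROLE, DEFAULT_CHAIN }

    // Priority described in Section 4.1: static keys first, then STS
    // AssumeRole via s3-role-arn, then the AWS default credential chain
    // (environment variables, instance profile).
    static CredentialSource selectCredentialSource(Map<String, String> props) {
        if (props.containsKey("s3-access-key-id")
                && props.containsKey("s3-secret-access-key")) {
            return CredentialSource.STATIC_KEYS;
        }
        if (props.containsKey("s3-role-arn")) {
            return CredentialSource.ASSUME_ROLE;
        }
        return CredentialSource.DEFAULT_CHAIN;
    }

    public static void main(String[] args) {
        System.out.println(selectCredentialSource(
            Map.of("s3-access-key-id", "AKIA...", "s3-secret-access-key", "secret")));
        System.out.println(selectCredentialSource(
            Map.of("s3-role-arn", "arn:aws:iam::123456789012:role/gravitino")));
        System.out.println(selectCredentialSource(Map.of())); // DEFAULT_CHAIN
    }
}
```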
Only two new Glue-specific properties are needed (prefixed with `aws-glue-` to clearly indicate that they are AWS Glue Data Catalog settings, distinct from the generic `s3-` storage properties):

| New Property | Required | Default | Description |
|---|---|---|---|
| `aws-glue-catalog-id` | No | Caller's AWS account ID | Glue catalog ID, for cross-account access. |
| `aws-glue-endpoint` | No | AWS default regional endpoint | Custom Glue endpoint URL (for VPC endpoints or LocalStack testing). |

No other Glue-specific properties are needed — all authentication and region settings are covered by the existing `s3-*` properties.

**Authentication priority** (existing implementation in `S3Properties`, reused as-is): static credentials (`s3-access-key-id` + `s3-secret-access-key`) → STS AssumeRole (`s3-role-arn`) → default credential chain (environment variables, instance profile). The mapping from `s3-*` properties to AWS SDK / Glue SDK credentials is done in the property conversion layer (Sections 4.2 and 4.3).

### 4.2 Iceberg Catalog + Glue Backend

Add `GLUE` as a new `IcebergCatalogBackend` enum value. Use Iceberg's built-in `org.apache.iceberg.aws.glue.GlueCatalog`.

#### Data Flow

```
User: catalog-backend=glue, warehouse=s3://..., s3-region=us-east-1
  → IcebergCatalogOperations.initialize()
  → IcebergCatalogUtil.loadCatalogBackend(GLUE, config)
  → loadGlueCatalog(config)
  → new GlueCatalog().initialize("glue", {
        "warehouse":     "s3://...",
        "client.region": "us-east-1",
        "glue.catalog-id": "..." })
  → All existing IcebergCatalogOperations methods work unchanged
```

`GlueCatalog` is an official Iceberg implementation with full schema CRUD and table CRUD support — this is the lowest-risk part of the design.
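The property hand-off in the data flow above can be sketched as a pure translation from Gravitino catalog properties to the map passed to `GlueCatalog.initialize()`. The helper name `toGlueCatalogInitProperties` and its plain-`Map` signature are assumptions for illustration; the target keys mirror the data flow shown:

```java
import java.util.HashMap;
import java.util.Map;

public class GlueInitProps {
    // Hypothetical sketch of what loadGlueCatalog(config) might hand to
    // GlueCatalog.initialize(): the warehouse passes through, s3-region
    // becomes the AWS client region, and aws-glue-catalog-id (if set)
    // becomes the Glue catalog ID for cross-account access.
    static Map<String, String> toGlueCatalogInitProperties(Map<String, String> catalogProps) {
        Map<String, String> init = new HashMap<>();
        init.put("warehouse", catalogProps.get("warehouse"));
        if (catalogProps.containsKey("s3-region")) {
            init.put("client.region", catalogProps.get("s3-region"));
        }
        if (catalogProps.containsKey("aws-glue-catalog-id")) {
            init.put("glue.catalog-id", catalogProps.get("aws-glue-catalog-id"));
        }
        return init;
    }

    public static void main(String[] args) {
        Map<String, String> props = Map.of(
            "warehouse", "s3://bucket/iceberg",
            "s3-region", "us-east-1");
        System.out.println(toGlueCatalogInitProperties(props));
    }
}
```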
#### Engine Integration

**Trino** — Add `case "glue":` in `IcebergCatalogPropertyConverter.gravitinoToEngineProperties()`:

```java
// In IcebergCatalogPropertyConverter.java:
case "glue":
  icebergProperties.put("iceberg.catalog.type", "glue");
  // Map Gravitino s3-region → Trino hive.metastore.glue.region
  String region = properties.get("s3-region");
  if (region != null) {
    icebergProperties.put("hive.metastore.glue.region", region);
  }
  // Map aws-glue-catalog-id → Trino hive.metastore.glue.catalogid
  String catalogId = properties.get("aws-glue-catalog-id");
  if (catalogId != null) {
    icebergProperties.put("hive.metastore.glue.catalogid", catalogId);
  }
  break;
```

**Spark** — No code change needed. The existing `IcebergPropertiesConverter` does a generic passthrough: `all.put(ICEBERG_CATALOG_TYPE, catalogBackend)` already passes `"glue"` to Spark's Iceberg catalog, which natively supports `GlueCatalog`.

### 4.3 Hive Catalog + Glue Backend

Add a `metastore-type=glue` property (the Gravitino user-facing key). During `HiveCatalogOperations.initialize()`, this is mapped to the Hive-internal property `metastore.type=glue` via the `GRAVITINO_CONFIG_TO_HIVE` mapping. All Java code snippets below use the Hive-internal key `metastore.type`. Use AWS's `aws-glue-datacatalog-hive3-client` library, which provides an `IMetaStoreClient` implementation backed by the Glue SDK.

#### Data Flow

```
User: metastore-type=glue, s3-region=us-east-1
  → HiveCatalogOperations.initialize()
  → mergeProperties(conf) — maps Glue properties
  → CachedClientPool(properties)
  → HiveClientPool.newClient()
  → HiveClientFactory.createHiveClient()           ← MODIFIED: skip hive2/3 detection
  → HiveClientClassLoader.createLoader(HIVE3, ...)
      ← always Hive3 for Glue
  → HiveClientImpl(HIVE3, properties)
  → detects metastore.type=glue
  → new GlueShim(properties)                       ← NEW (replaces HiveShimV3)
  → createMetaStoreClient()
  → AWSGlueDataCatalogHiveClientFactory.create(hiveConf)
  → returns AWSCatalogMetastoreClient (implements IMetaStoreClient)
  → All existing HiveCatalogOperations methods work unchanged
```

#### Hive Version Resolution

**Problem**: `HiveClientFactory.createHiveClientWithBackend()` currently probes the remote HMS to detect Hive2 vs. Hive3 at runtime — it calls `getCatalogs()` and falls back to Hive2 if the RPC fails. This probe-and-fallback approach has two issues for Glue: (1) there is no remote HMS to probe, and (2) the version is already known from the catalog configuration.

**Solution**: Extract a `resolveHiveVersion(Properties)` method that determines the Hive version from the catalog configuration, avoiding runtime probing when possible:

```java
// In HiveClientFactory:

/**
 * Resolves the Hive version from catalog configuration.
 * Returns UNKNOWN when the version cannot be determined statically (HMS mode).
 */
private HiveVersion resolveHiveVersion(Properties properties) {
  String metastoreType = properties.getProperty("metastore.type", "hive");
  if ("glue".equalsIgnoreCase(metastoreType)) {
    return HiveVersion.HIVE3; // Glue always uses Hive3
  }
  return HiveVersion.UNKNOWN; // HMS: let createHiveClient() probe at runtime
}
```

When `resolveHiveVersion()` returns `UNKNOWN`, `createHiveClient()` falls into the existing probe-and-fallback path (`createHiveClientWithBackend()`). When it returns a concrete version (`HIVE3`), the probe is skipped entirely.

This design:

- **Eliminates hardcoding**: version resolution is centralized in one method, driven by catalog configuration.
- **Is extensible**: future backends (e.g., a Hive2 Glue client) can add new branches to `resolveHiveVersion()` without modifying `createHiveClient()`.
- **Preserves existing behavior**: for HMS metastores (`metastore.type=hive` or unset), the existing probe-and-fallback logic is unchanged — just extracted into `probeHmsVersion()`.

**Why Hive3 for Glue?** AWS Glue Data Catalog is a managed service with a single API version — there is no concept of Hive2 vs. Hive3 on the server side. We choose the Hive3 classloader because:

1. **JAR location**: `HiveClientClassLoader.getJarDirectory()` maps `HIVE3` → `hive-metastore3-libs/`, where the Glue client JAR is placed (see Section 4.4).
2. **Active maintenance**: AWS's `aws-glue-datacatalog-hive3-client` is the actively maintained variant; the Hive2 client is legacy.
3. **API compatibility**: The `IMetaStoreClient` interface differs between Hive2 and Hive3 (Hive3 adds catalog-aware methods). The Glue client JAR must match the Hive version of the classloader it is loaded into.

#### GlueShim Design

`GlueShim` extends `HiveShimV3` and overrides only `createMetaStoreClient()`:

| Shim | Parent | `createMetaStoreClient()` | Calling Convention |
|---|---|---|---|
| `HiveShimV2` | `HiveShim` | `RetryingMetaStoreClient.getProxy(hiveConf)` → Thrift HMS | 2-arg: `getDatabase(db)` |
| `HiveShimV3` | `HiveShimV2` | Same as V2 | 3-arg: `getDatabase(catalog, db)` — catalog-aware |
| `GlueShim` | `HiveShimV3` | `AWSGlueDataCatalogHiveClientFactory.create(hiveConf)` → Glue SDK | Inherits HiveShimV3's 3-arg convention |

**Why extend `HiveShimV3`?** `GlueShim` uses the Hive3 classloader and `aws-glue-datacatalog-hive3-client`, which implements the Hive3 version of `IMetaStoreClient`. `HiveShimV3` provides the correct 3-arg calling convention (catalog-aware method signatures) that matches this interface. Extending `HiveShimV2` would use the 2-arg convention, which would not match the Hive3 `IMetaStoreClient` loaded by the Hive3 classloader.

All three shims return `IMetaStoreClient`.
`HiveClientImpl` selects the shim based on `metastore.type`:

```java
// In HiveClientImpl constructor:
String metastoreType = properties.getProperty("metastore.type", "hive");
if ("glue".equalsIgnoreCase(metastoreType)) {
  shim = new GlueShim(properties); // extends HiveShimV3
} else {
  switch (hiveVersion) {
    case HIVE2: shim = new HiveShimV2(properties); break;
    case HIVE3: shim = new HiveShimV3(properties); break;
    default: throw new IllegalStateException("Unresolved Hive version: " + hiveVersion);
  }
}
```

All upstream code (`HiveClientPool`, `CachedClientPool`, `HiveCatalogOperations`) is unchanged — it programs against the `HiveClient` interface.

#### IMetaStoreClient Relationship

```
org.apache.hadoop.hive.metastore.IMetaStoreClient   ← Hive standard interface
  ├── HiveMetaStoreClient        (Thrift impl, connects to HMS)
  └── AWSCatalogMetastoreClient  (Glue impl, via AWS Glue SDK)
        └── Created by AWSGlueDataCatalogHiveClientFactory.create(hiveConf)
```

`AWSCatalogMetastoreClient` is a drop-in replacement for `HiveMetaStoreClient`; all upstream code is completely unaware of the difference.
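The version-resolution and shim-selection rules described in this section can be exercised together as a compilable toy model. Everything here is a stand-in — the stub shim classes, `HiveVersion` enum, and `selectShim` helper are assumptions mirroring the design, not the real Hive catalog classes:

```java
import java.util.Properties;

public class ShimSelectionSketch {
    enum HiveVersion { HIVE2, HIVE3, UNKNOWN }

    // Stand-in shim hierarchy mirroring HiveShimV2 → HiveShimV3 → GlueShim;
    // the real classes live in the Hive catalog module.
    static class HiveShimV2 {}
    static class HiveShimV3 extends HiveShimV2 {}
    static class GlueShim extends HiveShimV3 {}

    // Mirrors resolveHiveVersion(): Glue pins HIVE3 statically; HMS stays
    // UNKNOWN so the runtime probe-and-fallback path decides.
    static HiveVersion resolveHiveVersion(Properties props) {
        String metastoreType = props.getProperty("metastore.type", "hive");
        return "glue".equalsIgnoreCase(metastoreType)
            ? HiveVersion.HIVE3 : HiveVersion.UNKNOWN;
    }

    // Mirrors the HiveClientImpl constructor logic: Glue gets GlueShim
    // regardless of the probed version; HMS picks a shim by version.
    static HiveShimV2 selectShim(Properties props, HiveVersion probedVersion) {
        if ("glue".equalsIgnoreCase(props.getProperty("metastore.type", "hive"))) {
            return new GlueShim();
        }
        switch (probedVersion) {
            case HIVE2: return new HiveShimV2();
            case HIVE3: return new HiveShimV3();
            default: throw new IllegalStateException("Unresolved Hive version");
        }
    }

    public static void main(String[] args) {
        Properties glue = new Properties();
        glue.setProperty("metastore.type", "glue");
        System.out.println(resolveHiveVersion(glue)); // HIVE3

        Properties hms = new Properties();
        System.out.println(resolveHiveVersion(hms));  // UNKNOWN
        System.out.println(
            selectShim(hms, HiveVersion.HIVE2).getClass().getSimpleName()); // HiveShimV2
    }
}
```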
#### Engine Integration

**Trino** — Add a Glue branch in `HiveConnectorAdapter.buildInternalConnectorConfig()`:

```java
// In HiveConnectorAdapter.java:
String metastoreType = catalog.getProperty("metastore-type", "hive");
if ("glue".equalsIgnoreCase(metastoreType)) {
  // Use Trino's native Glue metastore support
  config.put("hive.metastore", "glue");
  String region = catalog.getProperty("s3-region");
  if (region != null) {
    config.put("hive.metastore.glue.region", region);
  }
  String catalogId = catalog.getProperty("aws-glue-catalog-id");
  if (catalogId != null) {
    config.put("hive.metastore.glue.catalogid", catalogId);
  }
} else {
  // Existing HMS path — unchanged
  config.put("hive.metastore.uri", catalog.getRequiredProperty("metastore.uris"));
}
```

**Spark** — Add a Glue branch in `HivePropertiesConverter.toSparkCatalogProperties()`:

```java
// In HivePropertiesConverter.java:
String metastoreType = properties.get("metastore-type");
if ("glue".equalsIgnoreCase(metastoreType)) {
  // Use AWS Glue Data Catalog as the Hive metastore for Spark
  sparkProperties.put("spark.hadoop.hive.metastore.client.factory.class",
      "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory");
  String region = properties.get("s3-region");
  if (region != null) {
    sparkProperties.put("spark.hadoop.aws.region", region);
  }
} else {
  // Existing HMS path — unchanged
  sparkProperties.put(SPARK_HIVE_METASTORE_URI, properties.get(GRAVITINO_HIVE_METASTORE_URI));
}
```

### 4.4 Dependency Management

#### Iceberg + Glue

| Dependency | Target Module | Scope |
|---|---|---|
| `org.apache.iceberg:iceberg-aws` — contains the `GlueCatalog` implementation; transitively depends on `software.amazon.awssdk:glue`; already in the version catalog as `libs.iceberg.aws` | `iceberg/iceberg-common/build.gradle.kts` | `compileOnly` (provided at runtime by `bundles/iceberg-aws-bundle`) |

No changes to `gradle/libs.versions.toml` are required.
#### Hive + Glue

| Dependency | Target Module | Scope |
|---|---|---|
| `com.amazonaws:aws-glue-datacatalog-hive3-client` — implements `IMetaStoreClient` via the Glue SDK; provides `AWSGlueDataCatalogHiveClientFactory` | `catalogs/hive-metastore3-libs/build.gradle.kts` | `implementation` (packaged into `hive-metastore3-libs/`) |

**Why `hive-metastore3-libs`?** The Hive catalog uses `HiveClientClassLoader` for class isolation — it loads JARs from `hive-metastore2-libs/` or `hive-metastore3-libs/`. `GlueShim` uses the Hive3 classloader (see Section 4.3), so the Glue client JAR must be in `hive-metastore3-libs`.

### 4.5 End-to-End Architecture

```
                               Gravitino Server
                                      |
        +------ provider=hive -------+------- provider=lakehouse-iceberg ------+
        |  metastore-type=glue       |           catalog-backend=glue          |
        |                                                                      |
  HiveCatalogOperations                               IcebergCatalogOperations
        |                                                        |
  HiveClientImpl                                        IcebergCatalogUtil
    -> GlueShim                                           -> loadGlueCatalog()
    -> AWSCatalogMetastoreClient                          -> org.apache.iceberg.aws.glue.GlueCatalog
       (impl IMetaStoreClient)                               (impl org.apache.iceberg.catalog.Catalog)
        |                                                        |
        +------------------------ AWS Glue SDK ------------------+
                                      |
                            AWS Glue Data Catalog
                                      |
                           +----------+----------+
                           |                     |
                      Hive Tables          Iceberg Tables
                   (StorageDescriptor)   (metadata_location)


                                Query Engines
                                      |
                    +---- Trino ----+   +---- Spark ----+
                    |               |   |               |
             Hive Connector  Iceberg Connector  HiveCatalog  SparkCatalog
             metastore=glue  catalog.type=glue  factory=AWS  catalog-impl=GlueCatalog
```

Review Comment:
   The diagram in Section 4.5 does not show a connection between the Gravitino Server and the Query Engines. Can you clarify whether Trino and Spark are connecting to Gravitino as a metadata proxy at query time, or connecting directly to Glue? This has significant implications for whether Gravitino's governance model applies to query execution.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use
the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
