yihua commented on code in PR #18782:
URL: https://github.com/apache/hudi/pull/18782#discussion_r3277831310
##########
rfc/rfc-105/rfc-105.md:
##########
@@ -0,0 +1,225 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements. See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License. You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-105: Trino Hudi Connector — Shim/Bundle Refactor
+
+## Proposers
+
+- @yihua
+- @voonhous
+
+## Approvers
+
+- @codope
+- @vinothchandar
+
+## Status
+
+Issue: [HUDI-18780](https://github.com/apache/hudi/issues/18780)
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Motivation
+
+The Trino-Hudi connector currently lives in `trinodb/trino` at `plugin/trino-hudi`. Maintaining and evolving the connector through the Trino-OSS-only path has stalled in practice, and the cost falls on Hudi users:
+
+- **Hudi-side improvement PRs to the Trino Hudi connector are not landing.** Four stacked PRs targeting the Trino Hudi connector were closed by Trino's stale-bot for lack of review:
+  - [trinodb/trino#28518](https://github.com/trinodb/trino/pull/28518)
+  - [trinodb/trino#28533](https://github.com/trinodb/trino/pull/28533)
+  - [trinodb/trino#28644](https://github.com/trinodb/trino/pull/28644)
+  - [trinodb/trino#28645](https://github.com/trinodb/trino/pull/28645)
+- **Significant Hudi-side work for the Trino connector is ready but cannot land** through the current path: metadata-table-driven partition listing, eight `HudiIndexSupport` strategies (column stats, partition stats, record-level, secondary, expression, bloom, bucket, partition bloom), MOR snapshot-isolation fixes (worker-side use of the latest commit time from the table handle), and file-system caching integration.
+- **The current arrangement does not scale.** Connector evolution must go through Trino-side review for every change, while the expertise and the source-of-truth for Hudi internals live in this project. Hudi releases cannot directly deliver improvements to Hudi users querying via Trino.
+
+Following alignment between the Hudi and Trino communities, the agreed direction is to split the connector into a thin Trino-side shim plus a Hudi-published artifact carrying the connector logic. This lets the Hudi project ship Trino-Hudi improvements with each Hudi release, while Trino picks them up via a one-line dependency-version bump.
+
+The single requirement carried over from the Trino side is that a comprehensive test suite for the connector continues to be maintained on the Trino side. This RFC documents the agreed approach and the implementation plan.
+
+## Abstract
+
+We split the Trino-Hudi connector into two Maven artifacts:
+
+1. **`io.trino:trino-hudi`** stays in Trino OSS (`plugin/trino-hudi`) as a thin shim — a `HudiPlugin` class that registers the `io.trino.spi.Plugin` SPI entry point — plus the test harness (smoke tests, query runners, MinIO-backed integration tests). This module mostly does not change once landed.
+2. **`org.apache.hudi:hudi-trino`** is a new Hudi-published Maven artifact (regular, non-shaded JAR) containing the actual connector logic at `io.trino.plugin.hudi.*` — `HudiConnectorFactory`, `HudiConnector`, `HudiMetadata`, `HudiSplitManager`, `HudiPageSourceProvider`, all index-support strategies, the `HoodieStorage`/`HoodieIOFactory` bridges to Trino's filesystem, etc. The artifact is built against the latest Trino release's SPI; it declares `hudi-common`, `hudi-io`, etc. as transitive dependencies and Trino's `trino-spi`, `trino-filesystem`, etc. as `provided`.
+
+The first publication ships in **Hudi 1.3.0**. The Trino-side shim PR pins `org.apache.hudi:hudi-trino:1.3.0`. Going forward, all Trino-Hudi connector evolution happens in Hudi OSS; Trino picks up changes by bumping the dependency version. To support this integration model, **Hudi will increase its release cadence**.
+
+## Background
+
+### State of the Trino-side connector today
+
+`plugin/trino-hudi` in `trinodb/trino` is the baseline: it implements the standard Trino SPI (`Plugin`, `ConnectorFactory`, `Connector`, `ConnectorMetadata`, `ConnectorSplitManager`, `ConnectorPageSourceProvider`, etc.), depends on `hudi-common` and `hudi-io`, and uses Hudi's `HoodieStorage` abstraction (RFC-74) over Trino's `TrinoFileSystem`. No direct Hadoop imports.
+
+### State of the Hudi-side `hudi-trino-plugin` work
+
+A more advanced version of the connector exists in Hudi-side branches under `hudi-trino-plugin/` (same `io.trino.plugin.hudi.*` package, built against a recent Trino release). On top of the Trino-OSS baseline it adds:
+
+- Eight `HudiIndexSupport` strategies (column stats, partition stats, record-level, secondary, expression, bloom, bucket, partition bloom) for file- and partition-level pruning via metadata tables.
+- Metadata-table-driven partition discovery (async, resumable).
+- MOR record-level merging via `HoodieFileGroupReader` (`HudiTrinoReaderContext`).
+- Lazy commit-time on `HudiTableHandle` for snapshot-isolated MOR reads across workers.
+- Background, weighted split generation; size-based split weighting; multi-reader routing (`HudiPageSource` for MOR, `HudiBaseFileOnlyPageSource` for COW/RO).
+- File-system cache integration.
+- HoodieStorage / HoodieIOFactory bridges over `TrinoFileSystem` (`HudiTrinoStorage`, `HudiTrinoInlineStorage`, `HudiTrinoIOFactory`).
+
+This is the body of code that will move into the `hudi-trino` Maven module on the Hudi side.
+
+### Why a "shim + Hudi-published artifact" pattern
+
+This pattern decouples Trino-Hudi connector evolution from the Trino-side release cycle:
+
+- The Hudi project can publish Trino-Hudi improvements with each Hudi release, without waiting for Trino-side reviews of every change.
+- The Trino-side surface shrinks to a stable plugin-registration shim, so Trino-side review burden is minimal — typically a one-line version bump per Hudi release.
+- All Hudi-Trino integration code (`io.trino.plugin.hudi.*`) is co-located with the Hudi core libraries it depends on. Changes that cross the Hudi-internal / connector boundary can land atomically.
+- The artifact is **purpose-built for Trino** and implements Trino's SPI directly, so no intermediate adapter layer is needed between the published artifact and the Trino plugin.
+
+Trino's `trino-spi` is governed by `revapi-maven-plugin` (see `core/trino-spi/pom.xml`) which enforces backward compatibility on the SPI surface. This is what makes a single `hudi-trino` artifact targeting the latest Trino release viable across multiple subsequent Trino releases.
+
+Trino loads each plugin in an isolated `URLClassLoader`. Transitive dependencies of `hudi-trino` (Avro, Parquet, etc.) are isolated to the plugin's classloader and cannot conflict with other plugins.
+
+## Implementation
+
+### Architecture
+
+```
+trinodb/trino : plugin/trino-hudi (packaging = trino-plugin)
+   HudiPlugin.java            ← thin shim: trivial Plugin SPI registration
+   META-INF/services/io.trino.spi.Plugin
+   src/test/java/...          ← full Trino-side test suite
+   pom.xml                    ← depends on org.apache.hudi:hudi-trino:1.3.0
+                  │
+                  │ Maven Central
+                  ▼
+apache/hudi : hudi-trino-plugin/ (Maven profile -Phudi-trino,
+                                  excluded from default reactor,
+                                  JDK 25 required)
+   io.trino.plugin.hudi.*     ← all connector logic:
+      HudiConnectorFactory, HudiConnector, HudiMetadata,
+      HudiSplitManager, HudiPageSourceProvider,
+      cache/, file/, io/, partition/,
+      query/ (incl. 8 index-support strategies),
+      reader/, split/, stats/, storage/, util/
+   src/test/java/...          ← full duplicated + expanded suite
+   Published as: org.apache.hudi:hudi-trino:1.3.0
+```
+
+### What lives where
+
+#### Trino-side `plugin/trino-hudi` (the shim)
+
+| File | Purpose |
+|---|---|
+| `src/main/java/io/trino/plugin/hudi/HudiPlugin.java` | Implements `io.trino.spi.Plugin`. Single method returning `new HudiConnectorFactory()` (from the `hudi-trino` artifact). ~10 lines. |
+| `src/main/resources/META-INF/services/io.trino.spi.Plugin` | Service-loader pointer to `io.trino.plugin.hudi.HudiPlugin`. |
+| `pom.xml` | `<packaging>trino-plugin</packaging>`; pins `org.apache.hudi:hudi-trino:<version>`; SPI deps as `provided`. |
+| `src/test/java/...` | All current Trino-side tests stay: `HudiQueryRunner`, `TestHudiSmokeTest`, `TestHudiMinioConnectorSmokeTest`, `TestHudiConnectorTest`, `TestHudiSharedMetastore`, `TestHudiSystemTables`, `TestHudiPlugin`, `TestHudiConfig`, plus data initializers. Required by the Trino-side test-coverage commitment. |
+
+#### Hudi-side `hudi-trino-plugin/` (the engine)
+
+Everything else from the current `hudi-trino-plugin/` work, organized exactly as it is today:
+
+| Subpackage | Responsibility |
+|---|---|
+| `io.trino.plugin.hudi` | `HudiConnectorFactory`, `HudiConnector`, `HudiMetadata`, `HudiSplitManager`, `HudiPageSourceProvider`, `HudiSplit`, `HudiTableHandle`, `HudiModule`, `HudiConfig`, `HudiSessionProperties`, `HudiTableProperties`, `HudiTransactionManager`, `HudiMetadataFactory`. |
+| `.cache` | `HudiCacheKeyProvider` for file-system cache integration. |
+| `.file` | `HudiBaseFile`, `HudiLogFile`, file metadata abstractions. |
+| `.io` | `HudiTrinoIOFactory` (extends `HoodieIOFactory`), `HudiTrinoFileReaderFactory`, `TrinoSeekableDataInputStream`. |
+| `.partition` | `HudiPartitionInfo`, `HiveHudiPartitionInfo`, `HudiPartitionInfoLoader` (async resumable task). |
+| `.query` | `HudiDirectoryLister`, `HudiReadOptimizedDirectoryLister`, `HudiSnapshotDirectoryLister`; `query.index` package with 8 `HudiIndexSupport` strategies. |
+| `.reader` | `HudiTrinoReaderContext extends HoodieReaderContext<IndexedRecord>` for MOR record merging. |
+| `.split` | `HudiSplitFactory`, `HudiBackgroundSplitLoader`, `HudiSplitSource`, `HudiSplitWeightProvider`, `SizeBasedSplitWeightProvider`. |
+| `.stats` | `HudiTableStatistics`, `TableStatisticsReader`. |
+| `.storage` | `HudiTrinoStorage` (extends `HoodieStorage`), `HudiTrinoInlineStorage`, `TrinoStorageConfiguration`. |
+| `.util` | Serialization helpers, column synthesis, tuple-domain conversion, table-type utilities. |
+
+### API boundary
+
+The boundary between the shim and the published artifact is **Trino's SPI itself** — no intermediate API layer is introduced.
+
+- **Shim → artifact:** `HudiPlugin.getConnectorFactories()` returns `new HudiConnectorFactory()` defined in the artifact. Trino's runtime then calls `factory.create(catalogName, config, context)`. The `ConnectorContext` argument carries everything the artifact needs — `TypeManager`, `NodeManager`, `MetadataProvider`, `PageSorter`, `PageIndexerFactory`, `OpenTelemetry`, `Tracer`, `CatalogHandle` — without the artifact importing implementation classes.
+- **Artifact → Trino:** the artifact's `HudiConnector` exposes the standard SPI providers (`ConnectorMetadata`, `ConnectorSplitManager`, `ConnectorPageSourceProvider`, etc.). Trino calls these. Classloader context is handled by the standard `ClassLoaderSafe*` wrappers (`io.trino.plugin.base.classloader.*`) — already used today.
+
+### Maven dependencies for `hudi-trino`
+
+- **`compile`:** Hudi libs (`hudi-common`, `hudi-io`, `hudi-hive-sync`, `hudi-sync-common`) and Trino libs (`trino-filesystem`, `trino-hive`, `trino-metastore`, `trino-parquet`, `trino-cache`), Guice, Airlift, Caffeine.
+- **`provided`:** `trino-spi`, `slice`, Jackson, OpenTelemetry API, JOL (supplied by Trino at runtime).
+- **`runtime`:** log-manager, Dropwizard metrics, OpenTelemetry SDK, `trino-hive-formats`.
+- **`test`:** Trino testing libs (`trino-testing`, `trino-main`, `trino-testing-containers`, `trino-hdfs`), AssertJ, JUnit 5, Hudi test JARs.
+
+**Version alignment policy.** Trino versions are authoritative for shared libraries (Avro, Parquet, Jackson, Airlift). The `hudi-trino` POM pins these via `<dependencyManagement>` to whatever the targeted Trino release uses. If Hudi internals need a newer version, the fix is on the Hudi side or via a Trino-version bump — never by shipping divergent classpath versions.
+
+### Build target on Hudi side
+
+Trino requires Java 25, while the rest of Hudi targets a lower Java floor. `hudi-trino-plugin` therefore lives behind a Maven profile (`-Phudi-trino`) and is **excluded from the default `mvn install` reactor**:
+
+```xml
+<profile>
+  <id>hudi-trino</id>
+  <modules>
+    <module>hudi-trino-plugin</module>
+  </modules>
+</profile>
+```
+
+Default build (`mvn install`) skips it; Trino-targeted build (`mvn install -Phudi-trino`) requires JDK 25.
+
+### CI
+
+Two new GitHub Actions on the Hudi side, required for any change touching `hudi-trino-plugin/**`:
+
+1. **`hudi-trino-ci.yml`** — runs the full test suite via `mvn verify -Phudi-trino` on JDK 25. Catches regressions before they ship in a Hudi release.
+2. **`hudi-trino-compat.yml`** — nightly: pulls latest `trinodb/trino` master, builds Trino's relevant modules, then compiles `hudi-trino-plugin` against them. Compile-only; flags SPI drift before the next Trino release.
+
+On the Trino side, existing CI continues to build and test `plugin/trino-hudi`, exercising the published `hudi-trino` artifact end-to-end on every Trino PR.
+
+### Test strategy
+
+**Full test duplication.** The Trino-side smoke tests (`TestHudiSmokeTest`, `TestHudiMinioConnectorSmokeTest`, `TestHudiConnectorTest`, etc.) are mirrored on the Hudi side and additionally extended.
+
+- **Trino side runs them** on every Trino PR — fulfilling the Trino-side test-coverage commitment.
+- **Hudi side runs them** on every Hudi PR touching `hudi-trino-plugin` — so Hudi contributors catch regressions before they ship in a Hudi release. The Hudi-side suite is also **expanded** with more granular unit tests covering split generation edge cases, all eight index-support strategies, the MOR record-merging path, lazy-commit-time snapshot isolation, and the cache-key provider.
+
+This duplication has a known cost — two places to update when adding tests — but is the right trade-off given:
+- The Trino-side suite must remain comprehensive as agreed with the Trino community.
+- Hudi-side contributors need fast feedback without waiting for a Trino-side PR cycle.
+
+### Risks & caveats
+
+- **Trino SPI drift.** A future Trino SPI change could break the pre-built `hudi-trino` artifact at runtime. Mitigation: the nightly compat CI flags incompatibilities against Trino master before a Trino release ships.
+- **Avro / Parquet / Jackson version skew.** Resolved by policy: Trino's versions are authoritative, pinned via `<dependencyManagement>` in the `hudi-trino` POM. Hudi-side fixes or Trino-version bumps adjust to it.
+- **Test-infrastructure coupling.** `hudi-trino-plugin`'s test scope depends on `trino-testing`, `trino-main`, etc., coupling the Hudi build to Trino artifacts on Maven Central. Acceptable cost.
+- **Release coordination.** A critical fix in `hudi-trino` ships only via a Hudi release. Mitigation: keep the Trino-side shim trivial so virtually all fixes can land in `hudi-trino`, and increase Hudi release cadence.
+- **License / ASF process.** Cross-project releases between two ASF projects; covered by standard PMC announcements at first release.
+
+## Rollout/Adoption Plan
+
+**Step 1 — Hudi 1.3.0 publishes `hudi-trino`.** Land the `hudi-trino-plugin` work in `apache/hudi` master behind the `-Phudi-trino` profile, land the two CI workflows, then publish `org.apache.hudi:hudi-trino:1.3.0` to Maven Central as part of the 1.3.0 release. Hudi commits to a more frequent release cadence going forward.

Review Comment:
   I think whenever there are enough improvements in `hudi-trino` we can cut a new Hudi release. To start with, we can publish a major release every month with more frequent minor releases to stabilize this module.
