yihua commented on code in PR #18782:
URL: https://github.com/apache/hudi/pull/18782#discussion_r3277835098


##########
rfc/rfc-105/rfc-105.md:
##########
@@ -0,0 +1,225 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-105: Trino Hudi Connector — Shim/Bundle Refactor
+
+## Proposers
+
+- @yihua
+- @voonhous
+
+## Approvers
+
+- @codope
+- @vinothchandar
+
+## Status
+
+Issue: [HUDI-18780](https://github.com/apache/hudi/issues/18780)
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Motivation
+
+The Trino-Hudi connector currently lives in `trinodb/trino` at 
`plugin/trino-hudi`. Maintaining and evolving the connector through the 
Trino-OSS-only path has stalled in practice, and the cost falls on Hudi users:
+
+- **Hudi-side improvement PRs to the Trino Hudi connector are not landing.** 
Four stacked PRs targeting the Trino Hudi connector were closed by Trino's 
stale-bot for lack of review:
+  - [trinodb/trino#28518](https://github.com/trinodb/trino/pull/28518)
+  - [trinodb/trino#28533](https://github.com/trinodb/trino/pull/28533)
+  - [trinodb/trino#28644](https://github.com/trinodb/trino/pull/28644)
+  - [trinodb/trino#28645](https://github.com/trinodb/trino/pull/28645)
+- **Significant Hudi-side work for the Trino connector is ready but cannot 
land** through the current path: metadata-table-driven partition listing, eight 
`HudiIndexSupport` strategies (column stats, partition stats, record-level, 
secondary, expression, bloom, bucket, partition bloom), MOR snapshot-isolation 
fixes (worker-side use of the latest commit time from the table handle), and 
file-system caching integration.
+- **The current arrangement does not scale.** Connector evolution must go 
through Trino-side review for every change, while the expertise and the 
source-of-truth for Hudi internals live in this project. Hudi releases cannot 
directly deliver improvements to Hudi users querying via Trino.
+
+Following alignment between the Hudi and Trino communities, the agreed 
direction is to split the connector into a thin Trino-side shim plus a 
Hudi-published artifact carrying the connector logic. This lets the Hudi 
project ship Trino-Hudi improvements with each Hudi release, while Trino picks 
them up via a one-line dependency-version bump.
+
+The single requirement carried over from the Trino side is that a 
comprehensive test suite for the connector continues to be maintained on the 
Trino side. This RFC documents the agreed approach and the implementation plan.
+
+## Abstract
+
+We split the Trino-Hudi connector into two Maven artifacts:
+
+1. **`io.trino:trino-hudi`** stays in Trino OSS (`plugin/trino-hudi`) as a 
thin shim — a `HudiPlugin` class that registers the `io.trino.spi.Plugin` SPI 
entry point — plus the test harness (smoke tests, query runners, MinIO-backed 
integration tests). Once landed, this module is expected to change rarely.
+2. **`org.apache.hudi:hudi-trino`** is a new Hudi-published Maven artifact 
(regular, non-shaded JAR) containing the actual connector logic at 
`io.trino.plugin.hudi.*` — `HudiConnectorFactory`, `HudiConnector`, 
`HudiMetadata`, `HudiSplitManager`, `HudiPageSourceProvider`, all index-support 
strategies, the `HoodieStorage`/`HoodieIOFactory` bridges to Trino's 
filesystem, etc. The artifact is built against the latest Trino release's SPI; 
it declares `hudi-common`, `hudi-io`, etc. as transitive dependencies and 
Trino's `trino-spi`, `trino-filesystem`, etc. as `provided`.
+
+The first publication ships in **Hudi 1.3.0**. The Trino-side shim PR pins 
`org.apache.hudi:hudi-trino:1.3.0`. Going forward, all Trino-Hudi connector 
evolution happens in Hudi OSS; Trino picks up changes by bumping the dependency 
version. To support this integration model, **Hudi will increase its release 
cadence**.
+
+## Background
+
+### State of the Trino-side connector today
+
+`plugin/trino-hudi` in `trinodb/trino` is the baseline: it implements the 
standard Trino SPI (`Plugin`, `ConnectorFactory`, `Connector`, 
`ConnectorMetadata`, `ConnectorSplitManager`, `ConnectorPageSourceProvider`, 
etc.), depends on `hudi-common` and `hudi-io`, and uses Hudi's `HoodieStorage` 
abstraction (RFC-74) over Trino's `TrinoFileSystem`. It has no direct Hadoop imports.
+
+### State of the Hudi-side `hudi-trino-plugin` work
+
+A more advanced version of the connector exists in Hudi-side branches under 
`hudi-trino-plugin/` (same `io.trino.plugin.hudi.*` package, built against a 
recent Trino release). On top of the Trino-OSS baseline it adds:
+
+- Eight `HudiIndexSupport` strategies (column stats, partition stats, 
record-level, secondary, expression, bloom, bucket, partition bloom) for file- 
and partition-level pruning via metadata tables.
+- Metadata-table-driven partition discovery (async, resumable).
+- MOR record-level merging via `HoodieFileGroupReader` 
(`HudiTrinoReaderContext`).
+- Lazy commit-time on `HudiTableHandle` for snapshot-isolated MOR reads across 
workers.
+- Background, weighted split generation; size-based split weighting; 
multi-reader routing (`HudiPageSource` for MOR, `HudiBaseFileOnlyPageSource` 
for COW/RO).
+- File-system cache integration.
+- `HoodieStorage`/`HoodieIOFactory` bridges over `TrinoFileSystem` 
(`HudiTrinoStorage`, `HudiTrinoInlineStorage`, `HudiTrinoIOFactory`).
+
+This is the body of code that will move into the `hudi-trino` Maven module on 
the Hudi side.
+
+### Why a "shim + Hudi-published artifact" pattern
+
+This pattern decouples Trino-Hudi connector evolution from the Trino-side 
release cycle:
+
+- The Hudi project can publish Trino-Hudi improvements with each Hudi release, 
without waiting for Trino-side reviews of every change.
+- The Trino-side surface shrinks to a stable plugin-registration shim, so 
Trino-side review burden is minimal — typically a one-line version bump per 
Hudi release.
+- All Hudi-Trino integration code (`io.trino.plugin.hudi.*`) is co-located 
with the Hudi core libraries it depends on. Changes that cross the 
Hudi-internal / connector boundary can land atomically.
+- The artifact is **purpose-built for Trino** and implements Trino's SPI 
directly, so no intermediate adapter layer is needed between the published 
artifact and the Trino plugin.
+
+Trino's `trino-spi` is governed by `revapi-maven-plugin` (see 
`core/trino-spi/pom.xml`), which enforces backward compatibility on the SPI 
surface. This is what makes a single `hudi-trino` artifact targeting the latest 
Trino release viable across multiple subsequent Trino releases.
+
+Trino loads each plugin in an isolated `URLClassLoader`. Transitive 
dependencies of `hudi-trino` (Avro, Parquet, etc.) are isolated to the plugin's 
classloader and cannot conflict with other plugins.
+
+## Implementation
+
+### Architecture
+
+```
+trinodb/trino : plugin/trino-hudi   (packaging = trino-plugin)
+    HudiPlugin.java       ← thin shim: trivial Plugin SPI registration
+    META-INF/services/io.trino.spi.Plugin
+    src/test/java/...     ← full Trino-side test suite
+    pom.xml               ← depends on org.apache.hudi:hudi-trino:1.3.0
+                                       │
+                                       │  Maven Central
+                                       ▼
+apache/hudi : hudi-trino-plugin/    (Maven profile -Phudi-trino,
+                                     excluded from default reactor,
+                                     JDK 25 required)
+    io.trino.plugin.hudi.*           ← all connector logic:
+        HudiConnectorFactory, HudiConnector, HudiMetadata,
+        HudiSplitManager, HudiPageSourceProvider,
+        cache/, file/, io/, partition/,
+        query/ (incl. 8 index-support strategies),
+        reader/, split/, stats/, storage/, util/
+    src/test/java/...                 ← full duplicated + expanded suite
+  Published as: org.apache.hudi:hudi-trino:1.3.0
+```
+
+### What lives where
+
+#### Trino-side `plugin/trino-hudi` (the shim)
+
+| File | Purpose |
+|---|---|
+| `src/main/java/io/trino/plugin/hudi/HudiPlugin.java` | Implements 
`io.trino.spi.Plugin`. Single method returning `new HudiConnectorFactory()` 
(from the `hudi-trino` artifact). ~10 lines. |
+| `src/main/resources/META-INF/services/io.trino.spi.Plugin` | Service-loader 
pointer to `io.trino.plugin.hudi.HudiPlugin`. |
+| `pom.xml` | `<packaging>trino-plugin</packaging>`; pins 
`org.apache.hudi:hudi-trino:<version>`; SPI deps as `provided`. |
+| `src/test/java/...` | All current Trino-side tests stay: `HudiQueryRunner`, 
`TestHudiSmokeTest`, `TestHudiMinioConnectorSmokeTest`, 
`TestHudiConnectorTest`, `TestHudiSharedMetastore`, `TestHudiSystemTables`, 
`TestHudiPlugin`, `TestHudiConfig`, plus data initializers. Required by the 
Trino-side test-coverage commitment. |
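+
+As a rough illustration of the `pom.xml` row above, the shim's build file could look something like the sketch below. This is a minimal sketch under assumptions rather than the actual Trino-side POM: the `dep.hudi.version` property name is hypothetical, and the parent declaration, the versions managed by Trino's root POM, and the test dependencies are omitted.
+
+```xml
+<!-- Sketch of plugin/trino-hudi/pom.xml; not the actual Trino-side file. -->
+<project xmlns="http://maven.apache.org/POM/4.0.0">
+  <modelVersion>4.0.0</modelVersion>
+  <!-- <parent> pointing at Trino's root POM omitted for brevity -->
+  <artifactId>trino-hudi</artifactId>
+  <packaging>trino-plugin</packaging>
+
+  <properties>
+    <!-- Hypothetical property name; bumping it is the one-line upgrade -->
+    <dep.hudi.version>1.3.0</dep.hudi.version>
+  </properties>
+
+  <dependencies>
+    <!-- Connector logic published by the Hudi project -->
+    <dependency>
+      <groupId>org.apache.hudi</groupId>
+      <artifactId>hudi-trino</artifactId>
+      <version>${dep.hudi.version}</version>
+    </dependency>
+    <!-- SPI is supplied by the Trino runtime; version assumed to be
+         managed by the omitted parent POM -->
+    <dependency>
+      <groupId>io.trino</groupId>
+      <artifactId>trino-spi</artifactId>
+      <scope>provided</scope>
+    </dependency>
+  </dependencies>
+</project>
+```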
+
+#### Hudi-side `hudi-trino-plugin/` (the engine)
+
+Everything else from the current `hudi-trino-plugin/` work, organized exactly 
as it is today:
+
+| Subpackage | Responsibility |
+|---|---|
+| `io.trino.plugin.hudi` | `HudiConnectorFactory`, `HudiConnector`, 
`HudiMetadata`, `HudiSplitManager`, `HudiPageSourceProvider`, `HudiSplit`, 
`HudiTableHandle`, `HudiModule`, `HudiConfig`, `HudiSessionProperties`, 
`HudiTableProperties`, `HudiTransactionManager`, `HudiMetadataFactory`. |
+| `.cache` | `HudiCacheKeyProvider` for file-system cache integration. |
+| `.file` | `HudiBaseFile`, `HudiLogFile`, file metadata abstractions. |
+| `.io` | `HudiTrinoIOFactory` (extends `HoodieIOFactory`), 
`HudiTrinoFileReaderFactory`, `TrinoSeekableDataInputStream`. |
+| `.partition` | `HudiPartitionInfo`, `HiveHudiPartitionInfo`, 
`HudiPartitionInfoLoader` (async resumable task). |
+| `.query` | `HudiDirectoryLister`, `HudiReadOptimizedDirectoryLister`, 
`HudiSnapshotDirectoryLister`; `query.index` package with 8 `HudiIndexSupport` 
strategies. |
+| `.reader` | `HudiTrinoReaderContext extends 
HoodieReaderContext<IndexedRecord>` for MOR record merging. |
+| `.split` | `HudiSplitFactory`, `HudiBackgroundSplitLoader`, 
`HudiSplitSource`, `HudiSplitWeightProvider`, `SizeBasedSplitWeightProvider`. |
+| `.stats` | `HudiTableStatistics`, `TableStatisticsReader`. |
+| `.storage` | `HudiTrinoStorage` (extends `HoodieStorage`), 
`HudiTrinoInlineStorage`, `TrinoStorageConfiguration`. |
+| `.util` | Serialization helpers, column synthesis, tuple-domain conversion, 
table-type utilities. |
+
+### API boundary
+
+The boundary between the shim and the published artifact is **Trino's SPI 
itself** — no intermediate API layer is introduced.
+
+- **Shim → artifact:** `HudiPlugin.getConnectorFactories()` returns `new 
HudiConnectorFactory()` defined in the artifact. Trino's runtime then calls 
`factory.create(catalogName, config, context)`. The `ConnectorContext` argument 
carries everything the artifact needs — `TypeManager`, `NodeManager`, 
`MetadataProvider`, `PageSorter`, `PageIndexerFactory`, `OpenTelemetry`, 
`Tracer`, `CatalogHandle` — without the artifact importing implementation 
classes.
+- **Artifact → Trino:** the artifact's `HudiConnector` exposes the standard 
SPI providers (`ConnectorMetadata`, `ConnectorSplitManager`, 
`ConnectorPageSourceProvider`, etc.). Trino calls these. Classloader context is 
handled by the standard `ClassLoaderSafe*` wrappers 
(`io.trino.plugin.base.classloader.*`) — already used today.
+
+### Maven dependencies for `hudi-trino`
+
+- **`compile`:** Hudi libs (`hudi-common`, `hudi-io`, `hudi-hive-sync`, 
`hudi-sync-common`) and Trino libs (`trino-filesystem`, `trino-hive`, 
`trino-metastore`, `trino-parquet`, `trino-cache`), Guice, Airlift, Caffeine.
+- **`provided`:** `trino-spi`, `slice`, Jackson, OpenTelemetry API, JOL 
(supplied by Trino at runtime).
+- **`runtime`:** log-manager, Dropwizard metrics, OpenTelemetry SDK, 
`trino-hive-formats`.
+- **`test`:** Trino testing libs (`trino-testing`, `trino-main`, 
`trino-testing-containers`, `trino-hdfs`), AssertJ, JUnit 5, Hudi test JARs.
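+
+The scopes listed above might translate into declarations along these lines. This is a sketch showing one representative dependency for two of the scopes, not the module's actual POM; `trino.version` is an assumed property name, and the remaining dependencies would follow the same pattern.
+
+```xml
+<dependencies>
+  <!-- compile (default scope): becomes a transitive dependency of hudi-trino -->
+  <dependency>
+    <groupId>org.apache.hudi</groupId>
+    <artifactId>hudi-common</artifactId>
+    <version>${project.version}</version>
+  </dependency>
+  <!-- provided: on the compile classpath only; Trino supplies it at runtime -->
+  <dependency>
+    <groupId>io.trino</groupId>
+    <artifactId>trino-spi</artifactId>
+    <!-- assumed property holding the targeted Trino release -->
+    <version>${trino.version}</version>
+    <scope>provided</scope>
+  </dependency>
+</dependencies>
+```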
+
+**Version alignment policy.** Trino versions are authoritative for shared 
libraries (Avro, Parquet, Jackson, Airlift). The `hudi-trino` POM pins these 
via `<dependencyManagement>` to whatever the targeted Trino release uses. If 
Hudi internals need a newer version, the fix is on the Hudi side or via a 
Trino-version bump — never by shipping divergent classpath versions.
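+
+One way to express this policy in the `hudi-trino` POM is sketched below. It assumes a hypothetical property, `trino.avro.version`, set to whatever Avro version the targeted Trino release uses; no concrete version is implied here.
+
+```xml
+<dependencyManagement>
+  <dependencies>
+    <!-- Pin Avro to the version used by the targeted Trino release,
+         overriding whatever Hudi's own modules would bring in transitively -->
+    <dependency>
+      <groupId>org.apache.avro</groupId>
+      <artifactId>avro</artifactId>
+      <!-- hypothetical property, defined alongside the Trino version -->
+      <version>${trino.avro.version}</version>
+    </dependency>
+  </dependencies>
+</dependencyManagement>
+```
+
+The same pattern would repeat for the other shared libraries named above (Parquet, Jackson, Airlift).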
+
+### Build target on Hudi side
+
+Trino requires Java 25, while the rest of Hudi targets a lower Java floor. 
`hudi-trino-plugin` therefore lives behind a Maven profile (`-Phudi-trino`) and 
is **excluded from the default `mvn install` reactor**:
+
+```xml
+<profile>
+  <id>hudi-trino</id>
+  <modules>
+    <module>hudi-trino-plugin</module>
+  </modules>
+</profile>
+```
+
+The default build (`mvn install`) skips it; the Trino-targeted build (`mvn install 
-Phudi-trino`) requires JDK 25.
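+
+If the JDK requirement should fail fast with a clear message rather than surface as a compiler error, one option is an enforcer rule inside the module's own `pom.xml`. The fragment below is only a sketch of that idea and an assumption on top of this RFC; it uses the standard `maven-enforcer-plugin` rule `requireJavaVersion`, and the execution id is illustrative.
+
+```xml
+<build>
+  <plugins>
+    <plugin>
+      <groupId>org.apache.maven.plugins</groupId>
+      <artifactId>maven-enforcer-plugin</artifactId>
+      <executions>
+        <execution>
+          <!-- illustrative id -->
+          <id>enforce-jdk-25</id>
+          <goals>
+            <goal>enforce</goal>
+          </goals>
+          <configuration>
+            <rules>
+              <!-- Fail the build when run on a JDK older than 25 -->
+              <requireJavaVersion>
+                <version>[25,)</version>
+              </requireJavaVersion>
+            </rules>
+          </configuration>
+        </execution>
+      </executions>
+    </plugin>
+  </plugins>
+</build>
+```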
+
+### CI
+
+Two new GitHub Actions workflows on the Hudi side, required for any change touching 
`hudi-trino-plugin/**`:
+
+1. **`hudi-trino-ci.yml`** — runs the full test suite via `mvn verify 
-Phudi-trino` on JDK 25. Catches regressions before they ship in a Hudi release.
+2. **`hudi-trino-compat.yml`** — nightly: pulls latest `trinodb/trino` master, 
builds Trino's relevant modules, then compiles `hudi-trino-plugin` against 
them. Compile-only; flags SPI drift before the next Trino release.

Review Comment:
   Yes, we'll pull the code from main in both repos, compile, and run tests in 
the GH action to confirm compatibility.


