GitHub user vinothchandar added a comment to the discussion: Making Hudi Github
Project More Agentic
I think this is indeed a very important step for the Hudi community to take. I
feel there is a fundamental gap here that an AI-native experience can bridge
for the project. For many years now, we've had this divide where users want
advanced, powerful features, but don't want to deal with the complexity of
learning and tuning configs. Without AI, resolving this was unfortunately not
possible. As with any other database, you get a system that can do many things
powerfully, but then you also need to spend the time learning the system.
As the data lake became a lakehouse, data engineers needed to level up and
become, in effect, data lake administrators. Unfortunately, this never
happened. Even today, engineers are sold completely closed solutions for the
sake of ease of use.
But I think AI presents us an opportunity to bridge this gap once and for all
for open source. I would like to reimagine the interface for the project, so
that any user building a lakehouse gets a Claude Code- or Codex-like terminal
experience.
With lots of will and a little bit of skill, I think we can get there much
faster than we expect. The vision: you create, build, optimize, deploy --
everything from a terminal agentic CLI.
For starters:
- Fully port the hudi-cli to this model, with the ability to connect to the
local fs, cloud storage locations, and catalogs. That would help the entire
community do more without breaking a sweat.
- Build version-aware RAG over hudi-docs (foundation models don't get
versioning right), which can help with config tuning, along with an
`optimize` skill.
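To make the second bullet concrete, here is a minimal sketch of what "version-aware" retrieval could mean: doc chunks are tagged with the Hudi versions they apply to, and retrieval hard-filters by the version the user's table runs on before ranking. Everything here (the `DocChunk` shape, the naive term-overlap ranking) is illustrative, not an existing Hudi component:

```python
# Illustrative sketch: version-aware retrieval over doc chunks.
# A real system would use embeddings; the key idea is the version
# filter, so 0.x config advice never leaks into 1.x answers.
from dataclasses import dataclass

@dataclass
class DocChunk:
    text: str
    versions: set[str]  # Hudi versions this passage applies to

def retrieve(query_terms: set[str], version: str,
             corpus: list[DocChunk], k: int = 3) -> list[DocChunk]:
    # 1) hard-filter candidates by the table's Hudi version
    candidates = [c for c in corpus if version in c.versions]
    # 2) rank survivors by naive term overlap with the query
    scored = sorted(
        candidates,
        key=lambda c: len(query_terms & set(c.text.lower().split())),
        reverse=True,
    )
    return scored[:k]
```

An `optimize` skill could then feed the retrieved, version-matched passages into the model's context instead of letting it free-associate config names across releases.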
I built a toy MCP server with some 27 tools over the weekend, with
auto-discovery of tables (local fs for now; we can bring catalogs and cloud
storage in there pretty easily). I think we can even make
`@hudi/table/<table_name>` work as MCP resources.
```
[13:25:58] hudi-agent % claude
▐▛███▜▌ Claude Code v2.1.92
...
...
❯ what hudi tables do I have?
⏺ hudi - hudi_list_tables (MCP)
⎿ {
"result": "Discovered 11 table(s):\n hudi_demo_table
COPY_ON_WRITE /private/tmp/hudi-agent-demo/demo_table\n customers COPY_ON_WRITE
/private/tmp/hudi-datalake/custom
… +9 lines (ctrl+o to expand)
⏺ You have 11 Hudi tables across two storage locations:
┌─────────────────┬──────┬──────────────────────────────────────────┐
│ Table │ Type │ Path │
├─────────────────┼──────┼──────────────────────────────────────────┤
│ hudi_demo_table │ COW │ /private/tmp/hudi-agent-demo/demo_table │
├─────────────────┼──────┼──────────────────────────────────────────┤
│ customers │ COW │ /private/tmp/hudi-datalake/customers │
├─────────────────┼──────┼──────────────────────────────────────────┤
│ shipments │ COW │ /private/tmp/hudi-datalake/shipments │
├─────────────────┼──────┼──────────────────────────────────────────┤
│ payments │ COW │ /private/tmp/hudi-datalake/payments │
├─────────────────┼──────┼──────────────────────────────────────────┤
│ products │ COW │ /private/tmp/hudi-datalake/products │
├─────────────────┼──────┼──────────────────────────────────────────┤
│ user_sessions │ MOR │ /private/tmp/hudi-datalake/user_sessions │
├─────────────────┼──────┼──────────────────────────────────────────┤
│ audit_log │ COW │ /private/tmp/hudi-datalake/audit_log │
├─────────────────┼──────┼──────────────────────────────────────────┤
│ clickstream │ MOR │ /private/tmp/hudi-datalake/clickstream │
├─────────────────┼──────┼──────────────────────────────────────────┤
│ inventory │ MOR │ /private/tmp/hudi-datalake/inventory │
├─────────────────┼──────┼──────────────────────────────────────────┤
│ orders │ COW │ /private/tmp/hudi-datalake/orders │
├─────────────────┼──────┼──────────────────────────────────────────┤
│ reviews │ MOR │ /private/tmp/hudi-datalake/reviews │
└─────────────────┴──────┴──────────────────────────────────────────┘
6 are Copy-on-Write and 4 are Merge-on-Read tables. Most live under
/private/tmp/hudi-datalake/. Want to inspect any of
them?
❯ Analyze layout of table clickstream
⏺ hudi - hudi_describe_table (MCP)(table_name: "clickstream")
⎿ {
"result": "[\n {\n \"Property\": \"basePath\",\n \"Value\":
\"/private/tmp/hudi-datalake/clickstream\"\n },\n {\n \"Property\": \"metaPath\",\n \"Value\":
\"/private/tmp/hudi-datalake/clickstream/.hoodie\"\n
… +25 lines (ctrl+o to expand)
⏺ hudi - hudi_fetch_schema (MCP)(table_name: "clickstream")
⎿ {
"result": "{\n \"type\" : \"record\",\n \"name\" :
\"clickstream_record\",\n \"namespace\" : \"hoodie.clickstream\",\n \"fields\" : [ {\n \"name\" : \"_hoodie_commit_time\",\n
\"type\" : [ \"null\", \"string\" ],
… +10 lines (ctrl+o to expand)
⏺ hudi - hudi_timeline_active (MCP)(table_name: "clickstream")
⎿ {
"result": "[\n {\n \"Instant\": \"20260404203259123\",\n
\"Action\": \"deltacommit\",\n \"State\": \"COMPLETED\"\n }\n]"
}
⏺ hudi - hudi_metadata_partitions (MCP)(table_name: "clickstream")
⎿ {
"result": "[\n \"dt=2024-01-01\",\n \"dt=2024-01-02\",\n
\"dt=2024-01-03\"\n]"
}
⏺ hudi - hudi_metadata_stats (MCP)(table_name: "clickstream")
⎿ {
"result": "[]"
}
⏺ hudi - hudi_commits_show (MCP)(table_name: "clickstream")
⎿ {
"result": "[\n {\n \"CommitTime\": \"20260404203259123\",\n
\"Action\": \"deltacommit\",\n \"TotalBytesWritten\": 1307319,\n \"TotalFilesInsert\": 3,\n
\"TotalFilesUpdated\": 0,\n \"Partitions\": 3,\n
… +2 lines (ctrl+o to expand)
⏺ hudi - hudi_metadata_files (MCP)(table_name: "clickstream")
⎿ {
"result": "[]"
}
⏺ hudi - hudi_compactions_show (MCP)(table_name: "clickstream")
⎿ {
"result": "[]"
}
⏺ Here's the full layout analysis for clickstream:
---
Table Identity
┌─────────────────┬────────────────────────────────────────┐
│ Property │ Value │
├─────────────────┼────────────────────────────────────────┤
│ Type │ Merge-on-Read (MOR) │
├─────────────────┼────────────────────────────────────────┤
│ Base path │ /private/tmp/hudi-datalake/clickstream │
├─────────────────┼────────────────────────────────────────┤
│ Record key │ id │
├─────────────────┼────────────────────────────────────────┤
│ Partition field │ dt (hive-style) │
├─────────────────┼────────────────────────────────────────┤
│ Ordering field │ ts │
├─────────────────┼────────────────────────────────────────┤
│ Merge mode │ EVENT_TIME_ORDERING │
├─────────────────┼────────────────────────────────────────┤
│ Key generator │ SIMPLE │
├─────────────────┼────────────────────────────────────────┤
│ Table version │ 9 │
├─────────────────┼────────────────────────────────────────┤
│ Timeline layout │ v2 │
└─────────────────┴────────────────────────────────────────┘
Schema (6 user fields + 5 meta fields)
┌─────────┬─────────────────────────────────────┐
│ Field │ Type │
├─────────┼─────────────────────────────────────┤
│ id │ int (nullable) │
├─────────┼─────────────────────────────────────┤
│ user_id │ int (nullable) │
├─────────┼─────────────────────────────────────┤
│ page │ string (nullable) │
├─────────┼─────────────────────────────────────┤
│ action │ string (nullable) │
├─────────┼─────────────────────────────────────┤
│ ts │ long (nullable) — ordering field │
├─────────┼─────────────────────────────────────┤
│ dt │ string (nullable) — partition field │
└─────────┴─────────────────────────────────────┘
Plus the standard _hoodie_commit_time, _hoodie_commit_seqno,
_hoodie_record_key, _hoodie_partition_path,
_hoodie_file_name meta fields.
Partitions (3)
- dt=2024-01-01
- dt=2024-01-02
- dt=2024-01-03
Timeline & Commits
Single completed deltacommit at 20260404203259123:
- 8 records inserted across 3 partitions
- 3 log files written (one per partition — typical for MOR inserts)
- ~1.3 MB total bytes written
- 0 updates, 0 write errors
Metadata Table
Enabled indexes: files, column_stats, partition_stats
Compactions
None pending. Since there's only one deltacommit with log files and no base
(parquet) files yet, the table is entirely
in delta log form. A compaction would convert the logs into columnar base
files — something to consider if you plan to
run read-optimized queries.
---
Observations
1. Very young table — single commit, 8 records. Essentially a fresh table
with seed data.
2. All data in log files — no base files exist yet. Snapshot queries will
read logs directly; read-optimized queries
would return nothing until a compaction runs.
3. Even distribution — ~2-3 records per partition across the 3 date
partitions.
4. Good metadata setup — column_stats and partition_stats are enabled, which
will help with data skipping as the table
grows.
Would you like me to dig deeper into any specific aspect, or look at the
file-level details within a partition?
✻ Cooked for 33s
```
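For a sense of what the table auto-discovery behind a tool like `hudi_list_tables` could boil down to, here is a minimal sketch: any directory containing a `.hoodie` metadata folder is treated as a Hudi table, and `hoodie.properties` supplies the name and type. This is pure Python with no MCP wiring, and the function names are illustrative, not the toy server's actual code:

```python
# Hypothetical sketch of local-fs table discovery for an agentic tool.
# A `.hoodie` directory marks a Hudi table; hoodie.properties inside it
# carries hoodie.table.name and hoodie.table.type (COPY_ON_WRITE / MERGE_ON_READ).
from pathlib import Path

def discover_hudi_tables(root: str) -> list[dict]:
    tables = []
    for meta in Path(root).rglob(".hoodie"):
        if not meta.is_dir():
            continue
        base = meta.parent
        if ".hoodie" in base.parts:
            # skip Hudi's internal metadata table nested under .hoodie/
            continue
        props = {}
        props_file = meta / "hoodie.properties"
        if props_file.exists():
            for line in props_file.read_text().splitlines():
                if "=" in line and not line.startswith("#"):
                    key, _, value = line.partition("=")
                    props[key.strip()] = value.strip()
        tables.append({
            "name": props.get("hoodie.table.name", base.name),
            "type": props.get("hoodie.table.type", "UNKNOWN"),
            "path": str(base),
        })
    return tables
```

Exposing each discovered table as an MCP resource (the `@hudi/table/<table_name>` idea) would then just mean registering one resource URI per entry this scan returns.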
GitHub link:
https://github.com/apache/hudi/discussions/18324#discussioncomment-16468082