GitHub user vinothchandar added a comment to the discussion: Making Hudi Github
Project More Agentic
I think this is indeed a very important step for the Hudi community to take. I
feel there is a fundamental gap here that an AI-native experience can bridge
for the project. For many years now, we've had this divide where users want
advanced, powerful features, but don't want to deal with the complexity of
learning and tuning configs. Without AI, resolving this was unfortunately not
possible. As with any other database, you get a system that can do many things
powerfully, but then you also need to spend the time learning the system.
As the data lake became a lakehouse, data engineers needed to level up and
become, in effect, data lake administrators. Unfortunately, this never
happened. Even today, engineers are sold completely closed solutions for the
sake of ease of use.
But I think AI presents us an opportunity to bridge this gap once and for all
for open source. I would like to reimagine the interface for the project, so
that any user building a lakehouse gets a Claude Code- or Codex-like terminal
experience.
With lots of will and a little bit of skill, I think we can get there much
faster than we expect. The vision: you create, build, optimize, deploy --
everything from a terminal agentic CLI.
For starters:
- Fully port the hudi-cli to this model, with the ability to connect to the
local fs, cloud storage locations, and catalogs. That would help the entire
community do more without breaking a sweat.
- Build version-aware RAG over hudi-docs (foundation models don't get
versioning right), which can help with config tuning, along with an
`optimize` skill.
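To make the second bullet concrete, here is a minimal sketch of what "version-aware" retrieval could mean: doc chunks are tagged with the Hudi versions they apply to, and retrieval hard-filters by the version the user's table runs on before ranking. Everything here (the `DocChunk` shape, the naive term-overlap ranking) is illustrative, not an existing Hudi component:

```python
# Illustrative sketch: version-aware retrieval over doc chunks.
# A real system would use embeddings; the key idea is the version
# filter, so 0.x config advice never leaks into 1.x answers.
from dataclasses import dataclass

@dataclass
class DocChunk:
    text: str
    versions: set[str]  # Hudi versions this passage applies to

def retrieve(query_terms: set[str], version: str,
             corpus: list[DocChunk], k: int = 3) -> list[DocChunk]:
    # 1) hard-filter candidates by the table's Hudi version
    candidates = [c for c in corpus if version in c.versions]
    # 2) rank survivors by naive term overlap with the query
    scored = sorted(
        candidates,
        key=lambda c: len(query_terms & set(c.text.lower().split())),
        reverse=True,
    )
    return scored[:k]
```

An `optimize` skill could then feed the retrieved, version-matched passages into the model's context instead of letting it free-associate config names across releases.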
I built a toy MCP server with some 27 tools over the weekend, with
auto-discovery of tables (local fs for now; we can bring catalogs and cloud
storage in there pretty easily). I think we can even make
`@hudi/table/<table_name>` work as MCP resources.
```
[13:25:58] hudi-agent % claude
▐▛███▜▌ Claude Code v2.1.92
...
...
❯ what hudi tables do I have?
⏺ hudi - hudi_list_tables (MCP)
⎿ {
"result": "Discovered 11 table(s):\n hudi_demo_table
COPY_ON_WRITE /private/tmp/hudi-agent-demo/demo_table\n customers COPY_ON_WRITE
/private/tmp/hudi-datalake/custom
… +9 lines (ctrl+o to expand)
⏺ You have 11 Hudi tables across two storage locations:
┌─────────────────┬──────┬──────────────────────────────────────────┐
│ Table │ Type │ Path │
├─────────────────┼──────┼──────────────────────────────────────────┤
│ hudi_demo_table │ COW │ /private/tmp/hudi-agent-demo/demo_table │
├─────────────────┼──────┼──────────────────────────────────────────┤
│ customers │ COW │ /private/tmp/hudi-datalake/customers │
├─────────────────┼──────┼──────────────────────────────────────────┤
│ shipments │ COW │ /private/tmp/hudi-datalake/shipments │
├─────────────────┼──────┼──────────────────────────────────────────┤
│ payments │ COW │ /private/tmp/hudi-datalake/payments │
├─────────────────┼──────┼──────────────────────────────────────────┤
│ products │ COW │ /private/tmp/hudi-datalake/products │
├─────────────────┼──────┼──────────────────────────────────────────┤
│ user_sessions │ MOR │ /private/tmp/hudi-datalake/user_sessions │
├─────────────────┼──────┼──────────────────────────────────────────┤
│ audit_log │ COW │ /private/tmp/hudi-datalake/audit_log │
├─────────────────┼──────┼──────────────────────────────────────────┤
│ clickstream │ MOR │ /private/tmp/hudi-datalake/clickstream │
├─────────────────┼──────┼──────────────────────────────────────────┤
│ inventory │ MOR │ /private/tmp/hudi-datalake/inventory │
├─────────────────┼──────┼──────────────────────────────────────────┤
│ orders │ COW │ /private/tmp/hudi-datalake/orders │
├─────────────────┼──────┼──────────────────────────────────────────┤
│ reviews │ MOR │ /private/tmp/hudi-datalake/reviews │
└─────────────────┴──────┴──────────────────────────────────────────┘
6 are Copy-on-Write and 4 are Merge-on-Read tables. Most live under
/private/tmp/hudi-datalake/. Want to inspect any of
them?
❯ Analyze layout of table clickstream
⏺ hudi - hudi_describe_table (MCP)(table_name: "clickstream")
⎿ {
"result": "[\n {\n \"Property\": \"basePath\",\n \"Value\":
\"/private/tmp/hudi-datalake/clickstream\"\n },\n {\n \"Property\": \"metaPath\",\n \"Value\":
\"/private/tmp/hudi-datalake/clickstream/.hoodie\"\n
… +25 lines (ctrl+o to expand)
⏺ hudi - hudi_fetch_schema (MCP)(table_name: "clickstream")
⎿ {
"result": "{\n \"type\" : \"record\",\n \"name\" :
\"clickstream_record\",\n \"namespace\" : \"hoodie.clickstream\",\n \"fields\" : [ {\n \"name\" : \"_hoodie_commit_time\",\n
\"type\" : [ \"null\", \"string\" ],
… +10 lines (ctrl+o to expand)
⏺ hudi - hudi_timeline_active (MCP)(table_name: "clickstream")
⎿ {
"result": "[\n {\n \"Instant\": \"20260404203259123\",\n
\"Action\": \"deltacommit\",\n \"State\": \"COMPLETED\"\n }\n]"
}
⏺ hudi - hudi_metadata_partitions (MCP)(table_name: "clickstream")
⎿ {
"result": "[\n \"dt=2024-01-01\",\n \"dt=2024-01-02\",\n
\"dt=2024-01-03\"\n]"
}
⏺ hudi - hudi_metadata_stats (MCP)(table_name: "clickstream")
⎿ {
"result": "[]"
}
⏺ hudi - hudi_commits_show (MCP)(table_name: "clickstream")
⎿ {
"result": "[\n {\n \"CommitTime\": \"20260404203259123\",\n
\"Action\": \"deltacommit\",\n \"TotalBytesWritten\": 1307319,\n \"TotalFilesInsert\": 3,\n
\"TotalFilesUpdated\": 0,\n \"Partitions\": 3,\n
… +2 lines (ctrl+o to expand)
⏺ hudi - hudi_metadata_files (MCP)(table_name: "clickstream")
⎿ {
"result": "[]"
}
⏺ hudi - hudi_compactions_show (MCP)(table_name: "clickstream")
⎿ {
"result": "[]"
}
⏺ Here's the full layout analysis for clickstream:
---
Table Identity
┌─────────────────┬────────────────────────────────────────┐
│ Property │ Value │
├─────────────────┼────────────────────────────────────────┤
│ Type │ Merge-on-Read (MOR) │
├─────────────────┼────────────────────────────────────────┤
│ Base path │ /private/tmp/hudi-datalake/clickstream │
├─────────────────┼────────────────────────────────────────┤
│ Record key │ id │
├─────────────────┼────────────────────────────────────────┤
│ Partition field │ dt (hive-style) │
├─────────────────┼────────────────────────────────────────┤
│ Ordering field │ ts │
├─────────────────┼────────────────────────────────────────┤
│ Merge mode │ EVENT_TIME_ORDERING │
├─────────────────┼────────────────────────────────────────┤
│ Key generator │ SIMPLE │
├─────────────────┼────────────────────────────────────────┤
│ Table version │ 9 │
├─────────────────┼────────────────────────────────────────┤
│ Timeline layout │ v2 │
└─────────────────┴────────────────────────────────────────┘
Schema (6 user fields + 5 meta fields)
┌─────────┬─────────────────────────────────────┐
│ Field │ Type │
├─────────┼─────────────────────────────────────┤
│ id │ int (nullable) │
├─────────┼─────────────────────────────────────┤
│ user_id │ int (nullable) │
├─────────┼─────────────────────────────────────┤
│ page │ string (nullable) │
├─────────┼─────────────────────────────────────┤
│ action │ string (nullable) │
├─────────┼─────────────────────────────────────┤
│ ts │ long (nullable) — ordering field │
├─────────┼─────────────────────────────────────┤
│ dt │ string (nullable) — partition field │
└─────────┴─────────────────────────────────────┘
Plus the standard _hoodie_commit_time, _hoodie_commit_seqno,
_hoodie_record_key, _hoodie_partition_path,
_hoodie_file_name meta fields.
Partitions (3)
- dt=2024-01-01
- dt=2024-01-02
- dt=2024-01-03
Timeline & Commits
Single completed deltacommit at 20260404203259123:
- 8 records inserted across 3 partitions
- 3 log files written (one per partition — typical for MOR inserts)
- ~1.3 MB total bytes written
- 0 updates, 0 write errors
Metadata Table
Enabled indexes: files, column_stats, partition_stats
Compactions
None pending. Since there's only one deltacommit with log files and no base
(parquet) files yet, the table is entirely
in delta log form. A compaction would convert the logs into columnar base
files — something to consider if you plan to
run read-optimized queries.
---
Observations
1. Very young table — single commit, 8 records. Essentially a fresh table
with seed data.
2. All data in log files — no base files exist yet. Snapshot queries will
read logs directly; read-optimized queries
would return nothing until a compaction runs.
3. Even distribution — ~2-3 records per partition across the 3 date
partitions.
4. Good metadata setup — column_stats and partition_stats are enabled, which
will help with data skipping as the table
grows.
Would you like me to dig deeper into any specific aspect, or look at the
file-level details within a partition?
✻ Cooked for 33s
```
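For a sense of what the table auto-discovery behind a tool like `hudi_list_tables` could boil down to, here is a minimal sketch: any directory containing a `.hoodie` metadata folder is treated as a Hudi table, and `hoodie.properties` supplies the name and type. This is pure Python with no MCP wiring, and the function names are illustrative, not the toy server's actual code:

```python
# Hypothetical sketch of local-fs table discovery for an agentic tool.
# A `.hoodie` directory marks a Hudi table; hoodie.properties inside it
# carries hoodie.table.name and hoodie.table.type (COPY_ON_WRITE / MERGE_ON_READ).
from pathlib import Path

def discover_hudi_tables(root: str) -> list[dict]:
    tables = []
    for meta in Path(root).rglob(".hoodie"):
        if not meta.is_dir():
            continue
        base = meta.parent
        if ".hoodie" in base.parts:
            # skip Hudi's internal metadata table nested under .hoodie/
            continue
        props = {}
        props_file = meta / "hoodie.properties"
        if props_file.exists():
            for line in props_file.read_text().splitlines():
                if "=" in line and not line.startswith("#"):
                    key, _, value = line.partition("=")
                    props[key.strip()] = value.strip()
        tables.append({
            "name": props.get("hoodie.table.name", base.name),
            "type": props.get("hoodie.table.type", "UNKNOWN"),
            "path": str(base),
        })
    return tables
```

Exposing each discovered table as an MCP resource (the `@hudi/table/<table_name>` idea) would then just mean registering one resource URI per entry this scan returns.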
GitHub link:
https://github.com/apache/hudi/discussions/18324#discussioncomment-16468082