This is an automated email from the ASF dual-hosted git repository.
yihua pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new c1199fed37d3 docs: Restructure and revise 1.2.0 Unstructured Data
pages (#18876)
c1199fed37d3 is described below
commit c1199fed37d303be4e5a6a32565b9668c3086b48
Author: Y Ethan Guo <[email protected]>
AuthorDate: Thu May 28 23:18:26 2026 -0700
docs: Restructure and revise 1.2.0 Unstructured Data pages (#18876)
---
website/docs/ai_overview.md | 188 -----------
website/docs/blob_unstructured_data.md | 350 --------------------
website/docs/ingestion_flink.md | 20 ++
website/docs/lance_file_format.md | 220 -------------
website/docs/overview.mdx | 13 +-
website/docs/sql_ddl.md | 192 ++++++++++-
website/docs/sql_dml.md | 75 +++++
website/docs/sql_queries.md | 182 ++++++++++-
website/docs/storage_layouts.md | 92 ++++++
website/docs/syncing_metastore.md | 21 ++
...e.md => unstructured-data-quick-start-guide.md} | 47 +--
website/docs/use_cases.md | 35 ++
website/docs/variant_type.md | 362 ---------------------
website/docs/vector_search.md | 274 ----------------
website/docs/writing_data.md | 133 ++++++++
website/docusaurus.config.js | 4 +
website/sidebars.js | 13 +-
17 files changed, 762 insertions(+), 1459 deletions(-)
diff --git a/website/docs/ai_overview.md b/website/docs/ai_overview.md
deleted file mode 100644
index c28bcae6543a..000000000000
--- a/website/docs/ai_overview.md
+++ /dev/null
@@ -1,188 +0,0 @@
----
-title: "AI-Native Lakehouse"
-keywords: [ hudi, ai, machine learning, vector search, embeddings,
unstructured data, blob, lance, multimodal, lakehouse]
-summary: "Apache Hudi brings first-class AI and unstructured data support to
the data lakehouse — store embeddings, images, audio, and run vector search
natively"
-toc: true
-last_modified_at: 2026-04-25T00:00:00-00:00
----
-
-Modern AI workloads demand more from the data lakehouse than traditional
analytics ever did. Teams building
-retrieval-augmented generation (RAG) pipelines, recommendation systems,
content moderation, and multimodal
-search need to store and query **embeddings**, **images**, **audio**,
**video**, and **documents** alongside
-structured data — all with the transactional guarantees, incremental
processing, and table services that
-Hudi is known for.
-
-Apache Hudi's AI-native capabilities bring this vision to life with four
foundational building blocks:
-
-## Core Capabilities
-
-### VECTOR Type and Similarity Search
-
-Store high-dimensional embedding vectors as first-class column types and run
vector similarity
-search directly in Spark SQL.
-
-```sql
--- Declare an embedding column
-CREATE TABLE products (
- id STRING,
- name STRING,
- embedding VECTOR(768)
-) USING hudi;
-
--- Find the 10 most similar products
-SELECT id, name, _hudi_distance
-FROM hudi_vector_search('products', 'embedding', ARRAY(...), 10, 'cosine');
-```
-
-No external vector database required. Your embeddings live in the same table
as your structured data, governed
-by the same transactions, schemas, and access controls.
-
-**[Learn more about Vector Search →](vector_search.md)**
-
-### BLOB Type for Unstructured Data
-
-Store raw binary data (images, PDFs, audio clips, model weights) directly in
Hudi tables using the `BLOB` type.
-Choose between two storage modes:
-
-| Mode | How it works | Best for |
-|:-----|:-------------|:---------|
-| **Inline** | Bytes stored directly in the table row | Small objects
(thumbnails, short audio clips) |
-| **Out-of-line** | Table stores a pointer; `read_blob()` resolves it on
demand | Large objects (high-res images, video, model checkpoints) |
-
-```sql
-CREATE TABLE documents (
- doc_id STRING,
- content BLOB,
- summary STRING
-) USING hudi;
-
--- Read the raw bytes on demand
-SELECT doc_id, read_blob(content) AS raw_bytes FROM documents WHERE doc_id =
'doc_001';
-```
-
-Out-of-line BLOBs keep your table footprint small (often less than 1% of total
data size) while maintaining
-full queryability.
-
-**[Learn more about Unstructured Data →](blob_unstructured_data.md)**
-
-### VARIANT Type for Semi-Structured Data
-
-AI pipelines often deal with data whose shape is not known in advance: LLM
outputs, model metadata,
-feature dictionaries, API responses. The `VARIANT` type stores
semi-structured, JSON-like data with
-full transactional support — no rigid schema required.
-
-```sql
-CREATE TABLE llm_outputs (
- request_id STRING,
- response VARIANT,
- ts BIGINT
-) USING hudi;
-
--- Store any JSON structure
-INSERT INTO llm_outputs VALUES (
- 'req_001',
- parse_json('{"text": "...", "stop_reason": "end_turn", "tokens": 700}'),
- 1000
-);
-
--- Query back as JSON
-SELECT request_id, cast(response as STRING) FROM llm_outputs;
-```
-
-VARIANT supports optional **shredding** to extract hot fields into typed
columnar storage for better
-query performance, while keeping the flexibility for everything else.
-
-**[Learn more about Semi-Structured Data →](variant_type.md)**
-
-### Lance File Format
-
-Hudi's pluggable file format architecture supports **Lance**, a modern
columnar format purpose-built for
-AI/ML workloads. Lance provides:
-
-- Native vector column encoding (`FixedSizeList`) — no conversion overhead at
the file-format layer
-- Fast random access for training data sampling
-- Optimized storage for high-dimensional arrays and nested structures
-
-```sql
-CREATE TABLE embeddings (...) USING hudi
-TBLPROPERTIES (
- hoodie.datasource.write.base.file.format = 'lance'
-);
-```
-
-Lance integrates seamlessly with Hudi's table services (compaction,
clustering, cleaning) and
-works alongside existing Parquet and ORC tables.
-
-**[Learn more about the Lance File Format →](lance_file_format.md)**
-
-## Why Hudi for AI Workloads?
-
-### Unified Storage for Structured + Unstructured Data
-
-Most AI pipelines today span multiple systems: a data warehouse for metadata,
an object store for raw files,
-a vector database for embeddings, and custom glue code to keep them in sync.
Hudi collapses this into a single
-table:
-
-```
-┌─────────────────────────────────────────────────┐
-│ Hudi Table │
-│ │
-│ image_id │ breed │ embedding │ image_bytes │
-│ (STRING) │(STRING)│(VECTOR(1024))│ (BLOB) │
-│───────────┼───────┼──────────────┼──────────────│
-│ pet_001 │ Corgi │ [0.12, ...] │ <137 KB PNG> │
-│ pet_002 │ Tabby │ [-.03, ...] │ <89 KB JPEG> │
-└─────────────────────────────────────────────────┘
-```
-
-One table. One set of transactions. One schema. One set of access controls.
-
-### Incremental Processing for Embedding Pipelines
-
-When new data arrives, you do not need to re-embed your entire corpus. Hudi's
incremental query capabilities
-let you process only new or changed records:
-
-```sql
--- Get only new images since the last embedding run
-SELECT * FROM hudi_table_changes('product_images', 'latest_state',
'20260101000000');
-```
-
-This can reduce embedding pipeline costs by 10-100x compared to full
reprocessing.
-
-### Transactional Guarantees
-
-Embedding updates, metadata changes, and raw data writes happen atomically. No
more inconsistent states where
-your vector index points to deleted images or stale embeddings.
-
-### Table Services
-
-Hudi's background table services work on AI tables just like any other:
-
-- **Clustering** — co-locate similar vectors for better search locality
-- **Compaction** — merge incremental updates efficiently
-- **Cleaning** — reclaim storage from old versions
-- **Indexing** — maintain metadata indexes for fast lookups
-
-### Open Ecosystem
-
-Hudi tables are readable by Spark, Flink, Presto, Trino, and the native
Python/Rust client (hudi-rs). Your AI
-tables are not locked into a single engine or vendor.
-
-## Use Cases
-
-| Use Case | Hudi Capabilities Used |
-|:---------|:----------------------|
-| **Image/Video Search** | VECTOR embeddings + BLOB storage + cosine
similarity search |
-| **RAG (Retrieval-Augmented Generation)** | VECTOR search to retrieve
relevant document chunks for LLM context |
-| **LLM Output Management** | VARIANT for flexible response storage, VECTOR
for semantic indexing |
-| **Recommendation Systems** | VECTOR similarity for collaborative filtering,
incremental re-embedding |
-| **Content Moderation** | BLOB for raw content + VECTOR for content
embeddings + incremental processing |
-| **Multimodal Analytics** | Structured metadata + VECTOR embeddings + BLOB
raw data in one table |
-| **ML Feature Store** | VECTOR for feature embeddings, VARIANT for sparse
feature maps, time-travel for point-in-time retrieval |
-| **Experiment Tracking** | VARIANT for heterogeneous model configs and
metrics, incremental queries for latest runs |
-| **Data Labeling Pipelines** | BLOB for raw data, incremental queries for
unlabeled data, ACID for label updates |
-
-## Getting Started
-
-The fastest way to try these features is the **[AI Quick Start
Guide](ai-quick-start-guide.md)**, which walks
-you through an end-to-end image similarity search pipeline in under 30 minutes.
diff --git a/website/docs/blob_unstructured_data.md
b/website/docs/blob_unstructured_data.md
deleted file mode 100644
index 253656565084..000000000000
--- a/website/docs/blob_unstructured_data.md
+++ /dev/null
@@ -1,350 +0,0 @@
----
-title: "Unstructured Data"
-keywords: [ hudi, blob, unstructured data, images, binary, pdf, audio, video,
inline, out-of-line, read_blob]
-summary: "Store and query unstructured data (images, PDFs, audio, video) in
Hudi tables using the BLOB type with inline or out-of-line storage"
-toc: true
-last_modified_at: 2026-05-27T00:00:00-00:00
----
-
-import Tabs from '@theme/Tabs';
-import TabItem from '@theme/TabItem';
-
-The `BLOB` type lets you store raw binary data — images, PDFs, audio clips,
video files, model weights —
-directly in Hudi tables. Combined with Hudi's transactional guarantees and
table services, BLOB makes
-the data lakehouse a single source of truth for both structured and
unstructured data.
-
-## BLOB Type Overview
-
-A BLOB column stores binary data in one of two modes:
-
-| Mode | Storage | Table Footprint | Read Pattern |
-|:-----|:--------|:----------------|:-------------|
-| **Inline** | Raw bytes embedded in the table row | Larger (bytes stored in
data files) | Direct read — no external fetch |
-| **Out-of-line** | Pointer to external storage location | Very small (< 1% of
data size) | On-demand via `read_blob()` |
-
-Choose **inline** for small objects that are frequently read together
(thumbnails, short audio clips,
-small documents). Choose **out-of-line** for large objects where you typically
query metadata first and
-fetch raw data selectively (high-res images, video, model checkpoints).
-
-## Creating Tables with BLOB Columns
-
-<Tabs
-groupId="api-style"
-defaultValue="sql"
-values={[
-{ label: 'Spark SQL', value: 'sql', },
-{ label: 'DataFrame API', value: 'dataframe', },
-]}
->
-<TabItem value="sql">
-
-```sql
-CREATE TABLE media_assets (
- asset_id STRING,
- file_name STRING,
- mime_type STRING,
- file_size BIGINT,
- content BLOB
-) USING hudi
-TBLPROPERTIES (
- primaryKey = 'asset_id',
- type = 'cow'
-);
-```
-
-The `BLOB` keyword in DDL automatically configures the column with the correct
internal structure.
-
-</TabItem>
-<TabItem value="dataframe">
-
-```python
-import pyarrow as pa
-
-schema = pa.schema([
- pa.field("asset_id", pa.string()),
- pa.field("file_name", pa.string()),
- pa.field("mime_type", pa.string()),
- pa.field("file_size", pa.int64()),
- pa.field("content", pa.struct([
- pa.field("type", pa.string()),
- pa.field("data", pa.binary()),
- pa.field("reference", pa.struct([
- pa.field("external_path", pa.string()),
- pa.field("offset", pa.int64()),
- pa.field("length", pa.int64()),
- pa.field("managed", pa.bool_()),
- ])),
- ]), metadata={b"hudi_type": b"BLOB"}),
-])
-```
-
-The BLOB internal structure is a struct with three fields:
-- `type` — `"INLINE"` or `"OUT_OF_LINE"`
-- `data` — raw bytes (populated for inline, null for out-of-line)
-- `reference` — external storage pointer with subfields:
- - `external_path` — file path for out-of-line data
- - `offset` — byte offset in the file (null means read from start)
- - `length` — byte length to read (null means read to end of file)
- - `managed` — boolean. Only meaningful for `OUT_OF_LINE` blobs. Marks
whether Hudi owns the lifecycle of the referenced external file. **Not consumed
by the cleaner yet** — set the value to record intent, and a future cleaner
implementation will use it: `true` → cleaner may delete the external file when
the blob row is no longer referenced; `false` → cleaner will leave the external
file in place.
-
-</TabItem>
-</Tabs>
-
-## Writing Inline BLOBs
-
-Inline BLOBs embed the raw bytes directly in the table row.
-
-<Tabs
-groupId="api-style"
-defaultValue="sql"
-values={[
-{ label: 'Spark SQL', value: 'sql', },
-{ label: 'DataFrame API', value: 'dataframe', },
-]}
->
-<TabItem value="sql">
-
-```sql
-INSERT INTO media_assets VALUES (
- 'asset_001',
- 'logo.png',
- 'image/png',
- 45230,
- named_struct(
- 'type', 'INLINE',
- 'data', /* binary literal or column reference */,
- 'reference', CAST(NULL AS STRUCT<external_path: STRING, offset:
BIGINT, length: BIGINT, managed: BOOLEAN>)
- )
-);
-```
-
-</TabItem>
-<TabItem value="dataframe">
-
-```python
-from pyspark.sql import Row
-from pyspark.sql.types import *
-
-# Read raw bytes
-with open("logo.png", "rb") as f:
- raw_bytes = f.read()
-
-row = Row(
- asset_id="asset_001",
- file_name="logo.png",
- mime_type="image/png",
- file_size=len(raw_bytes),
- content=Row(type="INLINE", data=raw_bytes, reference=None)
-)
-```
-
-</TabItem>
-</Tabs>
-
-## Writing Out-of-Line BLOBs
-
-Out-of-line BLOBs store a pointer to data in external storage. The actual
bytes live elsewhere (e.g.,
-a binary container file on S3, a separate object store path, or a shared
filesystem). The table only
-stores the reference metadata.
-
-```sql
-INSERT INTO media_assets VALUES (
- 'asset_002',
- 'video.mp4',
- 'video/mp4',
- 1073741824, -- 1 GB
- named_struct(
- 'type', 'OUT_OF_LINE',
- 'data', CAST(NULL AS BINARY),
- 'reference', named_struct(
- 'external_path', 's3://my-bucket/media/container_001.bin',
- 'offset', 8388608, -- byte offset in the container
- 'length', 1073741824, -- number of bytes
- 'managed', false -- intent flag; not consumed by
the cleaner yet
- )
- )
-);
-```
-
-### Container File Pattern
-
-A common pattern for out-of-line storage is to pack multiple objects into a
single binary container file:
-
-```
-container_001.bin
-├── [offset=0, len=45230] → logo.png
-├── [offset=45230, len=89012] → photo.jpg
-├── [offset=134242, len=1073741824] → video.mp4
-└── ...
-```
-
-Each BLOB row stores the `(external_path, offset, length)` triple. This avoids
creating millions of
-small files on object storage and enables efficient batch access.
-
-## Reading BLOBs
-
-### Querying Metadata (No Fetch)
-
-Standard queries on a BLOB column return the descriptor — not the raw bytes:
-
-```sql
-SELECT asset_id, file_name, content.type, content.reference.external_path
-FROM media_assets;
-```
-
-```
-+----------+-----------+-------------+----------------------------------------+
-| asset_id| file_name| type| external_path|
-+----------+-----------+-------------+----------------------------------------+
-| asset_001| logo.png| INLINE| null|
-| asset_002| video.mp4| OUT_OF_LINE| s3://my-bucket/media/container_001.bin |
-+----------+-----------+-------------+----------------------------------------+
-```
-
-This is fast and lightweight — no binary data is transferred.
-
-### Resolving Raw Bytes with read_blob()
-
-Use the `read_blob()` SQL function to materialize the actual bytes:
-
-```sql
--- Returns raw binary data
-SELECT asset_id, read_blob(content) AS raw_bytes
-FROM media_assets
-WHERE asset_id = 'asset_001';
-```
-
-For **inline** BLOBs, `read_blob()` simply extracts the embedded bytes.
-
-For **out-of-line** BLOBs, `read_blob()` reads from the external path at the
specified offset and length,
-transparently fetching the data on demand.
-
-:::tip
-Use `read_blob()` selectively — filter first, then resolve. Avoid `SELECT
read_blob(content) FROM large_table`
-without a WHERE clause, as this will fetch all raw data.
-:::
-
-## Use Cases
-
-### Image Datasets for Computer Vision
-
-Store training images alongside metadata and embeddings:
-
-```sql
-CREATE TABLE training_images (
- image_id STRING,
- label STRING,
- split STRING, -- 'train', 'val', 'test'
- embedding VECTOR(1024),
- raw_image BLOB
-) USING hudi TBLPROPERTIES (...);
-
--- Get raw images for a specific label
-SELECT image_id, read_blob(raw_image) AS pixels
-FROM training_images
-WHERE label = 'cat' AND split = 'train';
-```
-
-### Document Store for RAG Pipelines
-
-Store PDF documents alongside their chunk embeddings:
-
-```sql
-CREATE TABLE knowledge_base (
- doc_id STRING,
- chunk_id STRING,
- source_url STRING,
- text STRING,
- embedding VECTOR(1536),
- original BLOB -- original PDF bytes
-) USING hudi TBLPROPERTIES (...);
-
--- Retrieve full document after vector search
-SELECT doc_id, source_url, read_blob(original) AS pdf_bytes
-FROM knowledge_base
-WHERE doc_id IN (SELECT doc_id FROM top_matches);
-```
-
-### Audio/Video Processing Pipelines
-
-```sql
-CREATE TABLE audio_clips (
- clip_id STRING,
- transcript STRING,
- duration DOUBLE,
- embedding VECTOR(512),
- audio BLOB
-) USING hudi TBLPROPERTIES (...);
-```
-
-## Storage Efficiency
-
-Out-of-line BLOBs keep the Hudi table footprint extremely small:
-
-| Metric | Inline | Out-of-Line |
-|:-------|:-------|:------------|
-| Table size vs. raw data | ~100% | < 1% |
-| Query metadata without fetch | Requires reading data files | Only reads
pointer columns |
-| Random access to raw data | Read full row | Seek to (offset, length) |
-| Best for object size | < 1 MB | > 1 MB |
-
-## Configuration Reference
-
-| Property | Default | Description |
-|:---------|:--------|:------------|
-| `hoodie.read.blob.inline.mode` | `DESCRIPTOR` | Controls how INLINE BLOBs
are read. `DESCRIPTOR` (default) returns an out-of-line-shaped reference
pointing at the in-file coordinates of the bytes — no bytes are materialized.
`CONTENT` materializes the raw inline bytes directly in the `data` field on
every read. |
-| `hoodie.blob.batching.max.gap.bytes` | `4096` | Maximum gap (in bytes)
between consecutive byte ranges before they are merged into a single read.
Larger values reduce I/O calls at the cost of reading some unused bytes. |
-| `hoodie.blob.batching.lookahead.size` | `50` | Number of rows to buffer for
batch read detection. Larger values improve batching for sorted data but
increase memory usage. |
-
-:::note
-`DESCRIPTOR` mode is the default for all storage formats including Lance.
`CONTENT` mode is always
-used for internal operations (compaction, merge, log replay) regardless of
this setting.
-:::
-
-:::caution Calling read_blob() on INLINE columns under DESCRIPTOR mode
-Under the default `DESCRIPTOR` mode, calling `read_blob()` on an INLINE BLOB
column **throws** —
-the raw bytes are not materialized in the scan, so there is nothing for
`read_blob()` to return.
-To read inline bytes with `read_blob()`, switch to `CONTENT` mode first:
-
-```sql
-SET hoodie.read.blob.inline.mode=CONTENT;
-SELECT asset_id, read_blob(content) AS raw_bytes
-FROM media_assets
-WHERE asset_id = 'asset_001';
-```
-
-This setting affects only INLINE columns — OUT_OF_LINE columns always fetch
from the external path
-regardless of mode.
-:::
-
-## Metastore Sync
-
-When syncing BLOB column schemas to Hive or BigQuery, Hudi maps the BLOB
struct to the target
-catalog's native struct type:
-
-| Catalog | BLOB representation |
-|:--------|:-------------------|
-| Hive | `STRUCT<type:STRING, data:BINARY,
reference:STRUCT<external_path:STRING, offset:BIGINT, length:BIGINT,
managed:BOOLEAN>>` |
-| BigQuery | Equivalent `STRUCT` fields |
-
-The raw binary payload is preserved in the struct representation, but
`read_blob()` is a Spark SQL
-function and is not available in Hive or BigQuery directly.
-
-## Best Practices
-
-1. **Choose the right mode** — Use inline for small, frequently-accessed
objects. Use out-of-line for
- anything over 1 MB.
-
-2. **Filter before resolving** — Always apply WHERE predicates before calling
`read_blob()` to avoid
- unnecessary data transfer.
-
-3. **Batch container files** — When using out-of-line mode, pack multiple
objects into container files
- rather than storing one file per object.
-
-4. **Combine with VECTOR** — Pair BLOB columns with VECTOR columns for
powerful "search then retrieve"
- workflows: vector search narrows candidates, then `read_blob()` fetches
just the winners.
-
-5. **Use incremental queries** — Process only new BLOBs by leveraging Hudi's
incremental query support:
- ```sql
- SELECT * FROM hudi_table_changes('media_assets', 'latest_state',
'20260401000000');
- ```
diff --git a/website/docs/ingestion_flink.md b/website/docs/ingestion_flink.md
index 2e534e286d67..774b8abe206d 100644
--- a/website/docs/ingestion_flink.md
+++ b/website/docs/ingestion_flink.md
@@ -544,6 +544,26 @@ SELECT event_id, _hoodie_commit_time, payload FROM events;
Only `VIRTUAL` metadata columns are supported. All valid virtual columns
correspond to Hudi's built-in meta fields (`_hoodie_commit_time`,
`_hoodie_commit_seqno`, `_hoodie_record_key`, `_hoodie_partition_path`,
`_hoodie_file_name`, `_hoodie_operation`).
:::
+## Engine constraints for types
+
+Flink behavior of the new column types ([`VECTOR`](sql_ddl.md#vector),
+[`BLOB`](sql_ddl.md#blob), [`VARIANT`](sql_ddl.md#variant)) differs from Spark
in a few places:
+
+`VECTOR` columns are stored as Parquet `FIXED_LEN_BYTE_ARRAY`, which Flink's
Parquet reader does
+not convert back into a typed array. The other columns of the same table read
fine; only the
+`VECTOR` column itself is inaccessible from Flink. Use Spark to query `VECTOR`
columns.
+
+Native `VARIANT` operations require Flink 2.1+. Flink < 2.1 throws
+`UnsupportedOperationException` on VARIANT columns. On Flink 2.1+, VARIANT
columns surface as
+`ROW<metadata BYTES, value BYTES>`. Flink can read the underlying struct but
cannot decode it as a
+variant value.
+
+`read_blob()` is a Spark SQL function. From Flink, queries on a `BLOB` column
return the underlying
+struct directly.
+
+The Lance base file format is Spark-only. Reading a Lance-backed table from
Flink throws
+`HoodieValidationException`; see [Storage Layouts →
Lance](storage_layouts.md#lance-base-file-format).
+
## Advanced Options
### Hadoop Configuration Pass-through
diff --git a/website/docs/lance_file_format.md
b/website/docs/lance_file_format.md
deleted file mode 100644
index 9754a4e31797..000000000000
--- a/website/docs/lance_file_format.md
+++ /dev/null
@@ -1,220 +0,0 @@
----
-title: "Lance File Format"
-keywords: [ hudi, lance, file format, vector, AI, ML, columnar, ANN, indexing]
-summary: "Use the Lance columnar file format with Hudi for vector-friendly
storage and efficient ML workloads"
-toc: true
-last_modified_at: 2026-05-27T00:00:00-00:00
----
-
-[Lance](https://lancedb.github.io/lance/) is a modern columnar data format
designed for AI and machine learning
-workloads. Hudi's pluggable storage architecture lets you use Lance as the
base file format alongside Parquet
-and ORC, unlocking vector indexing, fast random access, and optimized
high-dimensional array storage.
-
-:::caution Engine Support
-Lance file format support is **Spark-only**. Attempting to read a Lance-backed
table from Flink or Hive throws a
-`HoodieValidationException`:
-> Lance base file format is currently only supported with the Spark engine.
Please use Parquet, ORC, or HFile
-> for non-Spark engines (Flink, Hive, Presto, Trino).
-
-The Lance JAR is **not bundled** in the Hudi distribution — you must add it to
your Spark classpath
-(see [Required Dependencies](#required-dependencies)).
-:::
-
-## Enabling Lance in Hudi
-
-### Table Creation (COW)
-
-Set the base file format to `lance` in table properties:
-
-```sql
-CREATE TABLE my_ai_table (
- id STRING,
- embedding VECTOR(768),
- metadata STRING
-) USING hudi
-TBLPROPERTIES (
- primaryKey = 'id',
- type = 'cow',
- hoodie.record.merger.impls = 'org.apache.hudi.DefaultSparkRecordMerger',
- hoodie.table.base.file.format = 'lance'
-);
-```
-
-### Table Creation (MOR)
-
-Lance base files work with MOR tables — Lance files act as base files while
Avro log files capture
-incremental changes. Log compaction merges the delta log back into Lance base
files.
-
-```sql
-CREATE TABLE my_ai_table_mor (
- id STRING,
- embedding VECTOR(768),
- metadata STRING
-) USING hudi
-TBLPROPERTIES (
- primaryKey = 'id',
- type = 'mor',
- hoodie.record.merger.impls = 'org.apache.hudi.DefaultSparkRecordMerger',
- hoodie.table.base.file.format = 'lance'
-);
-```
-
-### DataFrame API
-
-```python
-(df.write
- .format("hudi")
- .option("hoodie.table.name", "my_ai_table")
- .option("hoodie.datasource.write.recordkey.field", "id")
- .option("hoodie.record.merger.impls",
- "org.apache.hudi.DefaultSparkRecordMerger")
- .option("hoodie.table.base.file.format", "lance")
- .mode("overwrite")
- .save("/path/to/my_ai_table"))
-```
-
-### Required Dependencies
-
-The Lance JAR is not bundled in Hudi. Add the appropriate Lance Spark bundle
JAR to your Spark classpath:
-
-| Spark Version | Bundle JAR (Maven Central) |
-|:--------------|:---------------------------|
-| Spark 3.4 | `org.lance:lance-spark-bundle-3.4_2.12:0.4.0` |
-| Spark 3.5 | `org.lance:lance-spark-bundle-3.5_2.12:0.4.0` |
-| Spark 4.0 | `org.lance:lance-spark-bundle-4.0_2.13:0.4.0` |
-| Spark 4.1 | `org.lance:lance-spark-bundle-4.1_2.13:0.4.0` |
-
-```bash
-export LANCE_BUNDLE_JAR=/path/to/lance-spark-bundle-3.5_2.12-0.4.0.jar
-
-# Include both Hudi and Lance JARs
-spark-shell --jars $HUDI_BUNDLE_JAR,$LANCE_BUNDLE_JAR
-```
-
-## How Hudi + Lance Work Together
-
-Hudi manages the table layer — transactions, schema, timeline, table services
— while Lance handles the
-file-level storage:
-
-```
-┌───────────────────────────────────┐
-│ Hudi Table Layer │
-│ Timeline, Metadata, Indexing │
-│ Transactions, Schema Evolution │
-├───────────────────────────────────┤
-│ File Group / File Slice │
-│ (same Hudi concepts as Parquet) │
-├───────────────────────────────────┤
-│ Lance Data Files (.lance) │
-│ Columnar storage │
-│ Fragment-based layout │
-├───────────────────────────────────┤
-│ Storage (S3, GCS, HDFS, FS) │
-└───────────────────────────────────┘
-```
-
-All Hudi table services work with Lance-backed tables:
-
-- **Compaction** — merges log files into Lance base files
-- **Clustering** — reorganizes Lance files for better data locality
-- **Cleaning** — removes old Lance file versions
-- **Metadata indexing** — bloom filters work across Lance files; column stats
and partition stats are
- **automatically disabled** for Lance tables
-
-## VECTOR Storage on Lance
-
-VECTOR columns are stored natively in Lance as `FixedSizeList<Float32/Float64,
dim>` — Lance's own
-vector column encoding, so embeddings are written without conversion overhead
at the file-format
-layer.
-
-Only **FLOAT** and **DOUBLE** element types are supported as VECTOR columns on
Lance. INT8 vectors
-are not yet supported and will fail fast at write time.
-
-See [Vector Search](vector_search.md) for the `hudi_vector_search` TVF that
queries VECTOR columns.
-
-## BLOB Columns on Lance
-
-INLINE BLOB columns on Lance default to `DESCRIPTOR` read mode — standard
queries return an
-out-of-line-shaped reference descriptor rather than materializing the raw
bytes. To read inline
-byte content via `read_blob()`, set `hoodie.read.blob.inline.mode=CONTENT`. See
-[Unstructured Data](blob_unstructured_data.md) for full documentation.
-
-## Schema Evolution
-
-Lance supports the following schema changes at the Hudi layer:
-
-| Operation | Supported? |
-|:----------|:-----------|
-| Add column | Yes |
-| Rename column | Yes (via Hudi schema evolution) |
-| Promote `FLOAT` → `DOUBLE` | **No** — not supported on Lance |
-| Promote `FLOAT` → `STRING` | **No** — not supported on Lance |
-| Drop column | Yes |
-
-:::caution
-`FLOAT → DOUBLE` and `FLOAT → STRING` type promotions are supported for
Parquet tables but **not**
-for Lance. Attempting these on a Lance table will fail. Use `DOUBLE` from the
start if you anticipate
-needing higher precision.
-:::
-
-## Vector Search with Lance
-
-Use the `hudi_vector_search` TVF to run vector similarity queries against
VECTOR columns on a
-Lance-backed table:
-
-```sql
-SELECT id, metadata, _hudi_distance
-FROM hudi_vector_search(
- 'my_ai_table', 'embedding',
- ARRAY(0.1, 0.2, ...), -- query vector
- 10, 'cosine'
-)
-ORDER BY _hudi_distance;
-```
-
-See [Vector Search](vector_search.md) for full documentation on the TVF and
distance metrics.
-
-## Configuration Reference
-
-| Property | Default | Description |
-|:---------|:--------|:------------|
-| `hoodie.table.base.file.format` | `parquet` | Set to `lance` to use Lance as
the base file format. |
-| `hoodie.record.merger.impls` | — | Must be
`org.apache.hudi.DefaultSparkRecordMerger` for Lance. |
-| `hoodie.lance.max.file.size` | `125829120` (120 MiB) | Target file size in
bytes for Lance base files. |
-| `hoodie.lance.write.allocator.size.bytes` | `268435456` (256 MiB) | Maximum
size of the Arrow child allocator used for buffering in-flight batch data.
Increase for tables with very large BLOB columns. |
-| `hoodie.lance.write.flush.byte.watermark` | `100663296` (96 MiB) | Byte-size
threshold at which the current write batch is flushed. Must be less than
`hoodie.lance.write.allocator.size.bytes`. |
-
-### File Sizing and Memory
-
-The three sizing configs work together:
-
-- **`hoodie.lance.max.file.size`** controls when Hudi rolls over to a new
Lance file, similar to
- `hoodie.parquet.max.file.size` for Parquet tables.
-- **`hoodie.lance.write.allocator.size.bytes`** caps the Arrow allocator's
in-flight memory. Arrow
- uses power-of-2 buffer doubling; the default 256 MiB accommodates the 128
MiB doubling step with
- headroom.
-- **`hoodie.lance.write.flush.byte.watermark`** triggers an early batch flush
when Arrow buffers
- approach the cap. The default 96 MiB (≈ 3/8 of the allocator cap) leaves
room for offset and
- validity buffers to double without exceeding the allocator limit.
-
-For tables with large BLOB columns, increase both
`hoodie.lance.write.allocator.size.bytes` and
-`hoodie.lance.write.flush.byte.watermark` proportionally (keep watermark at
roughly 3/8 of allocator
-size).
-
-## Additional Notes
-
-- **`populateMetaFields=false`** is supported. User-defined key generators
work normally with Lance
- tables.
-- **Complex types** (struct, array, map) are supported as Lance columns.
-- **VARIANT columns** are **not supported** on Lance. Attempting to write a
table with VARIANT columns
- to Lance throws a `HoodieNotSupportedException`. Use Parquet for tables with
VARIANT columns.
-
-## Mixed-Format Tables
-
-Hudi does not require all tables in a lakehouse to use the same file format.
You can have:
-
-- **Analytical tables** on Parquet — for traditional BI/SQL workloads
-- **AI tables** on Lance — for embeddings, vector search, and ML feature
storage
-- **Archival tables** on ORC — for compressed long-term storage
-
-All share the same Hudi catalog, metadata, and tooling.
diff --git a/website/docs/overview.mdx b/website/docs/overview.mdx
index ed439cef38a1..cc8c7707927b 100644
--- a/website/docs/overview.mdx
+++ b/website/docs/overview.mdx
@@ -51,21 +51,10 @@ Sometimes the fastest way to learn is by doing. Try out
these Quick Start resour
- [Spark Quick Start Guide](quick-start-guide.md) – if you primarily use
Apache Spark
- [Flink Quick Start Guide](flink-quick-start-guide.md) – if you primarily use
Apache Flink
- [Python/Rust Quick Start Guide (Hudi-rs)](python-rust-quick-start-guide.md)
- if you primarily use Python or Rust
-- [AI Quick Start Guide](ai-quick-start-guide.md) – if you are building AI/ML
pipelines with embeddings, vector search, or unstructured data
+- [Unstructured Data Quick Start
Guide](unstructured-data-quick-start-guide.md) – store embeddings (`VECTOR`)
and raw bytes (`BLOB`) and run similarity search with `hudi_vector_search`
If you want to experience Apache Hudi integrated into an end to end demo with
Kafka, Spark, Hive, Presto, etc, try out the [Docker Demo](docker_demo.md)
-## AI-Native Capabilities
-
-Hudi brings first-class support for AI and unstructured data workloads to the
data lakehouse:
-
-- **[VECTOR type & Similarity Search](vector_search.md)** — Store embeddings
and run vector similarity search directly in Spark SQL
-- **[BLOB type for Unstructured Data](blob_unstructured_data.md)** — Store
images, PDFs, audio, and other binary data with inline or out-of-line storage
-- **[VARIANT type for Semi-Structured Data](variant_type.md)** — Store
flexible JSON-like data (LLM outputs, model metadata, feature maps) without
rigid schemas
-- **[Lance File Format](lance_file_format.md)** — Vector-friendly columnar
format for AI/ML workloads
-
-See the full [AI-Native Lakehouse Overview](ai_overview.md) for use cases and
architecture.
-
## Connect With The Community
Apache Hudi is community-focused and community-led and welcomes new-comers
with open arms. Leverage the following
resources to learn more, engage, and get help as you get started.
diff --git a/website/docs/sql_ddl.md b/website/docs/sql_ddl.md
index 2fc0d421dab2..3792f3248b4d 100644
--- a/website/docs/sql_ddl.md
+++ b/website/docs/sql_ddl.md
@@ -146,6 +146,183 @@ TBLPROPERTIES (
LOCATION 'file:///tmp/hudi_table_merge_mode_custom/';
```
+### Create table with unstructured and semi-structured column types
{#create-table-with-unstructured-and-semi-structured-column-types}
+
+Hudi supports three column types for unstructured and semi-structured data,
plus a Lance base file
+format.
+
+#### `VECTOR(dim[, elementType])` {#vector}
+
+A fixed-dimension embedding column. `elementType` is one of `FLOAT` (default),
`DOUBLE`, or `INT8`
+(alias `BYTE`):
+
+| Element type | Storage |
+|:-------------|:--------|
+| `FLOAT` (default) | `ArrayType(FloatType)` |
+| `DOUBLE` | `ArrayType(DoubleType)` |
+| `INT8` / `BYTE` | `ArrayType(ByteType)` |
+
+```sql
+CREATE TABLE products (
+ product_id STRING,
+ name STRING,
+ embedding VECTOR(768) -- defaults to FLOAT
+) USING hudi
+TBLPROPERTIES (
+ primaryKey = 'product_id',
+ type = 'cow',
+ hoodie.record.merger.impls = 'org.apache.hudi.DefaultSparkRecordMerger'
+);
+
+-- Other element types
+-- embedding VECTOR(768, FLOAT)
+-- embedding VECTOR(768, DOUBLE)
+-- embedding VECTOR(256, INT8)
+```
+
+Hudi's SQL parser normalizes `VECTOR(128, FLOAT)` to `VECTOR(128)`.
+
+VECTOR columns must be **top-level fields**; nesting inside `STRUCT`, `ARRAY`,
or `MAP` is not
+supported. Dimension and element type cannot be changed via schema evolution
after table creation.
+
+**Internal storage.** In Parquet, a VECTOR column is stored as
`FIXED_LEN_BYTE_ARRAY` with
+`hudi_type=VECTOR(dim[,elementType])` field metadata so the Spark reader
decodes the bytes back into
+a typed array. On Lance-backed tables, VECTOR is stored natively as
+`FixedSizeList<Float32/Float64, dim>`; only `FLOAT` and `DOUBLE` element types
are accepted on Lance.
+
+Query VECTOR columns with the [`hudi_vector_search`
TVF](sql_queries.md#vector-similarity-search).
+
+**Engine constraints.** Flink cannot decode VECTOR columns (the underlying
Parquet
+`FIXED_LEN_BYTE_ARRAY` is not converted back into a typed array); Flink can
still read other columns
+in a table that contains a VECTOR column.
+
+#### `BLOB` {#blob}
+
+A binary column with two storage modes:
+
+| Mode | Storage | Read pattern |
+|:-----|:--------|:-------------|
+| `INLINE` | Raw bytes embedded in the table row | Direct read; no external
fetch |
+| `OUT_OF_LINE` | Pointer to a byte range in an external file | On-demand via
[`read_blob()`](sql_queries.md#reading-blob-columns) |
+
+```sql
+CREATE TABLE media_assets (
+ asset_id STRING,
+ file_name STRING,
+ mime_type STRING,
+ file_size BIGINT,
+ content BLOB
+) USING hudi
+TBLPROPERTIES (
+ primaryKey = 'asset_id',
+ type = 'cow'
+);
+```
+
+**Internal struct.** A BLOB column is represented internally as:
+
+```
+STRUCT<
+ type: STRING, -- 'INLINE' or 'OUT_OF_LINE'
+ data: BINARY, -- raw bytes for INLINE; null for OUT_OF_LINE
+ reference: STRUCT<
+ external_path: STRING, -- file path for OUT_OF_LINE
+ offset: BIGINT, -- byte offset; null = start of file
+ length: BIGINT, -- byte length; null = read to end
+ managed: BOOLEAN -- advisory flag for the (future) cleaner
+ >
+>
+```
+
+`managed` is currently advisory: `true` records the intent that Hudi owns the
lifecycle of the
+referenced external file (so a future cleaner may delete it when no row
references it), `false`
+records that the file is externally managed. The cleaner does not yet consume
this field.
+
+BLOB columns are excluded from column-stats indexing.
+
+#### `VARIANT` {#variant}
+
+A semi-structured (JSON-like) column that stores any JSON-compatible value
(object, array, string,
+number, boolean, null) as two binary fields:
+
+| Field | Description |
+|:------|:------------|
+| `metadata` | Encodes field names, types, and structure for efficient access |
+| `value` | The data payload |
+
+On Spark 4.0+, declare the native type:
+
+```sql
+CREATE TABLE events (
+ event_id STRING,
+ payload VARIANT,
+ ts BIGINT
+) USING hudi
+TBLPROPERTIES (
+ primaryKey = 'event_id',
+ preCombineField = 'ts'
+);
+```
+
+Spark 3.x has no native `VariantType`. To read a VARIANT-bearing Hudi table
written by Spark 4.x,
+declare the column as a binary struct in the `CREATE TABLE` DDL pointing at
the table location:
+
+```sql
+CREATE TABLE events (
+ event_id STRING,
+ payload STRUCT<value: BINARY, metadata: BINARY>,
+ ts BIGINT
+) USING hudi
+LOCATION '<existing-table-path>'
+TBLPROPERTIES (
+ primaryKey = 'event_id',
+ preCombineField = 'ts'
+);
+```
+
+Spark 3.x then returns the raw `metadata` and `value` bytes; it does not
surface the column as a
+logical VARIANT, and reading a VARIANT table without this explicit DDL (i.e.
letting Hudi
+auto-resolve the schema from commit metadata) throws `"VARIANT type is only
supported in Spark
+4.0+"`.
+
+Engine support for `VARIANT`:
+
+| Engine | Behavior |
+|:-------|:---------|
+| Spark 4.0+ | Native `VariantType` for read/write/query on COW and MOR
(CREATE TABLE with `VARIANT` or DataFrame writes with `VariantType`). |
+| Spark 4.1 | Same as Spark 4.0. Spark 4.1's `PushVariantIntoScan` rewrites
VARIANT projections into struct-of-extractions; Hudi recognizes that shape and
returns the column as a logical VARIANT. |
+| Spark 3.x | No native VARIANT. Backward-compat read of a Spark 4.x-written
table requires the explicit binary-struct DDL above; raw bytes only. |
+| Flink < 2.1 | Throws `UnsupportedOperationException` on VARIANT columns. |
+| Flink ≥ 2.1 | Surfaces VARIANT as `ROW<metadata BYTES, value BYTES>`. Flink
can read the underlying struct but cannot decode it as a variant value. |
+
+VARIANT columns are not supported on Lance-backed tables.
+
+Only unshredded VARIANT is exposed at the user-facing level: every Spark/Hudi
schema conversion
+produces an unshredded VARIANT (two binary fields, `metadata` and `value`).
The shredded variant
+write path exists in the engine but has no DDL, table-property, or
session-config to enable it.
+
+#### Lance base file format
+
+Set the base file format per-table:
+
+```sql
+CREATE TABLE my_ai_table (
+ id STRING,
+ embedding VECTOR(768),
+ metadata STRING
+) USING hudi
+TBLPROPERTIES (
+ primaryKey = 'id',
+ type = 'cow',
+ hoodie.record.merger.impls = 'org.apache.hudi.DefaultSparkRecordMerger',
+ hoodie.table.base.file.format = 'lance'
+);
+```
+
+Lance also works with MOR tables: Lance files act as base files while Avro log
files capture
+incremental changes. See [Storage Layouts →
Lance](storage_layouts.md#lance-base-file-format) for
+configuration, dependencies, and behavior.
+
### Create table from an external location
Often, Hudi tables are created from streaming writers like the [streamer
tool](hoodie_streaming_ingestion.md#hudi-streamer), which
may later need some SQL statements to run on them. You can create an External
table using the `location` statement.
@@ -1003,20 +1180,11 @@ WITH (
| array | array | |
| map | map | |
| struct | struct | |
+| `ArrayType(<elementType>)` with field metadata `hudi_type=VECTOR(dim[,
elementType])` | VECTOR | Fixed-dimension embedding column. Element type is
`FloatType`, `DoubleType`, or `ByteType` (INT8). See [Create table with
unstructured and semi-structured column
types](#create-table-with-unstructured-and-semi-structured-column-types). |
+| `StructType(type STRING, data BINARY, reference STRUCT<…>)` with field
metadata `hudi_type=BLOB` | BLOB | Binary column with `INLINE` / `OUT_OF_LINE`
storage. See [Create table with unstructured and semi-structured column
types](#create-table-with-unstructured-and-semi-structured-column-types). |
+| `VariantType` (Spark 4.0+) or `StructType(metadata BINARY NOT NULL, value
BINARY NOT NULL)` with field metadata `hudi_type=VARIANT` (Spark 3.x) | VARIANT
| Semi-structured (JSON-like) column (unshredded). See [Create table with
unstructured and semi-structured column
types](#create-table-with-unstructured-and-semi-structured-column-types). |
| char | | not supported |
| varchar | | not supported |
| numeric | | not supported |
| null | | not supported |
| object | | not supported |
-
-### AI and Unstructured Data Types
-
-Hudi 1.2.0 introduces two additional column types for AI and unstructured data
workloads:
-
-- **`VECTOR(dim[, elementType])`** — stores fixed-dimension embedding vectors
(e.g. `VECTOR(768)`,
- `VECTOR(768, FLOAT)`, `VECTOR(768, DOUBLE)`). Enables approximate
nearest-neighbor search via
- the `hudi_vector_search` TVF. See [Vector Search](vector_search.md) for full
details.
-
-- **`BLOB`** — stores arbitrary binary objects (images, audio, documents)
either inline within the
- base file or as external references. See [BLOB / Unstructured
Data](blob_unstructured_data.md)
- for the storage modes, DDL syntax, and read APIs.
diff --git a/website/docs/sql_dml.md b/website/docs/sql_dml.md
index d3caf5511fa2..43e38e813971 100644
--- a/website/docs/sql_dml.md
+++ b/website/docs/sql_dml.md
@@ -48,6 +48,81 @@ INSERT INTO hudi_cow_pt_tbl PARTITION(dt = '2021-12-09',
hh='11') SELECT 2, 'a2'
INSERT INTO hudi_cow_pt_tbl PARTITION(dt, hh) SELECT 1 AS id, 'a1' AS name,
1000 AS ts, '2021-12-09' AS dt, '10' AS hh;
```
+#### Inserting VECTOR columns
+
+A [`VECTOR(dim[, elementType])`](sql_ddl.md#vector) column accepts an `ARRAY`
whose length and
+element type match the declared dimension and element type:
+
+```sql
+INSERT INTO products VALUES (
+ 'prod_001', 'Running Shoes', 'Lightweight trail runner',
+ ARRAY(0.123, -0.456, 0.789, /* ... 768 floats ... */)
+);
+```
+
+The query vector's element type must exactly match the corpus embedding's
element type (no implicit
+casting).
+
+#### Inserting BLOB columns
+
+A [`BLOB`](sql_ddl.md#blob) column is internally a struct with `type`
(`INLINE`/`OUT_OF_LINE`),
+`data`, and a `reference` pointer struct. Construct it via `named_struct`.
+
+**Inline** — bytes embedded directly in the row:
+
+```sql
+INSERT INTO media_assets VALUES (
+ 'asset_001', 'logo.png', 'image/png', 45230,
+ named_struct(
+ 'type', 'INLINE',
+ 'data', /* binary literal or column reference */,
+ 'reference', CAST(NULL AS STRUCT<external_path: STRING, offset:
BIGINT, length: BIGINT, managed: BOOLEAN>)
+ )
+);
+```
+
+**Out-of-line** — pointer to a byte range in an external file:
+
+```sql
+INSERT INTO media_assets VALUES (
+ 'asset_002', 'video.mp4', 'video/mp4', 1073741824,
+ named_struct(
+ 'type', 'OUT_OF_LINE',
+ 'data', CAST(NULL AS BINARY),
+ 'reference', named_struct(
+ 'external_path', 's3://my-bucket/media/container_001.bin',
+ 'offset', 8388608,
+ 'length', 1073741824,
+ 'managed', false
+ )
+ )
+);
+```
+
+A common pattern for `OUT_OF_LINE` storage is to pack multiple objects into a
single binary
+container file and have each BLOB row store the `(external_path, offset,
length)` triple into that
+container:
+
+```
+container_001.bin
+├── [offset=0, len=45230] → logo.png
+├── [offset=45230, len=89012] → photo.jpg
+├── [offset=134242, len=1073741824] → video.mp4
+└── ...
+```
+
+#### Inserting VARIANT columns
+
+A [`VARIANT`](sql_ddl.md#variant) column accepts any JSON-compatible value. On
Spark 4.0+, build
+VARIANT values with `parse_json`:
+
+```sql
+INSERT INTO events VALUES
+ ('evt_001', parse_json('{"action": "click", "x": 120, "y": 450}'), 1000),
+ ('evt_002', parse_json('{"action": "purchase", "items": ["sku_a",
"sku_b"], "total": 59.99}'), 1001),
+ ('evt_003', parse_json('"simple string value"'), 1002);
+```
+
:::note Mapping to write operations
Hudi offers flexibility in choosing the underlying [write
operation](write_operations.md) of a `INSERT INTO` statement using
the `hoodie.spark.sql.insert.into.operation` configuration. Possible options
include *"bulk_insert"* (large inserts), *"insert"* (with small file
management),
diff --git a/website/docs/sql_queries.md b/website/docs/sql_queries.md
index 4b9d6c2dd695..efa067c77039 100644
--- a/website/docs/sql_queries.md
+++ b/website/docs/sql_queries.md
@@ -352,16 +352,186 @@ time to completion time depending on the source table
version.
### Vector Similarity Search
-Hudi 1.2.0 introduces a `hudi_vector_search` table-valued function (TVF) for
approximate
-nearest-neighbor (ANN) search over `VECTOR` columns. This is an extension of
the
-`hudi_table_changes` TVF pattern.
+The `hudi_vector_search` table-valued function (TVF) runs top-K similarity
search over a
+[`VECTOR`](sql_ddl.md#vector) column. It returns the `top_k` rows whose VECTOR
column is closest to
+a query vector under the chosen distance metric.
```sql
--- Find the 10 nearest neighbors to a query vector in the 'embedding' column
-SELECT * FROM hudi_vector_search('db.embeddings_table', 'embedding',
ARRAY(0.1, 0.2, ...), 10);
+SELECT *
+FROM hudi_vector_search(
+ table_name, -- STRING: registered table name or path
+ vector_column, -- STRING: VECTOR column name
+ query_vector, -- ARRAY: query embedding
+ top_k, -- INT: number of nearest neighbors
+ [distance_metric], -- STRING: 'cosine' (default), 'l2', 'dot_product'
+ [algorithm] -- STRING: 'brute_force' (default)
+)
+```
+
+Parameters:
+
+| Parameter | Type | Default | Description |
+|:----------|:-----|:--------|:------------|
+| `table_name` | STRING | (required) | Registered table name or table path. |
+| `vector_column` | STRING | (required) | Name of the VECTOR column. |
+| `query_vector` | ARRAY<FLOAT> | (required) | Query embedding; must
match the column's dimension and element type. |
+| `top_k` | INT | (required) | Number of nearest neighbors. |
+| `distance_metric` | STRING | `'cosine'` | One of `'cosine'`, `'l2'`,
`'dot_product'`. |
+| `algorithm` | STRING | `'brute_force'` | Only `'brute_force'` is currently
supported. |
+
+Return schema: all columns from the source table (excluding the embedding
column) plus
+`_hudi_distance DOUBLE`. Results are ordered by `_hudi_distance` ascending, so
the closest matches
+come first.
+
+Distance metrics:
+
+| Metric | Formula | Range | Notes |
+|:-------|:--------|:------|:------|
+| `cosine` | 1 − cos(a, b), clamped to [0, 2] | [0, 2] | Returns 1.0 for zero
vectors. |
+| `l2` | sqrt(sum((a[i] − b[i])²)) | [0, +∞) | |
+| `dot_product` | −(a · b) | (−∞, +∞) | Negated so ascending sort surfaces the
most similar rows first. |
+
+```sql
+-- Find the 10 nearest neighbors to a query embedding
+SELECT product_id, name, _hudi_distance
+FROM hudi_vector_search('products', 'embedding',
+ ARRAY(0.12, -0.03, 0.87, /* ... */),
+ 10, 'cosine')
+ORDER BY _hudi_distance;
+```
+
+RAG context retrieval to apply a distance threshold to drop weak matches:
+
+```sql
+-- Retrieve the 5 most relevant document chunks for an LLM prompt
+SELECT chunk_id, text_content, _hudi_distance
+FROM hudi_vector_search(
+ 'document_chunks', 'embedding',
+ ARRAY(/* embedding of the user's question */),
+ 5, 'cosine'
+)
+WHERE _hudi_distance < 0.3;
+```
+
+Cross-modal search to query a corpus of image embeddings with a text embedding
(e.g., CLIP):
+
+```sql
+SELECT image_id, caption, _hudi_distance
+FROM hudi_vector_search(
+ 'image_catalog', 'clip_embedding',
+ ARRAY(/* text embedding from CLIP */),
+ 20, 'cosine'
+);
+```
+
+The TVF also works on Lance-backed tables. Set
`hoodie.table.base.file.format=lance` at table
+creation (see [Storage Layouts →
Lance](storage_layouts.md#lance-base-file-format)); the call site
+is unchanged.
+
+:::note
+`cosine` distance computes `1 − cos(a, b)`. If embeddings are not
L2-normalized before write,
+results reflect both vector direction and magnitude.
+:::
+
+#### `hudi_vector_search_batch`
+
+The batch variant runs many query vectors at once:
+
+```sql
+SELECT *
+FROM hudi_vector_search_batch(
+ corpus_table, -- STRING
+ corpus_embedding_col, -- STRING
+ query_table, -- STRING: table holding query vectors
+ query_embedding_col, -- STRING: VECTOR column in query table
+ top_k, -- INT: neighbors per query
+ [distance_metric],
+ [algorithm]
+)
+```
+
+Return schema: corpus columns + query columns + `_hudi_distance DOUBLE` +
+`_hudi_query_index LONG` (identifies the query that produced the row). If
corpus and query share
+column names, query columns are prefixed with `_hudi_query_`.
+
+### Reading BLOB Columns
+
+The `read_blob()` SQL function materializes raw bytes from a
[`BLOB`](sql_ddl.md#blob) column. It
+handles both storage modes uniformly: for `INLINE`, it extracts the embedded
bytes; for
+`OUT_OF_LINE`, it reads `reference.length` bytes starting at
`reference.offset` from
+`reference.external_path`.
+
+```sql
+SELECT asset_id, read_blob(content) AS raw_bytes
+FROM media_assets
+WHERE asset_id = 'asset_001';
+```
+
+Standard queries on a BLOB column return the underlying struct (descriptor +
bytes), not raw bytes:
+
+```sql
+SELECT asset_id, content.type, content.reference.external_path
+FROM media_assets;
+```
+
+#### Inline read mode
+
+| Property | Default | Description |
+|:---------|:--------|:------------|
+| `hoodie.read.blob.inline.mode` | `DESCRIPTOR` | Controls how `INLINE` BLOBs
surface on read. `DESCRIPTOR` (default) returns an out-of-line-shaped reference
pointing at the in-file coordinates of the bytes. No bytes are materialized.
`CONTENT` materializes the raw inline bytes in the `data` field on every read. |
+| `hoodie.blob.batching.max.gap.bytes` | `4096` | Maximum gap between
consecutive byte ranges before they are merged into a single read. Larger
values reduce I/O calls at the cost of reading some unused bytes. |
+| `hoodie.blob.batching.lookahead.size` | `50` | Number of rows to buffer for
batch-read detection. Larger values improve batching for sorted data but
increase memory usage. |
+
+`CONTENT` mode is always used for internal operations (compaction, merge, log
replay) regardless of
+this setting.
+
+:::caution Calling `read_blob()` on INLINE columns under DESCRIPTOR mode
+Under the default `DESCRIPTOR` mode, calling `read_blob()` on an `INLINE` BLOB
column throws.
+The raw bytes are not materialized in the scan, so there is nothing for
`read_blob()` to return.
+To read inline bytes with `read_blob()`, switch to `CONTENT` mode first:
+
+```sql
+SET hoodie.read.blob.inline.mode=CONTENT;
+SELECT asset_id, read_blob(content) AS raw_bytes
+FROM media_assets WHERE asset_id = 'asset_001';
+```
+
+This setting affects only `INLINE` columns. `OUT_OF_LINE` always fetches from
the external path.
+:::
+
+`read_blob()` is a Spark SQL function; Hive, BigQuery, and other engines
reading the underlying
+struct directly do not have it. Apply predicates before calling `read_blob()`
to bound the bytes
+resolved per row.
+
+### Querying VARIANT Columns
+
+[`VARIANT`](sql_ddl.md#variant) columns hold semi-structured (JSON-like) data.
Cast to `STRING` for
+JSON output:
+
+```sql
+SELECT event_id, cast(payload as STRING) AS payload_json FROM events;
+```
+
+VARIANT columns support `UPDATE`, `DELETE`, and `MERGE` on both COW and MOR
tables.
+
+#### End-to-end example
+
+`hudi_vector_search` and `read_blob()` compose in a single query that returns
both the matching
+rows and the materialized bytes for each:
+
+```sql
+SELECT image_id, category,
+ read_blob(image_bytes) AS resolved_bytes,
+ _hudi_distance
+FROM hudi_vector_search(
+ '/tmp/hudi_pets', 'embedding',
+ ARRAY(0.12, -0.03, /* ... */),
+ 5, 'cosine'
+)
+ORDER BY _hudi_distance;
```
-See [Vector Search](vector_search.md) for the full API, supported metrics, and
setup instructions.
+See [Unstructured Data Quick Start
Guide](unstructured-data-quick-start-guide.md) for a full end-to-end
walk-through.
### Query Indexes and Timeline
diff --git a/website/docs/storage_layouts.md b/website/docs/storage_layouts.md
index a88f584e3f9d..b069c914d5b1 100644
--- a/website/docs/storage_layouts.md
+++ b/website/docs/storage_layouts.md
@@ -28,9 +28,101 @@ base file formats.
* columnar formats for vectorized reads, columnar compression and efficient
column based access for analytics/data science.
* row-oriented avro files for fast scans for reading whole records.
* random access optimized HFiles for efficient searching for indexed records
(based on
[SSTable](https://github.com/facebook/rocksdb/wiki/A-Tutorial-of-RocksDB-SST-formats)
format)
+* [Lance](#lance-base-file-format) files for native `VECTOR` storage and AI/ML
workloads (Spark-only).

+#### Lance base file format {#lance-base-file-format}
+
+[Lance](https://lancedb.github.io/lance/) is a pluggable base file format
selected per table via
+`hoodie.table.base.file.format = 'lance'`. Hudi manages the table layer
+(timeline, metadata, schema, file groups, table services); Lance is the
on-disk file format for
+base files. Log files for MOR tables remain Avro; log compaction merges Avro
logs into Lance base
+files.
+
+```sql
+-- COW
+CREATE TABLE my_ai_table (
+ id STRING,
+ embedding VECTOR(768),
+ metadata STRING
+) USING hudi
+TBLPROPERTIES (
+ primaryKey = 'id',
+ type = 'cow',
+ hoodie.record.merger.impls = 'org.apache.hudi.DefaultSparkRecordMerger',
+ hoodie.table.base.file.format = 'lance'
+);
+
+-- MOR (Lance base + Avro logs)
+CREATE TABLE my_ai_table_mor (
+ id STRING,
+ embedding VECTOR(768),
+ metadata STRING
+) USING hudi
+TBLPROPERTIES (
+ primaryKey = 'id',
+ type = 'mor',
+ hoodie.record.merger.impls = 'org.apache.hudi.DefaultSparkRecordMerger',
+ hoodie.table.base.file.format = 'lance'
+);
+```
+
+Hudi table services on Lance-backed tables behave as follows. Compaction
merges Avro log files
+into Lance base files. Clustering reorganizes records into new Lance files.
Cleaning removes
+obsolete Lance file slices. Bloom filter indexing is supported; column-stats
and partition-stats
+indices are automatically disabled for Lance base files. See
[Indexes](indexes.md) for the
+supported set.
+
+Type-specific behavior on Lance:
+
+- `VECTOR` columns are stored natively as Lance
`FixedSizeList<Float32/Float64, dim>` (FLOAT or
+ DOUBLE only; `INT8` is not supported on Lance and fails fast at write).
+- `BLOB` columns default to `DESCRIPTOR` read mode, same as Parquet.
+- `VARIANT` columns are not supported on Lance. Writing a table with a VARIANT
column to Lance
+ throws `HoodieNotSupportedException`. Use Parquet for VARIANT tables.
+- Complex types (`STRUCT`, `ARRAY`, `MAP`) are supported as Lance columns.
+- `populateMetaFields=false` is supported. User-defined key generators work
normally with
+ Lance-backed tables.
+
+Lance is Spark-only. Reading a Lance-backed table from Flink, Hive, Presto, or
Trino throws
+`HoodieValidationException`. Lance files are also non-splittable: a single
Spark task reads one
+Lance base file.
+
+The Lance JAR is not bundled in the Hudi distribution. Add the Lance Spark
bundle that matches
+your Spark version to the Spark classpath:
+
+| Spark version | Bundle (Maven Central) |
+|:--------------|:-----------------------|
+| Spark 3.4 | `org.lance:lance-spark-bundle-3.4_2.12:0.4.0` |
+| Spark 3.5 | `org.lance:lance-spark-bundle-3.5_2.12:0.4.0` |
+| Spark 4.0 | `org.lance:lance-spark-bundle-4.0_2.13:0.4.0` |
+| Spark 4.1 | `org.lance:lance-spark-bundle-4.1_2.13:0.4.0` |
+
+```bash
+export LANCE_BUNDLE_JAR=/path/to/lance-spark-bundle-3.5_2.12-0.4.0.jar
+spark-shell --jars $HUDI_BUNDLE_JAR,$LANCE_BUNDLE_JAR
+```
+
+##### File sizing and memory
+
+| Property | Default | Description |
+|:---------|:--------|:------------|
+| `hoodie.lance.max.file.size` | `125829120` (120 MiB) | Target file size in
bytes for Lance base files (analogous to `hoodie.parquet.max.file.size`). |
+| `hoodie.lance.write.allocator.size.bytes` | `268435456` (256 MiB) | Maximum
size of the Arrow child allocator used for buffering in-flight batch data.
Increase for tables with very large `BLOB` columns. |
+| `hoodie.lance.write.flush.byte.watermark` | `100663296` (96 MiB) | Byte-size
threshold at which the current write batch is flushed. Must be less than
`hoodie.lance.write.allocator.size.bytes`. |
+
+Arrow uses power-of-2 buffer doubling; the default 256 MiB allocator
accommodates a 128 MiB doubling
+step with headroom. The default 96 MiB watermark (~3/8 of the allocator cap)
leaves room for offset
+and validity buffers to double without exceeding the allocator limit. For
tables with large `BLOB`
+columns, increase the allocator and watermark together (keep watermark ≈ 3/8
of allocator).
+
+##### Mixed-format tables
+
+`hoodie.table.base.file.format` is set per table, so different tables in the
same lakehouse can use
+different base file formats (Parquet, ORC, HFile, Lance) under a shared Hudi
catalog and metadata
+table.
+
### Log Files
Log files store incremental changes (partial or full) to a base file, such as
updates, inserts, and deletes, after the base file was created.
diff --git a/website/docs/syncing_metastore.md
b/website/docs/syncing_metastore.md
index 059f172de003..d857affd8178 100644
--- a/website/docs/syncing_metastore.md
+++ b/website/docs/syncing_metastore.md
@@ -327,6 +327,27 @@ If the table is synced with `mode=hms` or `mode=hiveql`
against HMS 4.x, Hudi lo
**JDBC connection failures surface separately.** With `mode=jdbc`, Hudi opens
the JDBC connection eagerly when the sync client is constructed — before any
Thrift call is attempted. A bad JDBC URL, missing driver, or wrong credentials
therefore fails at startup with `HoodieHiveSyncException: Failed to create
HiveMetaStoreClient` and the underlying JDBC exception as the cause in the
stack trace. This is a configuration-error path, not an HMS API mismatch, and
is the same behavior as `mode= [...]
+## Metastore mapping for VECTOR, BLOB, and VARIANT
+
+When the column types ([`VECTOR`](sql_ddl.md#vector),
[`BLOB`](sql_ddl.md#blob),
+[`VARIANT`](sql_ddl.md#variant)) are synced to an external catalog, Hudi maps
them to the catalog's
+native binary/struct types:
+
+| Type | Hive | BigQuery |
+|:-----|:-----|:---------|
+| `VECTOR` | `BINARY` | `BYTES` |
+| `BLOB` | `STRUCT<type:STRING, data:BINARY,
reference:STRUCT<external_path:STRING, offset:BIGINT, length:BIGINT,
managed:BOOLEAN>>` | Equivalent `STRUCT` fields |
+| `VARIANT` | `STRUCT<metadata:BINARY, value:BINARY>` | `STRUCT` with
`metadata` and `value` fields (`BYTES`) |
+
+For `VECTOR`, the original `VECTOR(dim, elementType)` dimension and
element-type metadata is
+preserved in `TBLPROPERTIES`/table descriptions so the table can be correctly
reconstructed by
+Spark after a metastore round-trip.
+
+Engines that support `VARIANT` natively (Spark 4.0+, Flink 2.1+) read the
table directly using the
+Parquet VARIANT annotation and do not go through the Hive/BigQuery metastore
representation. The
+`read_blob()` SQL function is Spark-only. Hive and BigQuery queries on a BLOB
column read the
+underlying struct directly.
+
## Writer Version Table Property
Hudi 1.2.0 sync writes the table property `hudi_writer_version` (set to the
Hudi version that last synced the table) to the Hive metastore entry on every
sync. This allows tooling and metastore administrators to identify which Hudi
version wrote a given table.
diff --git a/website/docs/ai-quick-start-guide.md
b/website/docs/unstructured-data-quick-start-guide.md
similarity index 74%
rename from website/docs/ai-quick-start-guide.md
rename to website/docs/unstructured-data-quick-start-guide.md
index 9d8fa50ea16f..c0014ba815dc 100644
--- a/website/docs/ai-quick-start-guide.md
+++ b/website/docs/unstructured-data-quick-start-guide.md
@@ -1,21 +1,23 @@
---
-title: "AI Quick Start"
-keywords: [ hudi, ai, vector search, embeddings, unstructured data, blob,
machine learning, image search, similarity]
-summary: "Store embeddings and images in a Hudi table, then find similar
images with one SQL query using hudi_vector_search and read_blob"
+title: "Unstructured Data Quick Start"
+keywords: [ hudi, ai, machine learning, vector search, embeddings,
unstructured data, blob, image search, similarity]
+summary: "Store embeddings and images in a Hudi table, then find similar
images with hudi_vector_search and read_blob"
toc: true
last_modified_at: 2026-04-29T00:00:00-00:00
---
-This guide shows how Hudi 1.2.0 brings AI-native data types to the lakehouse.
You will store image
-embeddings (`VECTOR`) and raw image bytes (`BLOB`) in a single Hudi table,
then run a similarity
-search that finds the top-K nearest neighbors **and** materializes their
images — in one SQL query.
+AI and machine learning pipelines (RAG, recommendation, multimodal search)
need to store and query
+embeddings, raw bytes, and structured metadata side by side. Hudi's `VECTOR`
and `BLOB` column
+types let you keep all of those in a single transactional table. This guide
walks through them end
+to end: you will store image embeddings (`VECTOR`) and raw image bytes
(`BLOB`) in a single Hudi
+table, then run a top-K similarity search and materialize the matching images
in one SQL query.
:::tip
Want to try this locally? This guide is also available as an interactive
Jupyter notebook.
[Download the
notebook](https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/00_main_demo.ipynb)
and run it end-to-end on your machine.
:::
-
+
*Example output: a query image of a German Shorthaired pointer (left) and the
five most similar images found by `hudi_vector_search`, with cosine similarity
scores. Raw image bytes are materialized by `read_blob()` directly in the same
query.*
@@ -45,7 +47,7 @@ HUDI_JAR = os.getenv("HUDI_BUNDLE_JAR",
"hudi-spark3.5-bundle_2.12-1.2.0.jar")
spark = (
SparkSession.builder
- .appName("Hudi-AI-QuickStart")
+ .appName("Hudi-Unstructured-Data-QuickStart")
.config("spark.jars", HUDI_JAR)
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.config("spark.sql.extensions",
@@ -105,7 +107,7 @@ DIM = feats.shape[1] # 1024
## 3. Create table and insert data
-Declare `VECTOR(1024)` for embeddings and `BLOB` for raw image bytes —
first-class Hudi column types.
+Declare `VECTOR(1024)` for embeddings and `BLOB` for raw image bytes.
```python
from pyspark.sql import Row
@@ -155,9 +157,9 @@ SELECT image_id, category, label,
FROM staging;
```
-Key points:
-- **`VECTOR(1024)`** stores fixed-dimension embeddings for similarity search.
-- **`BLOB`** stores raw image bytes inline. For large objects, use
`OUT_OF_LINE` to store a pointer instead — `read_blob()` resolves both modes
transparently.
+Notes:
+- `VECTOR(1024)` stores fixed-dimension embeddings for similarity search.
+- `BLOB` stores raw image bytes inline. For large objects, use `OUT_OF_LINE`
to store a pointer instead. `read_blob()` resolves both modes transparently.
## 4. Materialize a BLOB with `read_blob()`
@@ -183,13 +185,14 @@ LIMIT 5;
+-----------+--------------------+----------+
```
-
+
-*Image bytes retrieved by `read_blob()`, decoded back to a PNG — confirming
lossless round-trip through the Hudi BLOB column.*
+*Image bytes retrieved by `read_blob()`, decoded back to a PNG. The round-trip
through the Hudi BLOB column is lossless.*
## 5. Vector search + BLOB retrieval in one query
-The headline: `hudi_vector_search` finds the top-K nearest neighbors by cosine
similarity, and `read_blob()` materializes image bytes **only for the matching
rows**.
+`hudi_vector_search` returns the top-K nearest neighbors by cosine similarity;
`read_blob()`
+materializes image bytes only for the matching rows.
```sql
SELECT image_id,
@@ -220,9 +223,7 @@ ORDER BY _hudi_distance;

-*Query: German Shorthaired pointer (left). Top-5 results ranked by cosine
similarity — the search correctly identifies same-breed images as the closest
matches.*
-
-One SQL query. No external vector database. Embeddings, raw images, and
metadata all live in one transactional Hudi table.
+*Query: German Shorthaired pointer (left). Top-5 results ranked by cosine
similarity.*
## 6. Visualize results
@@ -253,8 +254,8 @@ plt.savefig("hudi_vector_search_results.png", dpi=150)
| Topic | Link |
|:------|:-----|
| Full interactive notebook |
[00_main_demo.ipynb](https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/00_main_demo.ipynb)
|
-| VECTOR type reference | [Vector Search](vector_search.md) — element types,
batch TVF, distance metrics |
-| BLOB type reference | [Unstructured Data](blob_unstructured_data.md) —
inline vs. out-of-line, `read_blob()`, configs |
-| VARIANT type | [Semi-Structured Data](variant_type.md) — flexible JSON-like
storage for LLM outputs, model metadata |
-| Lance file format | [Lance File Format](lance_file_format.md) —
vector-optimized storage |
-| AI overview | [AI-Native Lakehouse](ai_overview.md) — architecture and use
cases |
+| VECTOR type reference | [VECTOR in SQL DDL](sql_ddl.md#vector) · [SQL
DML](sql_dml.md#inserting-vector-columns) · [DataFrame
writes](writing_data.md#vector-via-dataframe) · [`hudi_vector_search` in SQL
Queries](sql_queries.md#vector-similarity-search) |
+| BLOB type reference | [BLOB in SQL DDL](sql_ddl.md#blob) · [SQL
DML](sql_dml.md#inserting-blob-columns) · [DataFrame
writes](writing_data.md#blob-via-dataframe) · [`read_blob()` in SQL
Queries](sql_queries.md#reading-blob-columns) |
+| VARIANT type | [VARIANT in SQL DDL](sql_ddl.md#variant) · [SQL
DML](sql_dml.md#inserting-variant-columns) · [DataFrame
writes](writing_data.md#variant-via-dataframe) · [Querying
VARIANT](sql_queries.md#querying-variant-columns) |
+| Lance file format | [Storage Layouts →
Lance](storage_layouts.md#lance-base-file-format) · [DataFrame
writes](writing_data.md#lance-base-file-format-via-dataframe) |
+| AI lakehouse use cases | [Use Cases → AI
Lakehouse](use_cases.md#ai-lakehouse) |
diff --git a/website/docs/use_cases.md b/website/docs/use_cases.md
index 251df6187611..d0b6cecdee11 100644
--- a/website/docs/use_cases.md
+++ b/website/docs/use_cases.md
@@ -97,7 +97,42 @@ For the more curious, a more detailed explanation of the
benefits of _incrementa
- The idea with MoR is to reduce write costs/latencies, by writing delta logs
(Hudi), positional delete files (iceberg). Hudi employs about 4 types of
indexing to quickly locate the file that the updates records belong to. Formats
relying on a scan of the table can quickly bottleneck on write performance. e.g
updating 1GB into a 1TB table every 5-10 mins.
- Hudi is the only lakehouse storage system that natively supports event time
ordering and late data handling for streaming workloads where MoR is employed
heavily.
+## AI-Native Data Lakehouse
+The rise of large language models (LLMs) and multimodal AI has sparked a
fundamental paradigm shift in data processing, moving the data lakehouse's
primary workload from traditional ETL/BI analytics toward feature engineering,
model training, and Retrieval-Augmented Generation (RAG). Traditional data
lakehouses were natively designed around structured, tabular data, leaving
modern AI applications awkwardly fragmented. These advanced workloads requires
seamlessly combining structured meta [...]
+Oftentimes, managing this multi-modal reality forced engineering teams to
maintain separate, siloed infrastructure. Raw binaries were stored in unmanaged
object stores, while embeddings were synced out to standalone, external vector
databases. This fragmented architecture introduced a severe "sync dilemma": it
creates multiple data copies, drives up operational infrastructure costs,
suffers from a lack of real-time responsiveness, and entirely breaks
transactional consistency across the [...]
+Apache Hudi delivers an AI-native lakehouse architecture by supporting
multi-modal data types within the same table schema. Large binary objects
([`BLOB`](sql_ddl.md#blob)), dense semantic embeddings
([`VECTOR`](sql_ddl.md#vector)), and flexible semi-structured payloads
([`VARIANT`](sql_ddl.md#variant)) reside natively within the same Hudi data
files. By managing multi-modal data inside the same lakehouse rather than
separate infrastructure, unstructured data inherits Hudi's ACID transac [...]
+The table below serves as the technical entry point for interacting with
AI-native data types in Hudi:
+
+| Type | Declare | Write | Read |
+|:-----|:--------|:------|:-----|
+| [`VECTOR`](sql_ddl.md#vector) | [DDL](sql_ddl.md#vector) | [SQL
DML](sql_dml.md#inserting-vector-columns) ·
[DataFrame](writing_data.md#vector-via-dataframe) | [`hudi_vector_search`
TVF](sql_queries.md#vector-similarity-search) |
+| [`BLOB`](sql_ddl.md#blob) | [DDL](sql_ddl.md#blob) | [SQL
DML](sql_dml.md#inserting-blob-columns) ·
[DataFrame](writing_data.md#blob-via-dataframe) |
[`read_blob()`](sql_queries.md#reading-blob-columns) |
+| [`VARIANT`](sql_ddl.md#variant) | [DDL](sql_ddl.md#variant) | [SQL
DML](sql_dml.md#inserting-variant-columns) ·
[DataFrame](writing_data.md#variant-via-dataframe) | [Querying
VARIANT](sql_queries.md#querying-variant-columns) |
+| [Lance file format](storage_layouts.md#lance-base-file-format) |
[TBLPROPERTIES](sql_ddl.md#lance-base-file-format) |
[DataFrame](writing_data.md#lance-base-file-format-via-dataframe) |
(transparent; Spark-only) |
+
+For a worked example covering image embeddings, raw bytes, and similarity
search, see the
+[Unstructured Data Quick Start Guide](unstructured-data-quick-start-guide.md).
+
+| Use case | Types and features |
+|:---------|:-------------------|
+| Image / video search | `VECTOR` embeddings + `BLOB` raw frames +
`hudi_vector_search` |
+| Retrieval-augmented generation (RAG) | `VECTOR` chunk embeddings retrieved
to assemble LLM context |
+| LLM output management | `VARIANT` for response payloads + `VECTOR` for
semantic indexing |
+| Recommendation systems | `VECTOR` similarity for collaborative filtering;
incremental re-embedding via incremental queries |
+| Content moderation | `BLOB` raw content + `VECTOR` content embeddings +
incremental processing |
+| Multimodal analytics | Structured metadata + `VECTOR` embeddings + `BLOB`
raw data in one table |
+| ML feature store | `VECTOR` for feature embeddings, `VARIANT` for sparse
feature maps; time-travel for point-in-time retrieval |
+| Experiment tracking | `VARIANT` for heterogeneous model configs and metrics;
incremental queries for latest runs |
+| Data labeling pipelines | `BLOB` for raw data, incremental queries for
unlabeled data, transactional label updates |
+
+### Why Hudi for AI workloads
+
+- Unified storage across modalities: Raw binary objects
([`BLOB`](sql_ddl.md#blob)), semantic vector embeddings
([`VECTOR`](sql_ddl.md#vector)), semi-structured payloads
([`VARIANT`](sql_ddl.md#variant)), and standard analytical columns live
together in a single Hudi table, under the same [timeline](timeline.md),
[schema evolution](schema_evolution.md), and access controls.
+- Incremental embedding pipelines: Hudi's [incremental
query](sql_queries.md#incremental-query) lets feature pipelines re-embed only
the rows that changed since a given commit, instead of reprocessing the full
batch.
+- Multi-modal transactional guarantees: Vector updates, binary changes, and
metadata modifications commit atomically. A similarity query joining the
`VECTOR` column with the `BLOB` column never observes a state in which the
embedding refers to a deleted or stale binary asset.
+- Automated table services: [File sizing](file_sizing.md),
[cleaning](cleaning.md), [compaction](compaction.md), and
[clustering](clustering.md) apply to tables that contain `VECTOR`, `BLOB`, and
`VARIANT` columns.
+- Open, multi-engine access: Hudi tables are written and read by [Apache
Spark](quick-start-guide.md) and [Apache Flink](flink-quick-start-guide.md),
and by the upcoming native [Python / Rust client
(hudi-rs)](python-rust-quick-start-guide.md) reading the underlying Parquet
representation directly.
diff --git a/website/docs/variant_type.md b/website/docs/variant_type.md
deleted file mode 100644
index fdcf6f54b54b..000000000000
--- a/website/docs/variant_type.md
+++ /dev/null
@@ -1,362 +0,0 @@
----
-title: "Semi-Structured Data (VARIANT)"
-keywords: [ hudi, variant, semi-structured, json, schemaless, shredding,
parse_json, flexible schema]
-summary: "Store and query semi-structured JSON-like data in Hudi tables using
the VARIANT type, with optional shredding for query performance"
-toc: true
-last_modified_at: 2026-05-27T00:00:00-00:00
----
-
-import Tabs from '@theme/Tabs';
-import TabItem from '@theme/TabItem';
-
-The `VARIANT` type stores semi-structured, JSON-like data directly in Hudi
tables. Unlike rigid schemas
-where every column and type must be declared upfront, VARIANT columns accept
arbitrary nested structures —
-objects, arrays, scalars — and store them efficiently in a self-describing
binary format.
-
-This is particularly useful for AI and data engineering workloads where data
shapes evolve rapidly:
-model metadata, LLM outputs, feature dictionaries, API responses, and event
payloads.
-
-## Overview
-
-VARIANT represents semi-structured data as two binary fields:
-
-| Field | Description |
-|:------|:------------|
-| `metadata` | Encodes field names, types, and structure for efficient access |
-| `value` | The actual data payload |
-
-This encoding supports any JSON-compatible value: objects, arrays, strings,
numbers, booleans, and null.
-Unlike storing raw JSON strings, VARIANT provides type-aware binary storage
with optional **shredding**
-for columnar query performance.
-
-## Creating Tables with VARIANT Columns
-
-<Tabs
-groupId="api-style"
-defaultValue="sql4"
-values={[
-{ label: 'Spark SQL (4.0+)', value: 'sql4', },
-{ label: 'Spark SQL (3.x)', value: 'sql3', },
-{ label: 'DataFrame API', value: 'dataframe', },
-]}
->
-<TabItem value="sql4">
-
-Spark 4.0+ has native `VARIANT` type support:
-
-```sql
-CREATE TABLE events (
- event_id STRING,
- payload VARIANT,
- ts BIGINT
-) USING hudi
-TBLPROPERTIES (
- primaryKey = 'event_id',
- preCombineField = 'ts'
-);
-```
-
-</TabItem>
-<TabItem value="sql3">
-
-On Spark 3.x, use the struct representation with metadata tagging:
-
-```sql
-CREATE TABLE events (
- event_id STRING,
- payload STRUCT<value: BINARY, metadata: BINARY>,
- ts BIGINT
-) USING hudi
-TBLPROPERTIES (
- primaryKey = 'event_id',
- preCombineField = 'ts'
-);
-```
-
-Hudi recognizes this struct pattern and treats it as a logical VARIANT.
-
-</TabItem>
-<TabItem value="dataframe">
-
-```python
-from pyspark.sql.types import *
-
-# Option 1: Spark 4.0+ native VariantType
-# schema = StructType([
-# StructField("event_id", StringType()),
-# StructField("payload", VariantType()),
-# StructField("ts", LongType()),
-# ])
-
-# Option 2: Struct with metadata tag (works on Spark 3.x and 4.0+)
-from pyspark.sql.types import MetadataBuilder
-variant_metadata = MetadataBuilder() \
- .putString("hudi_type", "VARIANT") \
- .build()
-variant_struct = StructType([
- StructField("metadata", BinaryType()),
- StructField("value", BinaryType()),
-])
-schema = StructType([
- StructField("event_id", StringType()),
- StructField("payload", variant_struct, metadata=variant_metadata),
- StructField("ts", LongType()),
-])
-```
-
-</TabItem>
-</Tabs>
-
-## Writing VARIANT Data
-
-<Tabs
-groupId="api-style"
-defaultValue="sql4"
-values={[
-{ label: 'Spark SQL (4.0+)', value: 'sql4', },
-{ label: 'DataFrame API', value: 'dataframe', },
-]}
->
-<TabItem value="sql4">
-
-Use `parse_json()` to convert JSON strings into VARIANT values:
-
-```sql
-INSERT INTO events VALUES
- ('evt_001', parse_json('{"action": "click", "x": 120, "y": 450}'), 1000),
- ('evt_002', parse_json('{"action": "purchase", "items": ["sku_a",
"sku_b"], "total": 59.99}'), 1001),
- ('evt_003', parse_json('"simple string value"'), 1002);
-```
-
-VARIANT accepts any valid JSON: objects, arrays, strings, numbers, booleans,
and null.
-
-</TabItem>
-<TabItem value="dataframe">
-
-```python
-df = spark.sql("""
- SELECT
- 'evt_001' as event_id,
- parse_json('{"action": "click", "x": 120, "y": 450}') as payload,
- 1000 as ts
-""")
-
-df.write.format("hudi") \
- .option("hoodie.table.name", "events") \
- .option("hoodie.datasource.write.recordkey.field", "event_id") \
- .option("hoodie.datasource.write.precombine.field", "ts") \
- .mode("append") \
- .save("/path/to/events")
-```
-
-</TabItem>
-</Tabs>
-
-## Querying VARIANT Data
-
-### Reading VARIANT as JSON
-
-Cast VARIANT to string for JSON output:
-
-```sql
-SELECT event_id, cast(payload as STRING) as payload_json
-FROM events;
-```
-
-```
-+----------+--------------------------------------------------------------+
-| event_id| payload_json|
-+----------+--------------------------------------------------------------+
-| evt_001| {"action":"click","x":120,"y":450} |
-| evt_002| {"action":"purchase","items":["sku_a","sku_b"],"total":59.99}|
-| evt_003| "simple string value" |
-+----------+--------------------------------------------------------------+
-```
-
-### Full DML Support
-
-VARIANT columns support all standard DML operations:
-
-```sql
--- UPDATE
-UPDATE events SET payload = parse_json('{"action": "click", "x": 200}')
-WHERE event_id = 'evt_001';
-
--- DELETE
-DELETE FROM events WHERE event_id = 'evt_003';
-
--- MERGE
-MERGE INTO events target
-USING new_events source ON target.event_id = source.event_id
-WHEN MATCHED THEN UPDATE SET payload = source.payload, ts = source.ts
-WHEN NOT MATCHED THEN INSERT *;
-```
-
-### Table Type Support
-
-VARIANT works with both COW (Copy-on-Write) and MOR (Merge-on-Read) table
types. MOR tables
-generate log files for VARIANT changes, which are merged during compaction.
-
-## VARIANT Shredding
-
-Shredding extracts frequently-accessed fields from VARIANT into dedicated
typed Parquet columns. This
-gives columnar query performance for known fields while retaining the
flexibility of VARIANT for
-the rest.
-
-### Unshredded (Default)
-
-```
-group variant {
- required binary metadata;
- required binary value;
-}
-```
-
-All data is in the opaque binary fields. Every query must decode the full
VARIANT.
-
-### Shredded (Scalar)
-
-```
-group variant {
- required binary metadata;
- optional binary value;
- optional int64 typed_value; -- extracted scalar field
-}
-```
-
-The `typed_value` column stores a typed extraction that Spark can read
directly without decoding
-the binary payload.
-
-### Shredded (Object)
-
-```
-group variant {
- required binary metadata;
- optional binary value;
- optional group typed_value {
- optional group action {
- optional binary value;
- optional string typed_value; -- "click", "purchase", etc.
- }
- optional group total {
- optional binary value;
- optional double typed_value; -- 59.99
- }
- }
-}
-```
-
-Each known field gets its own sub-column. Fields not present in a given row
fall back to the
-binary `value` field.
-
-## Cross-Engine Compatibility
-
-| Engine | VARIANT Support |
-|:-------|:---------------|
-| **Spark 4.0** | Native `VariantType` — full read/write/query for COW and
MOR; native `df.write` with `VariantType` on the V1 datasource |
-| **Spark 4.1** | Native `VariantType` — full read/write/query for COW and MOR
|
-| **Spark 3.x** | Reads as `STRUCT<value: BINARY, metadata: BINARY>` —
backward compatible |
-| **Flink** | Native `VARIANT` operations are not supported. Tables written by
Spark with VARIANT columns can be read in Flink only as the underlying
`ROW<metadata BYTES, value BYTES>` struct. |
-
-A VARIANT table written by Spark 4.0/4.1 can be read by Spark 3.x using the
underlying binary struct, or by Flink as `ROW<metadata BYTES, value BYTES>`.
The binary encoding is engine-independent.
-
-## Metastore Sync
-
-When syncing VARIANT column schemas to external catalogs, Hudi maps the binary
encoding to the
-target catalog's native struct type:
-
-| Catalog | VARIANT representation |
-|:--------|:----------------------|
-| Hive | `STRUCT<metadata:BINARY, value:BINARY>` |
-| BigQuery | `STRUCT` with `metadata` and `value` fields (`BYTES` type) |
-
-Query engines that support VARIANT (Spark 4.0+, Flink 2.1+) read the table
directly using the
-Parquet VARIANT annotation and do not go through the Hive/BigQuery metastore
representation.
-
-## Use Cases for AI Workloads
-
-### LLM Output Storage
-
-Store raw LLM responses with varying structures:
-
-```sql
-CREATE TABLE llm_outputs (
- request_id STRING,
- model STRING,
- response VARIANT,
- tokens_used INT,
- ts BIGINT
-) USING hudi TBLPROPERTIES (primaryKey = 'request_id', preCombineField = 'ts');
-
-INSERT INTO llm_outputs VALUES (
- 'req_001', 'claude-sonnet',
- parse_json('{"text": "...", "stop_reason": "end_turn", "usage": {"input":
500, "output": 200}}'),
- 700, 1000
-);
-```
-
-### Model Metadata & Experiment Tracking
-
-Store heterogeneous model configurations and metrics:
-
-```sql
-CREATE TABLE experiments (
- run_id STRING,
- model_config VARIANT, -- hyperparameters, architecture, etc.
- metrics VARIANT, -- loss, accuracy, custom metrics
- ts BIGINT
-) USING hudi TBLPROPERTIES (primaryKey = 'run_id', preCombineField = 'ts');
-```
-
-### Feature Dictionaries
-
-Store sparse or variable-length feature maps for ML:
-
-```sql
-CREATE TABLE user_features (
- user_id STRING,
- features VARIANT, -- {"age": 25, "interests": [...], "embedding":
[...]}
- updated_at BIGINT
-) USING hudi TBLPROPERTIES (primaryKey = 'user_id', preCombineField =
'updated_at');
-```
-
-### API Response Archival
-
-Ingest and store API responses with evolving schemas:
-
-```sql
-CREATE TABLE api_responses (
- request_id STRING,
- endpoint STRING,
- status_code INT,
- body VARIANT,
- ts BIGINT
-) USING hudi TBLPROPERTIES (primaryKey = 'request_id', preCombineField = 'ts');
-```
-
-## Best Practices
-
-1. **Use VARIANT for evolving schemas** — When the data shape changes
frequently or varies across
- records, VARIANT avoids constant schema migrations.
-
-2. **Prefer typed columns for frequently-queried fields** — If you always
filter on a specific field,
- make it a top-level typed column rather than burying it in VARIANT.
Combine: use typed columns for
- stable fields and VARIANT for the rest.
-
-3. **Consider shredding for hot fields** — If certain VARIANT fields are
queried heavily, shredding
- extracts them into columnar storage for better performance.
-
-4. **Use `parse_json()` for ingestion** — On Spark 4.0+, `parse_json()` is the
standard way to create
- VARIANT values from JSON strings.
-
-5. **Use `cast(v as STRING)` for export** — Convert VARIANT back to JSON for
downstream consumers
- that expect text.
-
-## Limitations
-
-- Native `VARIANT` keyword in DDL requires Spark 4.0+. On Spark 3.x, use the
struct representation.
-- VARIANT shredding configuration is determined at write time based on the
schema definition.
-- Complex path expressions within VARIANT may require casting to STRING and
then using JSON functions.
-- Native VARIANT operations are not supported on Flink. VARIANT columns
surface as `ROW<metadata BYTES, value BYTES>` and can be read but not natively
decoded or queried as a variant.
-- VARIANT columns are **not supported** on Lance-backed tables. Use Parquet as
the base file format
- for tables containing VARIANT columns.
diff --git a/website/docs/vector_search.md b/website/docs/vector_search.md
deleted file mode 100644
index 4e6687057a6c..000000000000
--- a/website/docs/vector_search.md
+++ /dev/null
@@ -1,274 +0,0 @@
----
-title: "Vector Search"
-keywords: [ hudi, vector, search, embeddings, similarity, cosine, nearest
neighbor, VECTOR type]
-summary: "Store embedding vectors in Hudi tables and run vector similarity
search using the VECTOR type and hudi_vector_search TVF"
-toc: true
-last_modified_at: 2026-05-27T00:00:00-00:00
----
-
-import Tabs from '@theme/Tabs';
-import TabItem from '@theme/TabItem';
-
-Hudi's `VECTOR` type and `hudi_vector_search` table-valued function (TVF)
bring native similarity search
-to the data lakehouse. Store embeddings alongside your structured data and
query them with familiar Spark SQL —
-no external vector database required.
-
-## Storage Format
-
-VECTOR columns are stored in Parquet as `FIXED_LEN_BYTE_ARRAY` — a
fixed-length binary encoding of the
-float array. Hudi stamps `hudi_type` metadata on the column so the Spark
reader knows to decode the
-bytes back into a typed array.
-
-On **Lance** tables, VECTOR columns are stored natively as Lance
`FixedSizeList<Float32/Float64, dim>`,
-so embeddings are written without conversion overhead at the file-format
layer. See
-[Lance File Format](lance_file_format.md) for details.
-
-The `VECTOR(dim[, elementType])` DDL syntax works across Spark 3.4, 3.5, 4.0,
and 4.1. Hudi's SQL
-parser normalizes `VECTOR(128, FLOAT)` to `VECTOR(128)` (FLOAT is the default
element type).
-Nesting VECTOR inside STRUCT, ARRAY, or MAP is not supported.
-
-## VECTOR Type
-
-The `VECTOR(dim[, elementType])` type declares a column that stores
fixed-dimensional embedding vectors.
-Dimension metadata enables the query engine to validate inputs and optimize
search.
-
-### Element Types
-
-| Element Type | Description | Storage | Use Case |
-|:-------------|:------------|:--------|:---------|
-| **FLOAT** (default) | 32-bit float | `ArrayType(FloatType)` | Standard
embeddings (OpenAI, Cohere, etc.) |
-| **DOUBLE** | 64-bit double | `ArrayType(DoubleType)` | High-precision
scientific embeddings |
-| **INT8** / **BYTE** | 8-bit signed integer | `ArrayType(ByteType)` |
Quantized embeddings for storage efficiency |
-
-```sql
--- Default (FLOAT)
-embedding VECTOR(768)
-
--- Explicit element types
-embedding VECTOR(768, FLOAT)
-embedding VECTOR(768, DOUBLE)
-embedding VECTOR(256, INT8)
-```
-
-### Declaring VECTOR Columns
-
-<Tabs
-groupId="api-style"
-defaultValue="sql"
-values={[
-{ label: 'Spark SQL', value: 'sql', },
-{ label: 'DataFrame API', value: 'dataframe', },
-]}
->
-<TabItem value="sql">
-
-```sql
-CREATE TABLE products (
- product_id STRING,
- name STRING,
- description STRING,
- embedding VECTOR(768)
-) USING hudi
-TBLPROPERTIES (
- primaryKey = 'product_id',
- type = 'cow',
- hoodie.record.merger.impls = 'org.apache.hudi.DefaultSparkRecordMerger',
- hoodie.datasource.write.base.file.format = 'parquet'
-);
-```
-
-When using SQL DDL, Hudi's parser automatically stamps the `VECTOR(dim)`
metadata on the column.
-
-</TabItem>
-<TabItem value="dataframe">
-
-```python
-import pyarrow as pa
-
-schema = pa.schema([
- pa.field("product_id", pa.string()),
- pa.field("name", pa.string()),
- pa.field("embedding", pa.list_(pa.float32()),
- metadata={b"hudi_type": b"VECTOR(768)"}),
-])
-```
-
-When using the DataFrame API, you must manually stamp `hudi_type` metadata on
the column via PyArrow.
-This metadata is what distinguishes a `VECTOR` column from a regular array
column.
-
-</TabItem>
-</Tabs>
-
-### Writing Vectors
-
-Vectors are written as arrays of floats. Both the DataFrame API and SQL accept
standard array syntax:
-
-```sql
-INSERT INTO products VALUES (
- 'prod_001', 'Running Shoes', 'Lightweight trail runner',
- ARRAY(0.123, -0.456, 0.789, ...) -- 768 floats
-);
-```
-
-## hudi_vector_search TVF
-
-The `hudi_vector_search` table-valued function returns the `top_k` rows from a
Hudi table whose
-VECTOR column is closest to a given query vector under a chosen distance
metric.
-
-### Syntax
-
-```sql
-SELECT *
-FROM hudi_vector_search(
- table_name, -- STRING: name of the Hudi table
- vector_column, -- STRING: name of the VECTOR column
- query_vector, -- ARRAY<FLOAT>: the query embedding
- top_k, -- INT: number of nearest neighbors to return
- [distance_metric], -- STRING: 'cosine' (default), 'l2', or 'dot_product'
- [algorithm] -- STRING: 'brute_force' (default)
-)
-```
-
-### Parameters
-
-| Parameter | Type | Default | Description |
-|:----------|:-----|:--------|:------------|
-| `table_name` | STRING | (required) | The Hudi table to search. Can be a
registered table name or a path. |
-| `vector_column` | STRING | (required) | The name of the VECTOR column to
search against. |
-| `query_vector` | ARRAY<FLOAT> | (required) | The query embedding. Must
match the declared dimension and element type of the VECTOR column. |
-| `top_k` | INT | (required) | The number of nearest neighbors to return. Must
be a positive integer. |
-| `distance_metric` | STRING | `'cosine'` | Distance metric: `'cosine'`,
`'l2'`, or `'dot_product'`. |
-| `algorithm` | STRING | `'brute_force'` | Search algorithm. Currently only
`'brute_force'` is supported. |
-
-### Return Schema
-
-The TVF returns all columns from the source table (excluding the embedding
column) plus:
-
-| Column | Type | Description |
-|:-------|:-----|:------------|
-| `_hudi_distance` | DOUBLE | The computed distance between the query vector
and each result. Lower values indicate greater similarity. |
-
-Results are ordered by `_hudi_distance` ascending — closest matches first.
-
-## hudi_vector_search_batch TVF
-
-For searching with multiple query vectors at once, use the batch variant:
-
-### Syntax
-
-```sql
-SELECT *
-FROM hudi_vector_search_batch(
- corpus_table, -- STRING: table to search
- corpus_embedding_col, -- STRING: VECTOR column in corpus
- query_table, -- STRING: table containing query vectors
- query_embedding_col, -- STRING: VECTOR column in query table
- top_k, -- INT: neighbors per query
- [distance_metric], -- STRING: 'cosine' (default)
- [algorithm] -- STRING: 'brute_force' (default)
-)
-```
-
-### Return Schema (Batch)
-
-Returns corpus columns + query columns + distance info:
-
-| Column | Type | Description |
-|:-------|:-----|:------------|
-| `_hudi_distance` | DOUBLE | Distance between query and corpus vector |
-| `_hudi_query_index` | LONG | Index identifying which query vector produced
this result |
-
-If corpus and query tables share column names, query columns are prefixed with
`_hudi_query_`.
-
-### Distance Metrics
-
-| Metric | Formula | Range | Best for |
-|:-------|:--------|:------|:---------|
-| **cosine** | 1 - cos(a, b), clamped to [0, 2] | [0, 2] | Normalized
embeddings (most common). Returns 1.0 for zero vectors. |
-| **l2** | sqrt(sum((a[i] - b[i])^2)) | [0, +inf) | Raw (unnormalized)
embeddings |
-| **dot_product** | -(a · b) | (-inf, +inf) | Maximum inner product
search. Negated so ascending sort = most similar. |
-
-:::tip
-For best results with cosine distance, **L2-normalize your embeddings** before
writing them to the table.
-Most embedding models (OpenAI, Cohere, sentence-transformers) output
normalized vectors by default.
-If yours does not, normalize during ingestion:
-
-```python
-embedding = embedding / np.linalg.norm(embedding)
-```
-:::
-
-### Examples
-
-**Find similar products:**
-
-```sql
-SELECT product_id, name, _hudi_distance AS distance
-FROM hudi_vector_search(
- 'products', 'embedding',
- ARRAY(0.12, -0.03, 0.87, ...), -- query embedding
- 10, -- top 10
- 'cosine'
-)
-ORDER BY distance;
-```
-
-**RAG context retrieval:**
-
-```sql
--- Retrieve the 5 most relevant document chunks for an LLM prompt
-SELECT chunk_id, text_content, _hudi_distance
-FROM hudi_vector_search(
- 'document_chunks', 'embedding',
- ARRAY(...), -- embedding of the user's question
- 5, 'cosine'
-)
-WHERE _hudi_distance < 0.3; -- optional distance threshold
-```
-
-**Cross-modal search (text-to-image):**
-
-```sql
--- Using CLIP embeddings, find images matching a text query
-SELECT image_id, caption, _hudi_distance
-FROM hudi_vector_search(
- 'image_catalog', 'clip_embedding',
- ARRAY(...), -- text embedding from CLIP
- 20, 'cosine'
-);
-```
-
-## Best Practices
-
-1. **Normalize embeddings** — Pre-normalize embeddings (L2 norm = 1) for
cosine distance. This yields
- more consistent results and slightly faster search.
-
-2. **Right-size your dimensions** — Higher dimensions capture more information
but increase storage and
- search cost. Many use cases work well with 384–1024 dimensions.
-
-3. **Use incremental processing** — When new data arrives, only embed and
write the new records.
- Hudi's incremental query capabilities make this straightforward.
-
-## Constraints
-
-- VECTOR columns must be **top-level fields** — nesting inside STRUCT, ARRAY,
or MAP is not supported.
-- The query vector's element type must **exactly match** the corpus
embedding's element type (no implicit casting).
-- VECTOR dimension and element type **cannot be changed** after table creation
via schema evolution.
-- **Flink cannot read VECTOR columns.** VECTOR data is stored as Parquet
`FIXED_LEN_BYTE_ARRAY`, which
- Flink's Parquet reader does not decode back into a typed array. Flink can
still read all **other**
- columns in a table that contains a VECTOR column — only the VECTOR column
itself is inaccessible.
- Use Spark to query VECTOR columns.
-
-## Metastore Sync
-
-When syncing VECTOR column schemas to external catalogs, Hudi maps the binary
encoding to the
-target catalog's native binary type, preserving the original VECTOR metadata
in table properties:
-
-| Catalog | VECTOR representation |
-|:--------|:---------------------|
-| Hive | `BINARY` |
-| BigQuery | `BYTES` |
-
-The `VECTOR(dim, elementType)` dimension and element-type metadata is
preserved in
-`TBLPROPERTIES`/table descriptions so the table can be correctly reconstructed
by Spark even after
-a metastore round-trip.
diff --git a/website/docs/writing_data.md b/website/docs/writing_data.md
index dcdc5a33bd81..a0a3e1ca985c 100644
--- a/website/docs/writing_data.md
+++ b/website/docs/writing_data.md
@@ -418,6 +418,137 @@ Then any record you want to delete you can mark
`_hoodie_is_deleted` as true:
{"ts": 0.0, "uuid": "19tdb048-c93e-4532-adf9-f61ce6afe10", "rank": 1045,
"partitionpath": "americas/brazil/sao_paulo", "_hoodie_is_deleted" : true}
```
+### Writing VECTOR, BLOB, and VARIANT Columns
+
+`VECTOR`, `BLOB`, and `VARIANT` columns can be written using SQL `INSERT`, see
[SQL DML](sql_dml.md#inserting-vector-columns); the
+DataFrame API equivalents are below.
+
+#### VECTOR via DataFrame
+
+Add `hudi_type` metadata on the VECTOR column so the writer recognizes it:
+
+```python
+import pyarrow as pa
+
+schema = pa.schema([
+ pa.field("product_id", pa.string()),
+ pa.field("embedding", pa.list_(pa.float32()),
+ metadata={b"hudi_type": b"VECTOR(768)"}),
+])
+```
+
+#### BLOB via DataFrame
+
+A BLOB column is internally a struct (see [BLOB](sql_ddl.md#blob)). Build it
as a Spark `Row`:
+
+```python
+from pyspark.sql import Row
+
+with open("logo.png", "rb") as f:
+ raw_bytes = f.read()
+
+row = Row(
+ asset_id="asset_001",
+ file_name="logo.png",
+ mime_type="image/png",
+ file_size=len(raw_bytes),
+ content=Row(type="INLINE", data=raw_bytes, reference=None),
+)
+```
+
+For PyArrow schemas, declare the struct explicitly:
+
+```python
+import pyarrow as pa
+
+schema = pa.schema([
+ pa.field("asset_id", pa.string()),
+ pa.field("file_name", pa.string()),
+ pa.field("mime_type", pa.string()),
+ pa.field("file_size", pa.int64()),
+ pa.field("content", pa.struct([
+ pa.field("type", pa.string()),
+ pa.field("data", pa.binary()),
+ pa.field("reference", pa.struct([
+ pa.field("external_path", pa.string()),
+ pa.field("offset", pa.int64()),
+ pa.field("length", pa.int64()),
+ pa.field("managed", pa.bool_()),
+ ])),
+ ]), metadata={b"hudi_type": b"BLOB"}),
+])
+```
+
+#### VARIANT via DataFrame
+
+VARIANT writes via the DataFrame API require Spark 4.0+. Use native
`VariantType`:
+
+```python
+from pyspark.sql.types import StructType, StructField, StringType, LongType,
VariantType
+
+schema = StructType([
+ StructField("event_id", StringType()),
+ StructField("payload", VariantType()),
+ StructField("ts", LongType()),
+])
+```
+
+Alternatively, declare the underlying struct and tag the outer field with
`hudi_type=VARIANT`. The
+struct must contain exactly two **non-nullable** `BinaryType` fields named
`metadata` and `value`,
+otherwise the writer throws `IllegalArgumentException: Invalid variant schema
structure`. The column
+round-trips as native `VariantType` on read:
+
+```python
+from pyspark.sql.types import StructType, StructField, StringType, LongType,
BinaryType, MetadataBuilder
+
+variant_metadata = MetadataBuilder().putString("hudi_type", "VARIANT").build()
+variant_struct = StructType([
+ StructField("metadata", BinaryType(), nullable=False),
+ StructField("value", BinaryType(), nullable=False),
+])
+schema = StructType([
+ StructField("event_id", StringType()),
+ StructField("payload", variant_struct, metadata=variant_metadata),
+ StructField("ts", LongType()),
+])
+```
+
+The simplest path for constructing VARIANT values from JSON strings is to
build the DataFrame via
+SQL and then write it:
+
+```python
+df = spark.sql("""
+ SELECT 'evt_001' AS event_id,
+ parse_json('{"action": "click", "x": 120, "y": 450}') AS payload,
+ 1000 AS ts
+""")
+df.write.format("hudi") \
+ .option("hoodie.table.name", "events") \
+ .option("hoodie.datasource.write.recordkey.field", "event_id") \
+ .option("hoodie.datasource.write.precombine.field", "ts") \
+ .mode("append") \
+ .save("/path/to/table")
+```
+
+#### Lance base file format via DataFrame
+
+Set `hoodie.table.base.file.format=lance` on the write options:
+
+```python
+(df.write
+ .format("hudi")
+ .option("hoodie.table.name", "my_ai_table")
+ .option("hoodie.datasource.write.recordkey.field", "id")
+ .option("hoodie.record.merger.impls",
+ "org.apache.hudi.DefaultSparkRecordMerger")
+ .option("hoodie.table.base.file.format", "lance")
+ .mode("overwrite")
+ .save("/path/to/my_ai_table"))
+```
+
+See [Storage Layouts → Lance](storage_layouts.md#lance-base-file-format) for
full Lance behavior
+and configs.
+
### Concurrency Control
Following is an example of how to use `optimistic_concurrency_control` via
Spark DataSource API.
@@ -467,6 +598,8 @@ The following advanced storage configuration options were
added in Hudi 1.2.0:
| Config | Default | Description |
|---|---|---|
| `hoodie.parquet.write.config.injector.class` | (none) | Fully-qualified
class name of a custom `HoodieParquetConfigInjector` implementation. Use this
to inject custom Parquet writer properties (e.g., disable dictionary encoding,
set bloom filter sizes) without modifying the Hudi source. The implementing
class must implement `org.apache.hudi.io.HoodieParquetConfigInjector`. |
+| `hoodie.table.base.file.format` | `parquet` | Base file format for the
table. Accepts `parquet`, `orc`, `hfile`, or `lance`. See [Storage Layouts →
Lance](storage_layouts.md#lance-base-file-format) for the Lance-specific
options. |
+
## Java Client
We can use plain java to write to hudi tables. To use Java client we can
refere
[here](https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-java/src/main/java/org/apache/hudi/examples/java/HoodieJavaWriteClientExample.java)
diff --git a/website/docusaurus.config.js b/website/docusaurus.config.js
index f0e054a7c1e4..48d9028f3659 100644
--- a/website/docusaurus.config.js
+++ b/website/docusaurus.config.js
@@ -144,6 +144,10 @@ module.exports = {
from: ["/docs/next/indexing"],
to: "/docs/next/indexes",
},
+ {
+ from: ["/docs/next/ai-quick-start-guide"],
+ to: "/docs/next/unstructured-data-quick-start-guide",
+ },
{
from: ["/docs/releases", "/docs/next/releases"],
to: "/releases/release-1.1",
diff --git a/website/sidebars.js b/website/sidebars.js
index b4e8379b571b..36cffe3aead2 100644
--- a/website/sidebars.js
+++ b/website/sidebars.js
@@ -16,23 +16,12 @@ module.exports = {
'quick-start-guide',
'flink-quick-start-guide',
'python-rust-quick-start-guide',
- 'ai-quick-start-guide',
+ 'unstructured-data-quick-start-guide',
'docker_demo',
"notebooks",
'use_cases',
],
},
- {
- type: 'category',
- label: 'AI & Unstructured Data',
- items: [
- 'ai_overview',
- 'vector_search',
- 'blob_unstructured_data',
- 'variant_type',
- 'lance_file_format',
- ],
- },
{
type: 'category',
label: 'Design & Concepts',