(hudi) branch asf-site updated: docs: Restructure and revise 1.2.0 Unstructured Data pages (#18876)

yihua Thu, 28 May 2026 23:19:07 -0700

This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new c1199fed37d3 docs: Restructure and revise 1.2.0 Unstructured Data 
pages (#18876)
c1199fed37d3 is described below

commit c1199fed37d303be4e5a6a32565b9668c3086b48
Author: Y Ethan Guo <[email protected]>
AuthorDate: Thu May 28 23:18:26 2026 -0700

    docs: Restructure and revise 1.2.0 Unstructured Data pages (#18876)
---
 website/docs/ai_overview.md                        | 188 -----------
 website/docs/blob_unstructured_data.md             | 350 --------------------
 website/docs/ingestion_flink.md                    |  20 ++
 website/docs/lance_file_format.md                  | 220 -------------
 website/docs/overview.mdx                          |  13 +-
 website/docs/sql_ddl.md                            | 192 ++++++++++-
 website/docs/sql_dml.md                            |  75 +++++
 website/docs/sql_queries.md                        | 182 ++++++++++-
 website/docs/storage_layouts.md                    |  92 ++++++
 website/docs/syncing_metastore.md                  |  21 ++
 ...e.md => unstructured-data-quick-start-guide.md} |  47 +--
 website/docs/use_cases.md                          |  35 ++
 website/docs/variant_type.md                       | 362 ---------------------
 website/docs/vector_search.md                      | 274 ----------------
 website/docs/writing_data.md                       | 133 ++++++++
 website/docusaurus.config.js                       |   4 +
 website/sidebars.js                                |  13 +-
 17 files changed, 762 insertions(+), 1459 deletions(-)

diff --git a/website/docs/ai_overview.md b/website/docs/ai_overview.md
deleted file mode 100644
index c28bcae6543a..000000000000
--- a/website/docs/ai_overview.md
+++ /dev/null
@@ -1,188 +0,0 @@
----
-title: "AI-Native Lakehouse"
-keywords: [ hudi, ai, machine learning, vector search, embeddings, 
unstructured data, blob, lance, multimodal, lakehouse]
-summary: "Apache Hudi brings first-class AI and unstructured data support to 
the data lakehouse — store embeddings, images, audio, and run vector search 
natively"
-toc: true
-last_modified_at: 2026-04-25T00:00:00-00:00
----
-
-Modern AI workloads demand more from the data lakehouse than traditional 
analytics ever did. Teams building
-retrieval-augmented generation (RAG) pipelines, recommendation systems, 
content moderation, and multimodal
-search need to store and query **embeddings**, **images**, **audio**, 
**video**, and **documents** alongside
-structured data — all with the transactional guarantees, incremental 
processing, and table services that
-Hudi is known for.
-
-Apache Hudi's AI-native capabilities bring this vision to life with four 
foundational building blocks:
-
-## Core Capabilities
-
-### VECTOR Type and Similarity Search
-
-Store high-dimensional embedding vectors as first-class column types and run 
vector similarity
-search directly in Spark SQL.
-
-```sql
--- Declare an embedding column
-CREATE TABLE products (
-    id        STRING,
-    name      STRING,
-    embedding VECTOR(768)
-) USING hudi;
-
--- Find the 10 most similar products
-SELECT id, name, _hudi_distance
-FROM hudi_vector_search('products', 'embedding', ARRAY(...), 10, 'cosine');
-```
-
-No external vector database required. Your embeddings live in the same table 
as your structured data, governed
-by the same transactions, schemas, and access controls.
-
-**[Learn more about Vector Search &#8594;](vector_search.md)**
-
-### BLOB Type for Unstructured Data
-
-Store raw binary data (images, PDFs, audio clips, model weights) directly in 
Hudi tables using the `BLOB` type.
-Choose between two storage modes:
-
-| Mode | How it works | Best for |
-|:-----|:-------------|:---------|
-| **Inline** | Bytes stored directly in the table row | Small objects 
(thumbnails, short audio clips) |
-| **Out-of-line** | Table stores a pointer; `read_blob()` resolves it on 
demand | Large objects (high-res images, video, model checkpoints) |
-
-```sql
-CREATE TABLE documents (
-    doc_id   STRING,
-    content  BLOB,
-    summary  STRING
-) USING hudi;
-
--- Read the raw bytes on demand
-SELECT doc_id, read_blob(content) AS raw_bytes FROM documents WHERE doc_id = 
'doc_001';
-```
-
-Out-of-line BLOBs keep your table footprint small (often less than 1% of total 
data size) while maintaining
-full queryability.
-
-**[Learn more about Unstructured Data &#8594;](blob_unstructured_data.md)**
-
-### VARIANT Type for Semi-Structured Data
-
-AI pipelines often deal with data whose shape is not known in advance: LLM 
outputs, model metadata,
-feature dictionaries, API responses. The `VARIANT` type stores 
semi-structured, JSON-like data with
-full transactional support — no rigid schema required.
-
-```sql
-CREATE TABLE llm_outputs (
-    request_id  STRING,
-    response    VARIANT,
-    ts          BIGINT
-) USING hudi;
-
--- Store any JSON structure
-INSERT INTO llm_outputs VALUES (
-    'req_001',
-    parse_json('{"text": "...", "stop_reason": "end_turn", "tokens": 700}'),
-    1000
-);
-
--- Query back as JSON
-SELECT request_id, cast(response as STRING) FROM llm_outputs;
-```
-
-VARIANT supports optional **shredding** to extract hot fields into typed 
columnar storage for better
-query performance, while keeping the flexibility for everything else.
-
-**[Learn more about Semi-Structured Data &#8594;](variant_type.md)**
-
-### Lance File Format
-
-Hudi's pluggable file format architecture supports **Lance**, a modern 
columnar format purpose-built for
-AI/ML workloads. Lance provides:
-
-- Native vector column encoding (`FixedSizeList`) — no conversion overhead at 
the file-format layer
-- Fast random access for training data sampling
-- Optimized storage for high-dimensional arrays and nested structures
-
-```sql
-CREATE TABLE embeddings (...) USING hudi
-TBLPROPERTIES (
-    hoodie.datasource.write.base.file.format = 'lance'
-);
-```
-
-Lance integrates seamlessly with Hudi's table services (compaction, 
clustering, cleaning) and
-works alongside existing Parquet and ORC tables.
-
-**[Learn more about the Lance File Format &#8594;](lance_file_format.md)**
-
-## Why Hudi for AI Workloads?
-
-### Unified Storage for Structured + Unstructured Data
-
-Most AI pipelines today span multiple systems: a data warehouse for metadata, 
an object store for raw files,
-a vector database for embeddings, and custom glue code to keep them in sync. 
Hudi collapses this into a single
-table:
-
-```
-┌─────────────────────────────────────────────────┐
-│                  Hudi Table                     │
-│                                                 │
-│  image_id │ breed │ embedding    │ image_bytes  │
-│  (STRING) │(STRING)│(VECTOR(1024))│   (BLOB)    │
-│───────────┼───────┼──────────────┼──────────────│
-│  pet_001  │ Corgi │ [0.12, ...]  │ <137 KB PNG> │
-│  pet_002  │ Tabby │ [-.03, ...]  │ <89 KB JPEG> │
-└─────────────────────────────────────────────────┘
-```
-
-One table. One set of transactions. One schema. One set of access controls.
-
-### Incremental Processing for Embedding Pipelines
-
-When new data arrives, you do not need to re-embed your entire corpus. Hudi's 
incremental query capabilities
-let you process only new or changed records:
-
-```sql
--- Get only new images since the last embedding run
-SELECT * FROM hudi_table_changes('product_images', 'latest_state', 
'20260101000000');
-```
-
-This can reduce embedding pipeline costs by 10-100x compared to full 
reprocessing.
-
-### Transactional Guarantees
-
-Embedding updates, metadata changes, and raw data writes happen atomically. No 
more inconsistent states where
-your vector index points to deleted images or stale embeddings.
-
-### Table Services
-
-Hudi's background table services work on AI tables just like any other:
-
-- **Clustering** — co-locate similar vectors for better search locality
-- **Compaction** — merge incremental updates efficiently
-- **Cleaning** — reclaim storage from old versions
-- **Indexing** — maintain metadata indexes for fast lookups
-
-### Open Ecosystem
-
-Hudi tables are readable by Spark, Flink, Presto, Trino, and the native 
Python/Rust client (hudi-rs). Your AI
-tables are not locked into a single engine or vendor.
-
-## Use Cases
-
-| Use Case | Hudi Capabilities Used |
-|:---------|:----------------------|
-| **Image/Video Search** | VECTOR embeddings + BLOB storage + cosine 
similarity search |
-| **RAG (Retrieval-Augmented Generation)** | VECTOR search to retrieve 
relevant document chunks for LLM context |
-| **LLM Output Management** | VARIANT for flexible response storage, VECTOR 
for semantic indexing |
-| **Recommendation Systems** | VECTOR similarity for collaborative filtering, 
incremental re-embedding |
-| **Content Moderation** | BLOB for raw content + VECTOR for content 
embeddings + incremental processing |
-| **Multimodal Analytics** | Structured metadata + VECTOR embeddings + BLOB 
raw data in one table |
-| **ML Feature Store** | VECTOR for feature embeddings, VARIANT for sparse 
feature maps, time-travel for point-in-time retrieval |
-| **Experiment Tracking** | VARIANT for heterogeneous model configs and 
metrics, incremental queries for latest runs |
-| **Data Labeling Pipelines** | BLOB for raw data, incremental queries for 
unlabeled data, ACID for label updates |
-
-## Getting Started
-
-The fastest way to try these features is the **[AI Quick Start 
Guide](ai-quick-start-guide.md)**, which walks
-you through an end-to-end image similarity search pipeline in under 30 minutes.
diff --git a/website/docs/blob_unstructured_data.md 
b/website/docs/blob_unstructured_data.md
deleted file mode 100644
index 253656565084..000000000000
--- a/website/docs/blob_unstructured_data.md
+++ /dev/null
@@ -1,350 +0,0 @@
----
-title: "Unstructured Data"
-keywords: [ hudi, blob, unstructured data, images, binary, pdf, audio, video, 
inline, out-of-line, read_blob]
-summary: "Store and query unstructured data (images, PDFs, audio, video) in 
Hudi tables using the BLOB type with inline or out-of-line storage"
-toc: true
-last_modified_at: 2026-05-27T00:00:00-00:00
----
-
-import Tabs from '@theme/Tabs';
-import TabItem from '@theme/TabItem';
-
-The `BLOB` type lets you store raw binary data — images, PDFs, audio clips, 
video files, model weights —
-directly in Hudi tables. Combined with Hudi's transactional guarantees and 
table services, BLOB makes
-the data lakehouse a single source of truth for both structured and 
unstructured data.
-
-## BLOB Type Overview
-
-A BLOB column stores binary data in one of two modes:
-
-| Mode | Storage | Table Footprint | Read Pattern |
-|:-----|:--------|:----------------|:-------------|
-| **Inline** | Raw bytes embedded in the table row | Larger (bytes stored in 
data files) | Direct read — no external fetch |
-| **Out-of-line** | Pointer to external storage location | Very small (< 1% of 
data size) | On-demand via `read_blob()` |
-
-Choose **inline** for small objects that are frequently read together 
(thumbnails, short audio clips,
-small documents). Choose **out-of-line** for large objects where you typically 
query metadata first and
-fetch raw data selectively (high-res images, video, model checkpoints).
-
-## Creating Tables with BLOB Columns
-
-<Tabs
-groupId="api-style"
-defaultValue="sql"
-values={[
-{ label: 'Spark SQL', value: 'sql', },
-{ label: 'DataFrame API', value: 'dataframe', },
-]}
->
-<TabItem value="sql">
-
-```sql
-CREATE TABLE media_assets (
-    asset_id    STRING,
-    file_name   STRING,
-    mime_type   STRING,
-    file_size   BIGINT,
-    content     BLOB
-) USING hudi
-TBLPROPERTIES (
-    primaryKey = 'asset_id',
-    type = 'cow'
-);
-```
-
-The `BLOB` keyword in DDL automatically configures the column with the correct 
internal structure.
-
-</TabItem>
-<TabItem value="dataframe">
-
-```python
-import pyarrow as pa
-
-schema = pa.schema([
-    pa.field("asset_id",  pa.string()),
-    pa.field("file_name", pa.string()),
-    pa.field("mime_type", pa.string()),
-    pa.field("file_size", pa.int64()),
-    pa.field("content",   pa.struct([
-        pa.field("type",      pa.string()),
-        pa.field("data",      pa.binary()),
-        pa.field("reference", pa.struct([
-            pa.field("external_path", pa.string()),
-            pa.field("offset",        pa.int64()),
-            pa.field("length",        pa.int64()),
-            pa.field("managed",       pa.bool_()),
-        ])),
-    ]), metadata={b"hudi_type": b"BLOB"}),
-])
-```
-
-The BLOB internal structure is a struct with three fields:
-- `type` — `"INLINE"` or `"OUT_OF_LINE"`
-- `data` — raw bytes (populated for inline, null for out-of-line)
-- `reference` — external storage pointer with subfields:
-  - `external_path` — file path for out-of-line data
-  - `offset` — byte offset in the file (null means read from start)
-  - `length` — byte length to read (null means read to end of file)
-  - `managed` — boolean. Only meaningful for `OUT_OF_LINE` blobs. Marks 
whether Hudi owns the lifecycle of the referenced external file. **Not consumed 
by the cleaner yet** — set the value to record intent, and a future cleaner 
implementation will use it: `true` → cleaner may delete the external file when 
the blob row is no longer referenced; `false` → cleaner will leave the external 
file in place.
-
-</TabItem>
-</Tabs>
-
-## Writing Inline BLOBs
-
-Inline BLOBs embed the raw bytes directly in the table row.
-
-<Tabs
-groupId="api-style"
-defaultValue="sql"
-values={[
-{ label: 'Spark SQL', value: 'sql', },
-{ label: 'DataFrame API', value: 'dataframe', },
-]}
->
-<TabItem value="sql">
-
-```sql
-INSERT INTO media_assets VALUES (
-    'asset_001',
-    'logo.png',
-    'image/png',
-    45230,
-    named_struct(
-        'type',      'INLINE',
-        'data',      /* binary literal or column reference */,
-        'reference', CAST(NULL AS STRUCT<external_path: STRING, offset: 
BIGINT, length: BIGINT, managed: BOOLEAN>)
-    )
-);
-```
-
-</TabItem>
-<TabItem value="dataframe">
-
-```python
-from pyspark.sql import Row
-from pyspark.sql.types import *
-
-# Read raw bytes
-with open("logo.png", "rb") as f:
-    raw_bytes = f.read()
-
-row = Row(
-    asset_id="asset_001",
-    file_name="logo.png",
-    mime_type="image/png",
-    file_size=len(raw_bytes),
-    content=Row(type="INLINE", data=raw_bytes, reference=None)
-)
-```
-
-</TabItem>
-</Tabs>
-
-## Writing Out-of-Line BLOBs
-
-Out-of-line BLOBs store a pointer to data in external storage. The actual 
bytes live elsewhere (e.g.,
-a binary container file on S3, a separate object store path, or a shared 
filesystem). The table only
-stores the reference metadata.
-
-```sql
-INSERT INTO media_assets VALUES (
-    'asset_002',
-    'video.mp4',
-    'video/mp4',
-    1073741824,  -- 1 GB
-    named_struct(
-        'type',      'OUT_OF_LINE',
-        'data',      CAST(NULL AS BINARY),
-        'reference', named_struct(
-            'external_path', 's3://my-bucket/media/container_001.bin',
-            'offset',        8388608,       -- byte offset in the container
-            'length',        1073741824,    -- number of bytes
-            'managed',       false          -- intent flag; not consumed by 
the cleaner yet
-        )
-    )
-);
-```
-
-### Container File Pattern
-
-A common pattern for out-of-line storage is to pack multiple objects into a 
single binary container file:
-
-```
-container_001.bin
-├── [offset=0,       len=45230]   → logo.png
-├── [offset=45230,   len=89012]   → photo.jpg
-├── [offset=134242,  len=1073741824] → video.mp4
-└── ...
-```
-
-Each BLOB row stores the `(external_path, offset, length)` triple. This avoids 
creating millions of
-small files on object storage and enables efficient batch access.
-
-## Reading BLOBs
-
-### Querying Metadata (No Fetch)
-
-Standard queries on a BLOB column return the descriptor — not the raw bytes:
-
-```sql
-SELECT asset_id, file_name, content.type, content.reference.external_path
-FROM media_assets;
-```
-
-```
-+----------+-----------+-------------+----------------------------------------+
-|  asset_id|  file_name|         type|                           external_path|
-+----------+-----------+-------------+----------------------------------------+
-| asset_001|   logo.png|       INLINE|                                    null|
-| asset_002|  video.mp4|  OUT_OF_LINE| s3://my-bucket/media/container_001.bin |
-+----------+-----------+-------------+----------------------------------------+
-```
-
-This is fast and lightweight — no binary data is transferred.
-
-### Resolving Raw Bytes with read_blob()
-
-Use the `read_blob()` SQL function to materialize the actual bytes:
-
-```sql
--- Returns raw binary data
-SELECT asset_id, read_blob(content) AS raw_bytes
-FROM media_assets
-WHERE asset_id = 'asset_001';
-```
-
-For **inline** BLOBs, `read_blob()` simply extracts the embedded bytes.
-
-For **out-of-line** BLOBs, `read_blob()` reads from the external path at the 
specified offset and length,
-transparently fetching the data on demand.
-
-:::tip
-Use `read_blob()` selectively — filter first, then resolve. Avoid `SELECT 
read_blob(content) FROM large_table`
-without a WHERE clause, as this will fetch all raw data.
-:::
-
-## Use Cases
-
-### Image Datasets for Computer Vision
-
-Store training images alongside metadata and embeddings:
-
-```sql
-CREATE TABLE training_images (
-    image_id   STRING,
-    label      STRING,
-    split      STRING,           -- 'train', 'val', 'test'
-    embedding  VECTOR(1024),
-    raw_image  BLOB
-) USING hudi TBLPROPERTIES (...);
-
--- Get raw images for a specific label
-SELECT image_id, read_blob(raw_image) AS pixels
-FROM training_images
-WHERE label = 'cat' AND split = 'train';
-```
-
-### Document Store for RAG Pipelines
-
-Store PDF documents alongside their chunk embeddings:
-
-```sql
-CREATE TABLE knowledge_base (
-    doc_id     STRING,
-    chunk_id   STRING,
-    source_url STRING,
-    text       STRING,
-    embedding  VECTOR(1536),
-    original   BLOB              -- original PDF bytes
-) USING hudi TBLPROPERTIES (...);
-
--- Retrieve full document after vector search
-SELECT doc_id, source_url, read_blob(original) AS pdf_bytes
-FROM knowledge_base
-WHERE doc_id IN (SELECT doc_id FROM top_matches);
-```
-
-### Audio/Video Processing Pipelines
-
-```sql
-CREATE TABLE audio_clips (
-    clip_id    STRING,
-    transcript STRING,
-    duration   DOUBLE,
-    embedding  VECTOR(512),
-    audio      BLOB
-) USING hudi TBLPROPERTIES (...);
-```
-
-## Storage Efficiency
-
-Out-of-line BLOBs keep the Hudi table footprint extremely small:
-
-| Metric | Inline | Out-of-Line |
-|:-------|:-------|:------------|
-| Table size vs. raw data | ~100% | < 1% |
-| Query metadata without fetch | Requires reading data files | Only reads 
pointer columns |
-| Random access to raw data | Read full row | Seek to (offset, length) |
-| Best for object size | < 1 MB | > 1 MB |
-
-## Configuration Reference
-
-| Property | Default | Description |
-|:---------|:--------|:------------|
-| `hoodie.read.blob.inline.mode` | `DESCRIPTOR` | Controls how INLINE BLOBs 
are read. `DESCRIPTOR` (default) returns an out-of-line-shaped reference 
pointing at the in-file coordinates of the bytes — no bytes are materialized. 
`CONTENT` materializes the raw inline bytes directly in the `data` field on 
every read. |
-| `hoodie.blob.batching.max.gap.bytes` | `4096` | Maximum gap (in bytes) 
between consecutive byte ranges before they are merged into a single read. 
Larger values reduce I/O calls at the cost of reading some unused bytes. |
-| `hoodie.blob.batching.lookahead.size` | `50` | Number of rows to buffer for 
batch read detection. Larger values improve batching for sorted data but 
increase memory usage. |
-
-:::note
-`DESCRIPTOR` mode is the default for all storage formats including Lance. 
`CONTENT` mode is always
-used for internal operations (compaction, merge, log replay) regardless of 
this setting.
-:::
-
-:::caution Calling read_blob() on INLINE columns under DESCRIPTOR mode
-Under the default `DESCRIPTOR` mode, calling `read_blob()` on an INLINE BLOB 
column **throws** —
-the raw bytes are not materialized in the scan, so there is nothing for 
`read_blob()` to return.
-To read inline bytes with `read_blob()`, switch to `CONTENT` mode first:
-
-```sql
-SET hoodie.read.blob.inline.mode=CONTENT;
-SELECT asset_id, read_blob(content) AS raw_bytes
-FROM media_assets
-WHERE asset_id = 'asset_001';
-```
-
-This setting affects only INLINE columns — OUT_OF_LINE columns always fetch 
from the external path
-regardless of mode.
-:::
-
-## Metastore Sync
-
-When syncing BLOB column schemas to Hive or BigQuery, Hudi maps the BLOB 
struct to the target
-catalog's native struct type:
-
-| Catalog | BLOB representation |
-|:--------|:-------------------|
-| Hive | `STRUCT<type:STRING, data:BINARY, 
reference:STRUCT<external_path:STRING, offset:BIGINT, length:BIGINT, 
managed:BOOLEAN>>` |
-| BigQuery | Equivalent `STRUCT` fields |
-
-The raw binary payload is preserved in the struct representation, but 
`read_blob()` is a Spark SQL
-function and is not available in Hive or BigQuery directly.
-
-## Best Practices
-
-1. **Choose the right mode** — Use inline for small, frequently-accessed 
objects. Use out-of-line for
-   anything over 1 MB.
-
-2. **Filter before resolving** — Always apply WHERE predicates before calling 
`read_blob()` to avoid
-   unnecessary data transfer.
-
-3. **Batch container files** — When using out-of-line mode, pack multiple 
objects into container files
-   rather than storing one file per object.
-
-4. **Combine with VECTOR** — Pair BLOB columns with VECTOR columns for 
powerful "search then retrieve"
-   workflows: vector search narrows candidates, then `read_blob()` fetches 
just the winners.
-
-5. **Use incremental queries** — Process only new BLOBs by leveraging Hudi's 
incremental query support:
-   ```sql
-   SELECT * FROM hudi_table_changes('media_assets', 'latest_state', 
'20260401000000');
-   ```
diff --git a/website/docs/ingestion_flink.md b/website/docs/ingestion_flink.md
index 2e534e286d67..774b8abe206d 100644
--- a/website/docs/ingestion_flink.md
+++ b/website/docs/ingestion_flink.md
@@ -544,6 +544,26 @@ SELECT event_id, _hoodie_commit_time, payload FROM events;
 Only `VIRTUAL` metadata columns are supported. All valid virtual columns 
correspond to Hudi's built-in meta fields (`_hoodie_commit_time`, 
`_hoodie_commit_seqno`, `_hoodie_record_key`, `_hoodie_partition_path`, 
`_hoodie_file_name`, `_hoodie_operation`).
 :::
 
+## Engine constraints for types
+
+Flink behavior of the new column types ([`VECTOR`](sql_ddl.md#vector),
+[`BLOB`](sql_ddl.md#blob), [`VARIANT`](sql_ddl.md#variant)) differs from Spark 
in a few places:
+
+`VECTOR` columns are stored as Parquet `FIXED_LEN_BYTE_ARRAY`, which Flink's 
Parquet reader does
+not convert back into a typed array. The other columns of the same table read 
fine; only the
+`VECTOR` column itself is inaccessible from Flink. Use Spark to query `VECTOR` 
columns.
+
+Native `VARIANT` operations require Flink 2.1+. Flink &lt; 2.1 throws
+`UnsupportedOperationException` on VARIANT columns. On Flink 2.1+, VARIANT 
columns surface as
+`ROW<metadata BYTES, value BYTES>`. Flink can read the underlying struct but 
cannot decode it as a
+variant value.
+
+`read_blob()` is a Spark SQL function. From Flink, queries on a `BLOB` column 
return the underlying
+struct directly.
+
+The Lance base file format is Spark-only. Reading a Lance-backed table from 
Flink throws
+`HoodieValidationException`; see [Storage Layouts → 
Lance](storage_layouts.md#lance-base-file-format).
+
 ## Advanced Options
 
 ### Hadoop Configuration Pass-through
diff --git a/website/docs/lance_file_format.md 
b/website/docs/lance_file_format.md
deleted file mode 100644
index 9754a4e31797..000000000000
--- a/website/docs/lance_file_format.md
+++ /dev/null
@@ -1,220 +0,0 @@
----
-title: "Lance File Format"
-keywords: [ hudi, lance, file format, vector, AI, ML, columnar, ANN, indexing]
-summary: "Use the Lance columnar file format with Hudi for vector-friendly 
storage and efficient ML workloads"
-toc: true
-last_modified_at: 2026-05-27T00:00:00-00:00
----
-
-[Lance](https://lancedb.github.io/lance/) is a modern columnar data format 
designed for AI and machine learning
-workloads. Hudi's pluggable storage architecture lets you use Lance as the 
base file format alongside Parquet
-and ORC, unlocking vector indexing, fast random access, and optimized 
high-dimensional array storage.
-
-:::caution Engine Support
-Lance file format support is **Spark-only**. Attempting to read a Lance-backed 
table from Flink or Hive throws a
-`HoodieValidationException`:
-> Lance base file format is currently only supported with the Spark engine. 
Please use Parquet, ORC, or HFile
-> for non-Spark engines (Flink, Hive, Presto, Trino).
-
-The Lance JAR is **not bundled** in the Hudi distribution — you must add it to 
your Spark classpath
-(see [Required Dependencies](#required-dependencies)).
-:::
-
-## Enabling Lance in Hudi
-
-### Table Creation (COW)
-
-Set the base file format to `lance` in table properties:
-
-```sql
-CREATE TABLE my_ai_table (
-    id        STRING,
-    embedding VECTOR(768),
-    metadata  STRING
-) USING hudi
-TBLPROPERTIES (
-    primaryKey = 'id',
-    type = 'cow',
-    hoodie.record.merger.impls = 'org.apache.hudi.DefaultSparkRecordMerger',
-    hoodie.table.base.file.format = 'lance'
-);
-```
-
-### Table Creation (MOR)
-
-Lance base files work with MOR tables — Lance files act as base files while 
Avro log files capture
-incremental changes. Log compaction merges the delta log back into Lance base 
files.
-
-```sql
-CREATE TABLE my_ai_table_mor (
-    id        STRING,
-    embedding VECTOR(768),
-    metadata  STRING
-) USING hudi
-TBLPROPERTIES (
-    primaryKey = 'id',
-    type = 'mor',
-    hoodie.record.merger.impls = 'org.apache.hudi.DefaultSparkRecordMerger',
-    hoodie.table.base.file.format = 'lance'
-);
-```
-
-### DataFrame API
-
-```python
-(df.write
-   .format("hudi")
-   .option("hoodie.table.name", "my_ai_table")
-   .option("hoodie.datasource.write.recordkey.field", "id")
-   .option("hoodie.record.merger.impls",
-           "org.apache.hudi.DefaultSparkRecordMerger")
-   .option("hoodie.table.base.file.format", "lance")
-   .mode("overwrite")
-   .save("/path/to/my_ai_table"))
-```
-
-### Required Dependencies
-
-The Lance JAR is not bundled in Hudi. Add the appropriate Lance Spark bundle 
JAR to your Spark classpath:
-
-| Spark Version | Bundle JAR (Maven Central) |
-|:--------------|:---------------------------|
-| Spark 3.4 | `org.lance:lance-spark-bundle-3.4_2.12:0.4.0` |
-| Spark 3.5 | `org.lance:lance-spark-bundle-3.5_2.12:0.4.0` |
-| Spark 4.0 | `org.lance:lance-spark-bundle-4.0_2.13:0.4.0` |
-| Spark 4.1 | `org.lance:lance-spark-bundle-4.1_2.13:0.4.0` |
-
-```bash
-export LANCE_BUNDLE_JAR=/path/to/lance-spark-bundle-3.5_2.12-0.4.0.jar
-
-# Include both Hudi and Lance JARs
-spark-shell --jars $HUDI_BUNDLE_JAR,$LANCE_BUNDLE_JAR
-```
-
-## How Hudi + Lance Work Together
-
-Hudi manages the table layer — transactions, schema, timeline, table services 
— while Lance handles the
-file-level storage:
-
-```
-┌───────────────────────────────────┐
-│         Hudi Table Layer          │
-│  Timeline, Metadata, Indexing     │
-│  Transactions, Schema Evolution   │
-├───────────────────────────────────┤
-│     File Group / File Slice       │
-│  (same Hudi concepts as Parquet)  │
-├───────────────────────────────────┤
-│     Lance Data Files (.lance)     │
-│  Columnar storage                 │
-│  Fragment-based layout            │
-├───────────────────────────────────┤
-│   Storage (S3, GCS, HDFS, FS)    │
-└───────────────────────────────────┘
-```
-
-All Hudi table services work with Lance-backed tables:
-
-- **Compaction** — merges log files into Lance base files
-- **Clustering** — reorganizes Lance files for better data locality
-- **Cleaning** — removes old Lance file versions
-- **Metadata indexing** — bloom filters work across Lance files; column stats 
and partition stats are
-  **automatically disabled** for Lance tables
-
-## VECTOR Storage on Lance
-
-VECTOR columns are stored natively in Lance as `FixedSizeList<Float32/Float64, 
dim>` — Lance's own
-vector column encoding, so embeddings are written without conversion overhead 
at the file-format
-layer.
-
-Only **FLOAT** and **DOUBLE** element types are supported as VECTOR columns on 
Lance. INT8 vectors
-are not yet supported and will fail fast at write time.
-
-See [Vector Search](vector_search.md) for the `hudi_vector_search` TVF that 
queries VECTOR columns.
-
-## BLOB Columns on Lance
-
-INLINE BLOB columns on Lance default to `DESCRIPTOR` read mode — standard 
queries return an
-out-of-line-shaped reference descriptor rather than materializing the raw 
bytes. To read inline
-byte content via `read_blob()`, set `hoodie.read.blob.inline.mode=CONTENT`. See
-[Unstructured Data](blob_unstructured_data.md) for full documentation.
-
-## Schema Evolution
-
-Lance supports the following schema changes at the Hudi layer:
-
-| Operation | Supported? |
-|:----------|:-----------|
-| Add column | Yes |
-| Rename column | Yes (via Hudi schema evolution) |
-| Promote `FLOAT` → `DOUBLE` | **No** — not supported on Lance |
-| Promote `FLOAT` → `STRING` | **No** — not supported on Lance |
-| Drop column | Yes |
-
-:::caution
-`FLOAT → DOUBLE` and `FLOAT → STRING` type promotions are supported for 
Parquet tables but **not**
-for Lance. Attempting these on a Lance table will fail. Use `DOUBLE` from the 
start if you anticipate
-needing higher precision.
-:::
-
-## Vector Search with Lance
-
-Use the `hudi_vector_search` TVF to run vector similarity queries against 
VECTOR columns on a
-Lance-backed table:
-
-```sql
-SELECT id, metadata, _hudi_distance
-FROM hudi_vector_search(
-    'my_ai_table', 'embedding',
-    ARRAY(0.1, 0.2, ...),  -- query vector
-    10, 'cosine'
-)
-ORDER BY _hudi_distance;
-```
-
-See [Vector Search](vector_search.md) for full documentation on the TVF and 
distance metrics.
-
-## Configuration Reference
-
-| Property | Default | Description |
-|:---------|:--------|:------------|
-| `hoodie.table.base.file.format` | `parquet` | Set to `lance` to use Lance as 
the base file format. |
-| `hoodie.record.merger.impls` | — | Must be 
`org.apache.hudi.DefaultSparkRecordMerger` for Lance. |
-| `hoodie.lance.max.file.size` | `125829120` (120 MiB) | Target file size in 
bytes for Lance base files. |
-| `hoodie.lance.write.allocator.size.bytes` | `268435456` (256 MiB) | Maximum 
size of the Arrow child allocator used for buffering in-flight batch data. 
Increase for tables with very large BLOB columns. |
-| `hoodie.lance.write.flush.byte.watermark` | `100663296` (96 MiB) | Byte-size 
threshold at which the current write batch is flushed. Must be less than 
`hoodie.lance.write.allocator.size.bytes`. |
-
-### File Sizing and Memory
-
-The three sizing configs work together:
-
-- **`hoodie.lance.max.file.size`** controls when Hudi rolls over to a new 
Lance file, similar to
-  `hoodie.parquet.max.file.size` for Parquet tables.
-- **`hoodie.lance.write.allocator.size.bytes`** caps the Arrow allocator's 
in-flight memory. Arrow
-  uses power-of-2 buffer doubling; the default 256 MiB accommodates the 128 
MiB doubling step with
-  headroom.
-- **`hoodie.lance.write.flush.byte.watermark`** triggers an early batch flush 
when Arrow buffers
-  approach the cap. The default 96 MiB (≈ 3/8 of the allocator cap) leaves 
room for offset and
-  validity buffers to double without exceeding the allocator limit.
-
-For tables with large BLOB columns, increase both 
`hoodie.lance.write.allocator.size.bytes` and
-`hoodie.lance.write.flush.byte.watermark` proportionally (keep watermark at 
roughly 3/8 of allocator
-size).
-
-## Additional Notes
-
-- **`populateMetaFields=false`** is supported. User-defined key generators 
work normally with Lance
-  tables.
-- **Complex types** (struct, array, map) are supported as Lance columns.
-- **VARIANT columns** are **not supported** on Lance. Attempting to write a 
table with VARIANT columns
-  to Lance throws a `HoodieNotSupportedException`. Use Parquet for tables with 
VARIANT columns.
-
-## Mixed-Format Tables
-
-Hudi does not require all tables in a lakehouse to use the same file format. 
You can have:
-
-- **Analytical tables** on Parquet — for traditional BI/SQL workloads
-- **AI tables** on Lance — for embeddings, vector search, and ML feature 
storage
-- **Archival tables** on ORC — for compressed long-term storage
-
-All share the same Hudi catalog, metadata, and tooling.
diff --git a/website/docs/overview.mdx b/website/docs/overview.mdx
index ed439cef38a1..cc8c7707927b 100644
--- a/website/docs/overview.mdx
+++ b/website/docs/overview.mdx
@@ -51,21 +51,10 @@ Sometimes the fastest way to learn is by doing. Try out 
these Quick Start resour
 - [Spark Quick Start Guide](quick-start-guide.md) – if you primarily use 
Apache Spark
 - [Flink Quick Start Guide](flink-quick-start-guide.md) – if you primarily use 
Apache Flink
 - [Python/Rust Quick Start Guide (Hudi-rs)](python-rust-quick-start-guide.md) 
- if you primarily use Python or Rust
-- [AI Quick Start Guide](ai-quick-start-guide.md) – if you are building AI/ML 
pipelines with embeddings, vector search, or unstructured data
+- [Unstructured Data Quick Start 
Guide](unstructured-data-quick-start-guide.md) – store embeddings (`VECTOR`) 
and raw bytes (`BLOB`) and run similarity search with `hudi_vector_search`
 
 If you want to experience Apache Hudi integrated into an end to end demo with 
Kafka, Spark, Hive, Presto, etc, try out the [Docker Demo](docker_demo.md)
 
-## AI-Native Capabilities
-
-Hudi brings first-class support for AI and unstructured data workloads to the 
data lakehouse:
-
-- **[VECTOR type & Similarity Search](vector_search.md)** — Store embeddings 
and run vector similarity search directly in Spark SQL
-- **[BLOB type for Unstructured Data](blob_unstructured_data.md)** — Store 
images, PDFs, audio, and other binary data with inline or out-of-line storage
-- **[VARIANT type for Semi-Structured Data](variant_type.md)** — Store 
flexible JSON-like data (LLM outputs, model metadata, feature maps) without 
rigid schemas
-- **[Lance File Format](lance_file_format.md)** — Vector-friendly columnar 
format for AI/ML workloads
-
-See the full [AI-Native Lakehouse Overview](ai_overview.md) for use cases and 
architecture.
-
 ## Connect With The Community
 Apache Hudi is community-focused and community-led and welcomes new-comers 
with open arms. Leverage the following
 resources to learn more, engage, and get help as you get started.
diff --git a/website/docs/sql_ddl.md b/website/docs/sql_ddl.md
index 2fc0d421dab2..3792f3248b4d 100644
--- a/website/docs/sql_ddl.md
+++ b/website/docs/sql_ddl.md
@@ -146,6 +146,183 @@ TBLPROPERTIES (
 LOCATION 'file:///tmp/hudi_table_merge_mode_custom/';
 ```
 
+### Create table with unstructured and semi-structured column types 
{#create-table-with-unstructured-and-semi-structured-column-types}
+
+Hudi supports three column types for unstructured and semi-structured data, 
plus a Lance base file
+format.
+
+#### `VECTOR(dim[, elementType])` {#vector}
+
+A fixed-dimension embedding column. `elementType` is one of `FLOAT` (default), 
`DOUBLE`, or `INT8`
+(alias `BYTE`):
+
+| Element type | Storage |
+|:-------------|:--------|
+| `FLOAT` (default) | `ArrayType(FloatType)` |
+| `DOUBLE` | `ArrayType(DoubleType)` |
+| `INT8` / `BYTE` | `ArrayType(ByteType)` |
+
+```sql
+CREATE TABLE products (
+    product_id   STRING,
+    name         STRING,
+    embedding    VECTOR(768)          -- defaults to FLOAT
+) USING hudi
+TBLPROPERTIES (
+    primaryKey = 'product_id',
+    type = 'cow',
+    hoodie.record.merger.impls = 'org.apache.hudi.DefaultSparkRecordMerger'
+);
+
+-- Other element types
+-- embedding VECTOR(768, FLOAT)
+-- embedding VECTOR(768, DOUBLE)
+-- embedding VECTOR(256, INT8)
+```
+
+Hudi's SQL parser normalizes `VECTOR(128, FLOAT)` to `VECTOR(128)`.
+
+VECTOR columns must be **top-level fields**; nesting inside `STRUCT`, `ARRAY`, 
or `MAP` is not
+supported. Dimension and element type cannot be changed via schema evolution 
after table creation.
+
+**Internal storage.** In Parquet, a VECTOR column is stored as 
`FIXED_LEN_BYTE_ARRAY` with
+`hudi_type=VECTOR(dim[,elementType])` field metadata so the Spark reader 
decodes the bytes back into
+a typed array. On Lance-backed tables, VECTOR is stored natively as
+`FixedSizeList<Float32/Float64, dim>`; only `FLOAT` and `DOUBLE` element types 
are accepted on Lance.
+
+Query VECTOR columns with the [`hudi_vector_search` 
TVF](sql_queries.md#vector-similarity-search).
+
+**Engine constraints.** Flink cannot decode VECTOR columns (the underlying 
Parquet
+`FIXED_LEN_BYTE_ARRAY` is not converted back into a typed array); Flink can 
still read other columns
+in a table that contains a VECTOR column.
+
+#### `BLOB` {#blob}
+
+A binary column with two storage modes:
+
+| Mode | Storage | Read pattern |
+|:-----|:--------|:-------------|
+| `INLINE` | Raw bytes embedded in the table row | Direct read; no external 
fetch |
+| `OUT_OF_LINE` | Pointer to a byte range in an external file | On-demand via 
[`read_blob()`](sql_queries.md#reading-blob-columns) |
+
+```sql
+CREATE TABLE media_assets (
+    asset_id    STRING,
+    file_name   STRING,
+    mime_type   STRING,
+    file_size   BIGINT,
+    content     BLOB
+) USING hudi
+TBLPROPERTIES (
+    primaryKey = 'asset_id',
+    type = 'cow'
+);
+```
+
+**Internal struct.** A BLOB column is represented internally as:
+
+```
+STRUCT<
+  type:      STRING,        -- 'INLINE' or 'OUT_OF_LINE'
+  data:      BINARY,        -- raw bytes for INLINE; null for OUT_OF_LINE
+  reference: STRUCT<
+    external_path: STRING,  -- file path for OUT_OF_LINE
+    offset:        BIGINT,  -- byte offset; null = start of file
+    length:        BIGINT,  -- byte length; null = read to end
+    managed:       BOOLEAN  -- advisory flag for the (future) cleaner
+  >
+>
+```
+
+`managed` is currently advisory: `true` records the intent that Hudi owns the 
lifecycle of the
+referenced external file (so a future cleaner may delete it when no row 
references it), `false`
+records that the file is externally managed. The cleaner does not yet consume 
this field.
+
+BLOB columns are excluded from column-stats indexing.
+
+#### `VARIANT` {#variant}
+
+A semi-structured (JSON-like) column that stores any JSON-compatible value 
(object, array, string,
+number, boolean, null) as two binary fields:
+
+| Field | Description |
+|:------|:------------|
+| `metadata` | Encodes field names, types, and structure for efficient access |
+| `value` | The data payload |
+
+On Spark 4.0+, declare the native type:
+
+```sql
+CREATE TABLE events (
+    event_id  STRING,
+    payload   VARIANT,
+    ts        BIGINT
+) USING hudi
+TBLPROPERTIES (
+    primaryKey = 'event_id',
+    preCombineField = 'ts'
+);
+```
+
+Spark 3.x has no native `VariantType`. To read a VARIANT-bearing Hudi table 
written by Spark 4.x,
+declare the column as a binary struct in the `CREATE TABLE` DDL pointing at 
the table location:
+
+```sql
+CREATE TABLE events (
+    event_id  STRING,
+    payload   STRUCT<value: BINARY, metadata: BINARY>,
+    ts        BIGINT
+) USING hudi
+LOCATION '<existing-table-path>'
+TBLPROPERTIES (
+    primaryKey = 'event_id',
+    preCombineField = 'ts'
+);
+```
+
+Spark 3.x then returns the raw `metadata` and `value` bytes; it does not 
surface the column as a
+logical VARIANT, and reading a VARIANT table without this explicit DDL (i.e. 
letting Hudi
+auto-resolve the schema from commit metadata) throws `"VARIANT type is only 
supported in Spark
+4.0+"`.
+
+Engine support for `VARIANT`:
+
+| Engine | Behavior |
+|:-------|:---------|
+| Spark 4.0+ | Native `VariantType` for read/write/query on COW and MOR 
(CREATE TABLE with `VARIANT` or DataFrame writes with `VariantType`). |
+| Spark 4.1 | Same as Spark 4.0. Spark 4.1's `PushVariantIntoScan` rewrites 
VARIANT projections into struct-of-extractions; Hudi recognizes that shape and 
returns the column as a logical VARIANT. |
+| Spark 3.x | No native VARIANT. Backward-compat read of a Spark 4.x-written 
table requires the explicit binary-struct DDL above; raw bytes only. |
+| Flink &lt; 2.1 | Throws `UnsupportedOperationException` on VARIANT columns. |
+| Flink ≥ 2.1 | Surfaces VARIANT as `ROW<metadata BYTES, value BYTES>`. Flink 
can read the underlying struct but cannot decode it as a variant value. |
+
+VARIANT columns are not supported on Lance-backed tables.
+
+Only unshredded VARIANT is exposed at the user-facing level: every Spark/Hudi 
schema conversion
+produces an unshredded VARIANT (two binary fields, `metadata` and `value`). 
The shredded variant
+write path exists in the engine but has no DDL, table-property, or 
session-config to enable it.
+
+#### Lance base file format
+
+Set the base file format per-table:
+
+```sql
+CREATE TABLE my_ai_table (
+    id        STRING,
+    embedding VECTOR(768),
+    metadata  STRING
+) USING hudi
+TBLPROPERTIES (
+    primaryKey = 'id',
+    type = 'cow',
+    hoodie.record.merger.impls = 'org.apache.hudi.DefaultSparkRecordMerger',
+    hoodie.table.base.file.format = 'lance'
+);
+```
+
+Lance also works with MOR tables: Lance files act as base files while Avro log 
files capture
+incremental changes. See [Storage Layouts → 
Lance](storage_layouts.md#lance-base-file-format) for
+configuration, dependencies, and behavior.
+
 ### Create table from an external location
 Often, Hudi tables are created from streaming writers like the [streamer 
tool](hoodie_streaming_ingestion.md#hudi-streamer), which
 may later need some SQL statements to run on them. You can create an External 
table using the `location` statement.
@@ -1003,20 +1180,11 @@ WITH (
 | array         |  array       |               |
 | map           |  map         |               |
 | struct        |  struct      |               |
+| `ArrayType(<elementType>)` with field metadata `hudi_type=VECTOR(dim[, 
elementType])` | VECTOR | Fixed-dimension embedding column. Element type is 
`FloatType`, `DoubleType`, or `ByteType` (INT8). See [Create table with 
unstructured and semi-structured column 
types](#create-table-with-unstructured-and-semi-structured-column-types). |
+| `StructType(type STRING, data BINARY, reference STRUCT<…>)` with field 
metadata `hudi_type=BLOB` | BLOB | Binary column with `INLINE` / `OUT_OF_LINE` 
storage. See [Create table with unstructured and semi-structured column 
types](#create-table-with-unstructured-and-semi-structured-column-types). |
+| `VariantType` (Spark 4.0+) or `StructType(metadata BINARY NOT NULL, value 
BINARY NOT NULL)` with field metadata `hudi_type=VARIANT` (Spark 3.x) | VARIANT 
| Semi-structured (JSON-like) column (unshredded). See [Create table with 
unstructured and semi-structured column 
types](#create-table-with-unstructured-and-semi-structured-column-types). |
 | char          |              | not supported |
 | varchar       |              | not supported |
 | numeric       |              | not supported |
 | null          |              | not supported |
 | object        |              | not supported |
-
-### AI and Unstructured Data Types
-
-Hudi 1.2.0 introduces two additional column types for AI and unstructured data 
workloads:
-
-- **`VECTOR(dim[, elementType])`** — stores fixed-dimension embedding vectors 
(e.g. `VECTOR(768)`,
-  `VECTOR(768, FLOAT)`, `VECTOR(768, DOUBLE)`). Enables approximate 
nearest-neighbor search via
-  the `hudi_vector_search` TVF. See [Vector Search](vector_search.md) for full 
details.
-
-- **`BLOB`** — stores arbitrary binary objects (images, audio, documents) 
either inline within the
-  base file or as external references. See [BLOB / Unstructured 
Data](blob_unstructured_data.md)
-  for the storage modes, DDL syntax, and read APIs.
diff --git a/website/docs/sql_dml.md b/website/docs/sql_dml.md
index d3caf5511fa2..43e38e813971 100644
--- a/website/docs/sql_dml.md
+++ b/website/docs/sql_dml.md
@@ -48,6 +48,81 @@ INSERT INTO hudi_cow_pt_tbl PARTITION(dt = '2021-12-09', 
hh='11') SELECT 2, 'a2'
 INSERT INTO hudi_cow_pt_tbl PARTITION(dt, hh) SELECT 1 AS id, 'a1' AS name, 
1000 AS ts, '2021-12-09' AS dt, '10' AS hh;
 ```
 
+#### Inserting VECTOR columns
+
+A [`VECTOR(dim[, elementType])`](sql_ddl.md#vector) column accepts an `ARRAY` 
whose length and
+element type match the declared dimension and element type:
+
+```sql
+INSERT INTO products VALUES (
+    'prod_001', 'Running Shoes', 'Lightweight trail runner',
+    ARRAY(0.123, -0.456, 0.789, /* ... 768 floats ... */)
+);
+```
+
+The query vector's element type must exactly match the corpus embedding's 
element type (no implicit
+casting).
+
+#### Inserting BLOB columns
+
+A [`BLOB`](sql_ddl.md#blob) column is internally a struct with `type` 
(`INLINE`/`OUT_OF_LINE`),
+`data`, and a `reference` pointer struct. Construct it via `named_struct`.
+
+**Inline** — bytes embedded directly in the row:
+
+```sql
+INSERT INTO media_assets VALUES (
+    'asset_001', 'logo.png', 'image/png', 45230,
+    named_struct(
+        'type',      'INLINE',
+        'data',      /* binary literal or column reference */,
+        'reference', CAST(NULL AS STRUCT<external_path: STRING, offset: 
BIGINT, length: BIGINT, managed: BOOLEAN>)
+    )
+);
+```
+
+**Out-of-line** — pointer to a byte range in an external file:
+
+```sql
+INSERT INTO media_assets VALUES (
+    'asset_002', 'video.mp4', 'video/mp4', 1073741824,
+    named_struct(
+        'type',      'OUT_OF_LINE',
+        'data',      CAST(NULL AS BINARY),
+        'reference', named_struct(
+            'external_path', 's3://my-bucket/media/container_001.bin',
+            'offset',        8388608,
+            'length',        1073741824,
+            'managed',       false
+        )
+    )
+);
+```
+
+A common pattern for `OUT_OF_LINE` storage is to pack multiple objects into a 
single binary
+container file and have each BLOB row store the `(external_path, offset, 
length)` triple into that
+container:
+
+```
+container_001.bin
+├── [offset=0,       len=45230]      → logo.png
+├── [offset=45230,   len=89012]      → photo.jpg
+├── [offset=134242,  len=1073741824] → video.mp4
+└── ...
+```
+
+#### Inserting VARIANT columns
+
+A [`VARIANT`](sql_ddl.md#variant) column accepts any JSON-compatible value. On 
Spark 4.0+, build
+VARIANT values with `parse_json`:
+
+```sql
+INSERT INTO events VALUES
+    ('evt_001', parse_json('{"action": "click", "x": 120, "y": 450}'), 1000),
+    ('evt_002', parse_json('{"action": "purchase", "items": ["sku_a", 
"sku_b"], "total": 59.99}'), 1001),
+    ('evt_003', parse_json('"simple string value"'), 1002);
+```
+
 :::note Mapping to write operations
 Hudi offers flexibility in choosing the underlying [write 
operation](write_operations.md) of a `INSERT INTO` statement using 
 the `hoodie.spark.sql.insert.into.operation` configuration. Possible options 
include *"bulk_insert"* (large inserts), *"insert"* (with small file 
management), 
diff --git a/website/docs/sql_queries.md b/website/docs/sql_queries.md
index 4b9d6c2dd695..efa067c77039 100644
--- a/website/docs/sql_queries.md
+++ b/website/docs/sql_queries.md
@@ -352,16 +352,186 @@ time to completion time depending on the source table 
version.
 
 ### Vector Similarity Search
 
-Hudi 1.2.0 introduces a `hudi_vector_search` table-valued function (TVF) for 
approximate
-nearest-neighbor (ANN) search over `VECTOR` columns. This is an extension of 
the
-`hudi_table_changes` TVF pattern.
+The `hudi_vector_search` table-valued function (TVF) runs top-K similarity 
search over a
+[`VECTOR`](sql_ddl.md#vector) column. It returns the `top_k` rows whose VECTOR 
column is closest to
+a query vector under the chosen distance metric.
 
 ```sql
--- Find the 10 nearest neighbors to a query vector in the 'embedding' column
-SELECT * FROM hudi_vector_search('db.embeddings_table', 'embedding', 
ARRAY(0.1, 0.2, ...), 10);
+SELECT *
+FROM hudi_vector_search(
+    table_name,        -- STRING: registered table name or path
+    vector_column,     -- STRING: VECTOR column name
+    query_vector,      -- ARRAY: query embedding
+    top_k,             -- INT: number of nearest neighbors
+    [distance_metric], -- STRING: 'cosine' (default), 'l2', 'dot_product'
+    [algorithm]        -- STRING: 'brute_force' (default)
+)
+```
+
+Parameters:
+
+| Parameter | Type | Default | Description |
+|:----------|:-----|:--------|:------------|
+| `table_name` | STRING | (required) | Registered table name or table path. |
+| `vector_column` | STRING | (required) | Name of the VECTOR column. |
+| `query_vector` | ARRAY&lt;FLOAT&gt; | (required) | Query embedding; must 
match the column's dimension and element type. |
+| `top_k` | INT | (required) | Number of nearest neighbors. |
+| `distance_metric` | STRING | `'cosine'` | One of `'cosine'`, `'l2'`, 
`'dot_product'`. |
+| `algorithm` | STRING | `'brute_force'` | Only `'brute_force'` is currently 
supported. |
+
+Return schema: all columns from the source table (excluding the embedding 
column) plus
+`_hudi_distance DOUBLE`. Results are ordered by `_hudi_distance` ascending, so 
the closest matches
+come first.
+
+Distance metrics:
+
+| Metric | Formula | Range | Notes |
+|:-------|:--------|:------|:------|
+| `cosine` | 1 − cos(a, b), clamped to [0, 2] | [0, 2] | Returns 1.0 for zero 
vectors. |
+| `l2` | sqrt(sum((a[i] − b[i])²)) | [0, +∞) | |
+| `dot_product` | −(a · b) | (−∞, +∞) | Negated so ascending sort surfaces the 
most similar rows first. |
+
+```sql
+-- Find the 10 nearest neighbors to a query embedding
+SELECT product_id, name, _hudi_distance
+FROM hudi_vector_search('products', 'embedding',
+                        ARRAY(0.12, -0.03, 0.87, /* ... */),
+                        10, 'cosine')
+ORDER BY _hudi_distance;
+```
+
+RAG context retrieval to apply a distance threshold to drop weak matches:
+
+```sql
+-- Retrieve the 5 most relevant document chunks for an LLM prompt
+SELECT chunk_id, text_content, _hudi_distance
+FROM hudi_vector_search(
+    'document_chunks', 'embedding',
+    ARRAY(/* embedding of the user's question */),
+    5, 'cosine'
+)
+WHERE _hudi_distance < 0.3;
+```
+
+Cross-modal search to query a corpus of image embeddings with a text embedding 
(e.g., CLIP):
+
+```sql
+SELECT image_id, caption, _hudi_distance
+FROM hudi_vector_search(
+    'image_catalog', 'clip_embedding',
+    ARRAY(/* text embedding from CLIP */),
+    20, 'cosine'
+);
+```
+
+The TVF also works on Lance-backed tables. Set 
`hoodie.table.base.file.format=lance` at table
+creation (see [Storage Layouts → 
Lance](storage_layouts.md#lance-base-file-format)); the call site
+is unchanged.
+
+:::note
+`cosine` distance computes `1 − cos(a, b)`. If embeddings are not 
L2-normalized before write,
+results reflect both vector direction and magnitude.
+:::
+
+#### `hudi_vector_search_batch`
+
+The batch variant runs many query vectors at once:
+
+```sql
+SELECT *
+FROM hudi_vector_search_batch(
+    corpus_table,           -- STRING
+    corpus_embedding_col,   -- STRING
+    query_table,            -- STRING: table holding query vectors
+    query_embedding_col,    -- STRING: VECTOR column in query table
+    top_k,                  -- INT: neighbors per query
+    [distance_metric],
+    [algorithm]
+)
+```
+
+Return schema: corpus columns + query columns + `_hudi_distance DOUBLE` +
+`_hudi_query_index LONG` (identifies the query that produced the row). If 
corpus and query share
+column names, query columns are prefixed with `_hudi_query_`.
+
+### Reading BLOB Columns
+
+The `read_blob()` SQL function materializes raw bytes from a 
[`BLOB`](sql_ddl.md#blob) column. It
+handles both storage modes uniformly: for `INLINE`, it extracts the embedded 
bytes; for
+`OUT_OF_LINE`, it reads `reference.length` bytes starting at 
`reference.offset` from
+`reference.external_path`.
+
+```sql
+SELECT asset_id, read_blob(content) AS raw_bytes
+FROM media_assets
+WHERE asset_id = 'asset_001';
+```
+
+Standard queries on a BLOB column return the underlying struct (descriptor + 
bytes), not raw bytes:
+
+```sql
+SELECT asset_id, content.type, content.reference.external_path
+FROM media_assets;
+```
+
+#### Inline read mode
+
+| Property | Default | Description |
+|:---------|:--------|:------------|
+| `hoodie.read.blob.inline.mode` | `DESCRIPTOR` | Controls how `INLINE` BLOBs 
surface on read. `DESCRIPTOR` (default) returns an out-of-line-shaped reference 
pointing at the in-file coordinates of the bytes. No bytes are materialized. 
`CONTENT` materializes the raw inline bytes in the `data` field on every read. |
+| `hoodie.blob.batching.max.gap.bytes` | `4096` | Maximum gap between 
consecutive byte ranges before they are merged into a single read. Larger 
values reduce I/O calls at the cost of reading some unused bytes. |
+| `hoodie.blob.batching.lookahead.size` | `50` | Number of rows to buffer for 
batch-read detection. Larger values improve batching for sorted data but 
increase memory usage. |
+
+`CONTENT` mode is always used for internal operations (compaction, merge, log 
replay) regardless of
+this setting.
+
+:::caution Calling `read_blob()` on INLINE columns under DESCRIPTOR mode
+Under the default `DESCRIPTOR` mode, calling `read_blob()` on an `INLINE` BLOB 
column throws.
+The raw bytes are not materialized in the scan, so there is nothing for 
`read_blob()` to return.
+To read inline bytes with `read_blob()`, switch to `CONTENT` mode first:
+
+```sql
+SET hoodie.read.blob.inline.mode=CONTENT;
+SELECT asset_id, read_blob(content) AS raw_bytes
+FROM media_assets WHERE asset_id = 'asset_001';
+```
+
+This setting affects only `INLINE` columns. `OUT_OF_LINE` always fetches from 
the external path.
+:::
+
+`read_blob()` is a Spark SQL function; Hive, BigQuery, and other engines 
reading the underlying
+struct directly do not have it. Apply predicates before calling `read_blob()` 
to bound the bytes
+resolved per row.
+
+### Querying VARIANT Columns
+
+[`VARIANT`](sql_ddl.md#variant) columns hold semi-structured (JSON-like) data. 
Cast to `STRING` for
+JSON output:
+
+```sql
+SELECT event_id, cast(payload as STRING) AS payload_json FROM events;
+```
+
+VARIANT columns support `UPDATE`, `DELETE`, and `MERGE` on both COW and MOR 
tables.
+
+#### End-to-end example
+
+`hudi_vector_search` and `read_blob()` compose in a single query that returns 
both the matching
+rows and the materialized bytes for each:
+
+```sql
+SELECT image_id, category,
+       read_blob(image_bytes) AS resolved_bytes,
+       _hudi_distance
+FROM hudi_vector_search(
+    '/tmp/hudi_pets', 'embedding',
+    ARRAY(0.12, -0.03, /* ... */),
+    5, 'cosine'
+)
+ORDER BY _hudi_distance;
 ```
 
-See [Vector Search](vector_search.md) for the full API, supported metrics, and 
setup instructions.
+See [Unstructured Data Quick Start 
Guide](unstructured-data-quick-start-guide.md) for a full end-to-end 
walk-through.
 
 ### Query Indexes and Timeline
 
diff --git a/website/docs/storage_layouts.md b/website/docs/storage_layouts.md
index a88f584e3f9d..b069c914d5b1 100644
--- a/website/docs/storage_layouts.md
+++ b/website/docs/storage_layouts.md
@@ -28,9 +28,101 @@ base file formats.
 * columnar formats for vectorized reads, columnar compression and efficient 
column based access for analytics/data science.
 * row-oriented avro files for fast scans for reading whole records.
 * random access optimized HFiles for efficient searching for indexed records 
(based on 
[SSTable](https://github.com/facebook/rocksdb/wiki/A-Tutorial-of-RocksDB-SST-formats)
 format)
+* [Lance](#lance-base-file-format) files for native `VECTOR` storage and AI/ML 
workloads (Spark-only).
 
 ![Row vs Columnar File 
Format](/assets/images/row-col-based-base-file-formats.png)
 
+#### Lance base file format {#lance-base-file-format}
+
+[Lance](https://lancedb.github.io/lance/) is a pluggable base file format 
selected per table via
+`hoodie.table.base.file.format = 'lance'`. Hudi manages the table layer
+(timeline, metadata, schema, file groups, table services); Lance is the 
on-disk file format for
+base files. Log files for MOR tables remain Avro; log compaction merges Avro 
logs into Lance base
+files.
+
+```sql
+-- COW
+CREATE TABLE my_ai_table (
+    id        STRING,
+    embedding VECTOR(768),
+    metadata  STRING
+) USING hudi
+TBLPROPERTIES (
+    primaryKey = 'id',
+    type = 'cow',
+    hoodie.record.merger.impls = 'org.apache.hudi.DefaultSparkRecordMerger',
+    hoodie.table.base.file.format = 'lance'
+);
+
+-- MOR (Lance base + Avro logs)
+CREATE TABLE my_ai_table_mor (
+    id        STRING,
+    embedding VECTOR(768),
+    metadata  STRING
+) USING hudi
+TBLPROPERTIES (
+    primaryKey = 'id',
+    type = 'mor',
+    hoodie.record.merger.impls = 'org.apache.hudi.DefaultSparkRecordMerger',
+    hoodie.table.base.file.format = 'lance'
+);
+```
+
+Hudi table services on Lance-backed tables behave as follows. Compaction 
merges Avro log files
+into Lance base files. Clustering reorganizes records into new Lance files. 
Cleaning removes
+obsolete Lance file slices. Bloom filter indexing is supported; column-stats 
and partition-stats
+indices are automatically disabled for Lance base files. See 
[Indexes](indexes.md) for the
+supported set.
+
+Type-specific behavior on Lance:
+
+- `VECTOR` columns are stored natively as Lance 
`FixedSizeList<Float32/Float64, dim>` (FLOAT or
+  DOUBLE only; `INT8` is not supported on Lance and fails fast at write).
+- `BLOB` columns default to `DESCRIPTOR` read mode, same as Parquet.
+- `VARIANT` columns are not supported on Lance. Writing a table with a VARIANT 
column to Lance
+  throws `HoodieNotSupportedException`. Use Parquet for VARIANT tables.
+- Complex types (`STRUCT`, `ARRAY`, `MAP`) are supported as Lance columns.
+- `populateMetaFields=false` is supported. User-defined key generators work 
normally with
+  Lance-backed tables.
+
+Lance is Spark-only. Reading a Lance-backed table from Flink, Hive, Presto, or 
Trino throws
+`HoodieValidationException`. Lance files are also non-splittable: a single 
Spark task reads one
+Lance base file.
+
+The Lance JAR is not bundled in the Hudi distribution. Add the Lance Spark 
bundle that matches
+your Spark version to the Spark classpath:
+
+| Spark version | Bundle (Maven Central) |
+|:--------------|:-----------------------|
+| Spark 3.4 | `org.lance:lance-spark-bundle-3.4_2.12:0.4.0` |
+| Spark 3.5 | `org.lance:lance-spark-bundle-3.5_2.12:0.4.0` |
+| Spark 4.0 | `org.lance:lance-spark-bundle-4.0_2.13:0.4.0` |
+| Spark 4.1 | `org.lance:lance-spark-bundle-4.1_2.13:0.4.0` |
+
+```bash
+export LANCE_BUNDLE_JAR=/path/to/lance-spark-bundle-3.5_2.12-0.4.0.jar
+spark-shell --jars $HUDI_BUNDLE_JAR,$LANCE_BUNDLE_JAR
+```
+
+##### File sizing and memory
+
+| Property | Default | Description |
+|:---------|:--------|:------------|
+| `hoodie.lance.max.file.size` | `125829120` (120 MiB) | Target file size in 
bytes for Lance base files (analogous to `hoodie.parquet.max.file.size`). |
+| `hoodie.lance.write.allocator.size.bytes` | `268435456` (256 MiB) | Maximum 
size of the Arrow child allocator used for buffering in-flight batch data. 
Increase for tables with very large `BLOB` columns. |
+| `hoodie.lance.write.flush.byte.watermark` | `100663296` (96 MiB) | Byte-size 
threshold at which the current write batch is flushed. Must be less than 
`hoodie.lance.write.allocator.size.bytes`. |
+
+Arrow uses power-of-2 buffer doubling; the default 256 MiB allocator 
accommodates a 128 MiB doubling
+step with headroom. The default 96 MiB watermark (~3/8 of the allocator cap) 
leaves room for offset
+and validity buffers to double without exceeding the allocator limit. For 
tables with large `BLOB`
+columns, increase the allocator and watermark together (keep watermark ≈ 3/8 
of allocator).
+
+##### Mixed-format tables
+
+`hoodie.table.base.file.format` is set per table, so different tables in the 
same lakehouse can use
+different base file formats (Parquet, ORC, HFile, Lance) under a shared Hudi 
catalog and metadata
+table.
+
 ### Log Files 
 
 Log files store incremental changes (partial or full) to a base file, such as 
updates, inserts, and deletes, after the base file was created. 
diff --git a/website/docs/syncing_metastore.md 
b/website/docs/syncing_metastore.md
index 059f172de003..d857affd8178 100644
--- a/website/docs/syncing_metastore.md
+++ b/website/docs/syncing_metastore.md
@@ -327,6 +327,27 @@ If the table is synced with `mode=hms` or `mode=hiveql` 
against HMS 4.x, Hudi lo
 
 **JDBC connection failures surface separately.** With `mode=jdbc`, Hudi opens 
the JDBC connection eagerly when the sync client is constructed — before any 
Thrift call is attempted. A bad JDBC URL, missing driver, or wrong credentials 
therefore fails at startup with `HoodieHiveSyncException: Failed to create 
HiveMetaStoreClient` and the underlying JDBC exception as the cause in the 
stack trace. This is a configuration-error path, not an HMS API mismatch, and 
is the same behavior as `mode= [...]
 
+## Metastore mapping for VECTOR, BLOB, and VARIANT
+
+When the column types ([`VECTOR`](sql_ddl.md#vector), 
[`BLOB`](sql_ddl.md#blob),
+[`VARIANT`](sql_ddl.md#variant)) are synced to an external catalog, Hudi maps 
them to the catalog's
+native binary/struct types:
+
+| Type | Hive | BigQuery |
+|:-----|:-----|:---------|
+| `VECTOR` | `BINARY` | `BYTES` |
+| `BLOB` | `STRUCT<type:STRING, data:BINARY, 
reference:STRUCT<external_path:STRING, offset:BIGINT, length:BIGINT, 
managed:BOOLEAN>>` | Equivalent `STRUCT` fields |
+| `VARIANT` | `STRUCT<metadata:BINARY, value:BINARY>` | `STRUCT` with 
`metadata` and `value` fields (`BYTES`) |
+
+For `VECTOR`, the original `VECTOR(dim, elementType)` dimension and 
element-type metadata is
+preserved in `TBLPROPERTIES`/table descriptions so the table can be correctly 
reconstructed by
+Spark after a metastore round-trip.
+
+Engines that support `VARIANT` natively (Spark 4.0+, Flink 2.1+) read the 
table directly using the
+Parquet VARIANT annotation and do not go through the Hive/BigQuery metastore 
representation. The
+`read_blob()` SQL function is Spark-only. Hive and BigQuery queries on a BLOB 
column read the
+underlying struct directly.
+
 ## Writer Version Table Property
 
 Hudi 1.2.0 sync writes the table property `hudi_writer_version` (set to the 
Hudi version that last synced the table) to the Hive metastore entry on every 
sync. This allows tooling and metastore administrators to identify which Hudi 
version wrote a given table.
diff --git a/website/docs/ai-quick-start-guide.md 
b/website/docs/unstructured-data-quick-start-guide.md
similarity index 74%
rename from website/docs/ai-quick-start-guide.md
rename to website/docs/unstructured-data-quick-start-guide.md
index 9d8fa50ea16f..c0014ba815dc 100644
--- a/website/docs/ai-quick-start-guide.md
+++ b/website/docs/unstructured-data-quick-start-guide.md
@@ -1,21 +1,23 @@
 ---
-title: "AI Quick Start"
-keywords: [ hudi, ai, vector search, embeddings, unstructured data, blob, 
machine learning, image search, similarity]
-summary: "Store embeddings and images in a Hudi table, then find similar 
images with one SQL query using hudi_vector_search and read_blob"
+title: "Unstructured Data Quick Start"
+keywords: [ hudi, ai, machine learning, vector search, embeddings, 
unstructured data, blob, image search, similarity]
+summary: "Store embeddings and images in a Hudi table, then find similar 
images with hudi_vector_search and read_blob"
 toc: true
 last_modified_at: 2026-04-29T00:00:00-00:00
 ---
 
-This guide shows how Hudi 1.2.0 brings AI-native data types to the lakehouse. 
You will store image
-embeddings (`VECTOR`) and raw image bytes (`BLOB`) in a single Hudi table, 
then run a similarity
-search that finds the top-K nearest neighbors **and** materializes their 
images — in one SQL query.
+AI and machine learning pipelines (RAG, recommendation, multimodal search) 
need to store and query
+embeddings, raw bytes, and structured metadata side by side. Hudi's `VECTOR` 
and `BLOB` column
+types let you keep all of those in a single transactional table. This guide 
walks through them end
+to end: you will store image embeddings (`VECTOR`) and raw image bytes 
(`BLOB`) in a single Hudi
+table, then run a top-K similarity search and materialize the matching images 
in one SQL query.
 
 :::tip
 Want to try this locally? This guide is also available as an interactive 
Jupyter notebook.
 [Download the 
notebook](https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/00_main_demo.ipynb)
 and run it end-to-end on your machine.
 :::
 
-![Vector search results — query image on the left, top-5 nearest neighbors on 
the right](/assets/images/hudi_vector_search_results.jpg)
+![Vector search results: query image on the left, top-5 nearest neighbors on 
the right](/assets/images/hudi_vector_search_results.jpg)
 
 *Example output: a query image of a German Shorthaired pointer (left) and the 
five most similar images found by `hudi_vector_search`, with cosine similarity 
scores. Raw image bytes are materialized by `read_blob()` directly in the same 
query.*
 
@@ -45,7 +47,7 @@ HUDI_JAR = os.getenv("HUDI_BUNDLE_JAR", 
"hudi-spark3.5-bundle_2.12-1.2.0.jar")
 
 spark = (
     SparkSession.builder
-    .appName("Hudi-AI-QuickStart")
+    .appName("Hudi-Unstructured-Data-QuickStart")
     .config("spark.jars", HUDI_JAR)
     .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
     .config("spark.sql.extensions",
@@ -105,7 +107,7 @@ DIM = feats.shape[1]  # 1024
 
 ## 3. Create table and insert data
 
-Declare `VECTOR(1024)` for embeddings and `BLOB` for raw image bytes — 
first-class Hudi column types.
+Declare `VECTOR(1024)` for embeddings and `BLOB` for raw image bytes.
 
 ```python
 from pyspark.sql import Row
@@ -155,9 +157,9 @@ SELECT image_id, category, label,
 FROM staging;
 ```
 
-Key points:
-- **`VECTOR(1024)`** stores fixed-dimension embeddings for similarity search.
-- **`BLOB`** stores raw image bytes inline. For large objects, use 
`OUT_OF_LINE` to store a pointer instead — `read_blob()` resolves both modes 
transparently.
+Notes:
+- `VECTOR(1024)` stores fixed-dimension embeddings for similarity search.
+- `BLOB` stores raw image bytes inline. For large objects, use `OUT_OF_LINE` 
to store a pointer instead. `read_blob()` resolves both modes transparently.
 
 ## 4. Materialize a BLOB with `read_blob()`
 
@@ -183,13 +185,14 @@ LIMIT 5;
 +-----------+--------------------+----------+
 ```
 
-![A sample image retrieved via read_blob — a sleeping 
Beagle](/assets/images/hudi_read_blob_sample.jpg)
+![A sample image retrieved via read_blob, a sleeping 
Beagle](/assets/images/hudi_read_blob_sample.jpg)
 
-*Image bytes retrieved by `read_blob()`, decoded back to a PNG — confirming 
lossless round-trip through the Hudi BLOB column.*
+*Image bytes retrieved by `read_blob()`, decoded back to a PNG. The round-trip 
through the Hudi BLOB column is lossless.*
 
 ## 5. Vector search + BLOB retrieval in one query
 
-The headline: `hudi_vector_search` finds the top-K nearest neighbors by cosine 
similarity, and `read_blob()` materializes image bytes **only for the matching 
rows**.
+`hudi_vector_search` returns the top-K nearest neighbors by cosine similarity; 
`read_blob()`
+materializes image bytes only for the matching rows.
 
 ```sql
 SELECT image_id,
@@ -220,9 +223,7 @@ ORDER BY _hudi_distance;
 
 ![Vector search results panel](/assets/images/hudi_vector_search_results.jpg)
 
-*Query: German Shorthaired pointer (left). Top-5 results ranked by cosine 
similarity — the search correctly identifies same-breed images as the closest 
matches.*
-
-One SQL query. No external vector database. Embeddings, raw images, and 
metadata all live in one transactional Hudi table.
+*Query: German Shorthaired pointer (left). Top-5 results ranked by cosine 
similarity.*
 
 ## 6. Visualize results
 
@@ -253,8 +254,8 @@ plt.savefig("hudi_vector_search_results.png", dpi=150)
 | Topic | Link |
 |:------|:-----|
 | Full interactive notebook | 
[00_main_demo.ipynb](https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/00_main_demo.ipynb)
 |
-| VECTOR type reference | [Vector Search](vector_search.md) — element types, 
batch TVF, distance metrics |
-| BLOB type reference | [Unstructured Data](blob_unstructured_data.md) — 
inline vs. out-of-line, `read_blob()`, configs |
-| VARIANT type | [Semi-Structured Data](variant_type.md) — flexible JSON-like 
storage for LLM outputs, model metadata |
-| Lance file format | [Lance File Format](lance_file_format.md) — 
vector-optimized storage |
-| AI overview | [AI-Native Lakehouse](ai_overview.md) — architecture and use 
cases |
+| VECTOR type reference | [VECTOR in SQL DDL](sql_ddl.md#vector) · [SQL 
DML](sql_dml.md#inserting-vector-columns) · [DataFrame 
writes](writing_data.md#vector-via-dataframe) · [`hudi_vector_search` in SQL 
Queries](sql_queries.md#vector-similarity-search) |
+| BLOB type reference | [BLOB in SQL DDL](sql_ddl.md#blob) · [SQL 
DML](sql_dml.md#inserting-blob-columns) · [DataFrame 
writes](writing_data.md#blob-via-dataframe) · [`read_blob()` in SQL 
Queries](sql_queries.md#reading-blob-columns) |
+| VARIANT type | [VARIANT in SQL DDL](sql_ddl.md#variant) · [SQL 
DML](sql_dml.md#inserting-variant-columns) · [DataFrame 
writes](writing_data.md#variant-via-dataframe) · [Querying 
VARIANT](sql_queries.md#querying-variant-columns) |
+| Lance file format | [Storage Layouts → 
Lance](storage_layouts.md#lance-base-file-format) · [DataFrame 
writes](writing_data.md#lance-base-file-format-via-dataframe) |
+| AI lakehouse use cases | [Use Cases → AI 
Lakehouse](use_cases.md#ai-lakehouse) |
diff --git a/website/docs/use_cases.md b/website/docs/use_cases.md
index 251df6187611..d0b6cecdee11 100644
--- a/website/docs/use_cases.md
+++ b/website/docs/use_cases.md
@@ -97,7 +97,42 @@ For the more curious, a more detailed explanation of the 
benefits of _incrementa
 - The idea with MoR is to reduce write costs/latencies, by writing delta logs 
(Hudi), positional delete files (iceberg). Hudi employs about 4 types of 
indexing to quickly locate the file that the updates records belong to. Formats 
relying on a scan of the table can quickly bottleneck on write performance. e.g 
updating 1GB into a 1TB table every 5-10 mins.
 - Hudi is the only lakehouse storage system that natively supports event time 
ordering and late data handling for streaming workloads where MoR is employed 
heavily. 
 
+## AI-Native Data Lakehouse
 
+The rise of large language models (LLMs) and multimodal AI has sparked a 
fundamental paradigm shift in data processing, moving the data lakehouse's 
primary workload from traditional ETL/BI analytics toward feature engineering, 
model training, and Retrieval-Augmented Generation (RAG). Traditional data 
lakehouses were natively designed around structured, tabular data, leaving 
modern AI applications awkwardly fragmented. These advanced workloads requires 
seamlessly combining structured meta [...]
 
+Oftentimes, managing this multi-modal reality forced engineering teams to 
maintain separate, siloed infrastructure. Raw binaries were stored in unmanaged 
object stores, while embeddings were synced out to standalone, external vector 
databases. This fragmented architecture introduced a severe "sync dilemma": it 
creates multiple data copies, drives up operational infrastructure costs, 
suffers from a lack of real-time responsiveness, and entirely breaks 
transactional consistency across the  [...]
 
+Apache Hudi delivers an AI-native lakehouse architecture by supporting 
multi-modal data types within the same table schema. Large binary objects 
([`BLOB`](sql_ddl.md#blob)), dense semantic embeddings 
([`VECTOR`](sql_ddl.md#vector)), and flexible semi-structured payloads 
([`VARIANT`](sql_ddl.md#variant)) reside natively within the same Hudi data 
files. By managing multi-modal data inside the same lakehouse rather than 
separate infrastructure, unstructured data inherits Hudi's ACID transac [...]
 
+The table below serves as the technical entry point for interacting with 
AI-native data types in Hudi:
+
+| Type | Declare | Write | Read |
+|:-----|:--------|:------|:-----|
+| [`VECTOR`](sql_ddl.md#vector) | [DDL](sql_ddl.md#vector) | [SQL 
DML](sql_dml.md#inserting-vector-columns) · 
[DataFrame](writing_data.md#vector-via-dataframe) | [`hudi_vector_search` 
TVF](sql_queries.md#vector-similarity-search) |
+| [`BLOB`](sql_ddl.md#blob) | [DDL](sql_ddl.md#blob) | [SQL 
DML](sql_dml.md#inserting-blob-columns) · 
[DataFrame](writing_data.md#blob-via-dataframe) | 
[`read_blob()`](sql_queries.md#reading-blob-columns) |
+| [`VARIANT`](sql_ddl.md#variant) | [DDL](sql_ddl.md#variant) | [SQL 
DML](sql_dml.md#inserting-variant-columns) · 
[DataFrame](writing_data.md#variant-via-dataframe) | [Querying 
VARIANT](sql_queries.md#querying-variant-columns) |
+| [Lance file format](storage_layouts.md#lance-base-file-format) | 
[TBLPROPERTIES](sql_ddl.md#lance-base-file-format) | 
[DataFrame](writing_data.md#lance-base-file-format-via-dataframe) | 
(transparent; Spark-only) |
+
+For a worked example covering image embeddings, raw bytes, and similarity 
search, see the
+[Unstructured Data Quick Start Guide](unstructured-data-quick-start-guide.md).
+
+| Use case | Types and features |
+|:---------|:-------------------|
+| Image / video search | `VECTOR` embeddings + `BLOB` raw frames + 
`hudi_vector_search` |
+| Retrieval-augmented generation (RAG) | `VECTOR` chunk embeddings retrieved 
to assemble LLM context |
+| LLM output management | `VARIANT` for response payloads + `VECTOR` for 
semantic indexing |
+| Recommendation systems | `VECTOR` similarity for collaborative filtering; 
incremental re-embedding via incremental queries |
+| Content moderation | `BLOB` raw content + `VECTOR` content embeddings + 
incremental processing |
+| Multimodal analytics | Structured metadata + `VECTOR` embeddings + `BLOB` 
raw data in one table |
+| ML feature store | `VECTOR` for feature embeddings, `VARIANT` for sparse 
feature maps; time-travel for point-in-time retrieval |
+| Experiment tracking | `VARIANT` for heterogeneous model configs and metrics; 
incremental queries for latest runs |
+| Data labeling pipelines | `BLOB` for raw data, incremental queries for 
unlabeled data, transactional label updates |
+
+### Why Hudi for AI workloads
+
+- Unified storage across modalities: Raw binary objects 
([`BLOB`](sql_ddl.md#blob)), semantic vector embeddings 
([`VECTOR`](sql_ddl.md#vector)), semi-structured payloads 
([`VARIANT`](sql_ddl.md#variant)), and standard analytical columns live 
together in a single Hudi table, under the same [timeline](timeline.md), 
[schema evolution](schema_evolution.md), and access controls.
+- Incremental embedding pipelines: Hudi's [incremental 
query](sql_queries.md#incremental-query) lets feature pipelines re-embed only 
the rows that changed since a given commit, instead of reprocessing the full 
batch.
+- Multi-modal transactional guarantees: Vector updates, binary changes, and 
metadata modifications commit atomically. A similarity query joining the 
`VECTOR` column with the `BLOB` column never observes a state in which the 
embedding refers to a deleted or stale binary asset.
+- Automated table services: [File sizing](file_sizing.md), 
[cleaning](cleaning.md), [compaction](compaction.md), and 
[clustering](clustering.md) apply to tables that contain `VECTOR`, `BLOB`, and 
`VARIANT` columns.
+- Open, multi-engine access: Hudi tables are written and read by [Apache 
Spark](quick-start-guide.md) and [Apache Flink](flink-quick-start-guide.md), 
and by the upcoming native [Python / Rust client 
(hudi-rs)](python-rust-quick-start-guide.md) reading the underlying Parquet 
representation directly.
diff --git a/website/docs/variant_type.md b/website/docs/variant_type.md
deleted file mode 100644
index fdcf6f54b54b..000000000000
--- a/website/docs/variant_type.md
+++ /dev/null
@@ -1,362 +0,0 @@
----
-title: "Semi-Structured Data (VARIANT)"
-keywords: [ hudi, variant, semi-structured, json, schemaless, shredding, 
parse_json, flexible schema]
-summary: "Store and query semi-structured JSON-like data in Hudi tables using 
the VARIANT type, with optional shredding for query performance"
-toc: true
-last_modified_at: 2026-05-27T00:00:00-00:00
----
-
-import Tabs from '@theme/Tabs';
-import TabItem from '@theme/TabItem';
-
-The `VARIANT` type stores semi-structured, JSON-like data directly in Hudi 
tables. Unlike rigid schemas
-where every column and type must be declared upfront, VARIANT columns accept 
arbitrary nested structures —
-objects, arrays, scalars — and store them efficiently in a self-describing 
binary format.
-
-This is particularly useful for AI and data engineering workloads where data 
shapes evolve rapidly:
-model metadata, LLM outputs, feature dictionaries, API responses, and event 
payloads.
-
-## Overview
-
-VARIANT represents semi-structured data as two binary fields:
-
-| Field | Description |
-|:------|:------------|
-| `metadata` | Encodes field names, types, and structure for efficient access |
-| `value` | The actual data payload |
-
-This encoding supports any JSON-compatible value: objects, arrays, strings, 
numbers, booleans, and null.
-Unlike storing raw JSON strings, VARIANT provides type-aware binary storage 
with optional **shredding**
-for columnar query performance.
-
-## Creating Tables with VARIANT Columns
-
-<Tabs
-groupId="api-style"
-defaultValue="sql4"
-values={[
-{ label: 'Spark SQL (4.0+)', value: 'sql4', },
-{ label: 'Spark SQL (3.x)', value: 'sql3', },
-{ label: 'DataFrame API', value: 'dataframe', },
-]}
->
-<TabItem value="sql4">
-
-Spark 4.0+ has native `VARIANT` type support:
-
-```sql
-CREATE TABLE events (
-    event_id  STRING,
-    payload   VARIANT,
-    ts        BIGINT
-) USING hudi
-TBLPROPERTIES (
-    primaryKey = 'event_id',
-    preCombineField = 'ts'
-);
-```
-
-</TabItem>
-<TabItem value="sql3">
-
-On Spark 3.x, use the struct representation with metadata tagging:
-
-```sql
-CREATE TABLE events (
-    event_id  STRING,
-    payload   STRUCT<value: BINARY, metadata: BINARY>,
-    ts        BIGINT
-) USING hudi
-TBLPROPERTIES (
-    primaryKey = 'event_id',
-    preCombineField = 'ts'
-);
-```
-
-Hudi recognizes this struct pattern and treats it as a logical VARIANT.
-
-</TabItem>
-<TabItem value="dataframe">
-
-```python
-from pyspark.sql.types import *
-
-# Option 1: Spark 4.0+ native VariantType
-# schema = StructType([
-#     StructField("event_id", StringType()),
-#     StructField("payload", VariantType()),
-#     StructField("ts", LongType()),
-# ])
-
-# Option 2: Struct with metadata tag (works on Spark 3.x and 4.0+)
-from pyspark.sql.types import MetadataBuilder
-variant_metadata = MetadataBuilder() \
-    .putString("hudi_type", "VARIANT") \
-    .build()
-variant_struct = StructType([
-    StructField("metadata", BinaryType()),
-    StructField("value", BinaryType()),
-])
-schema = StructType([
-    StructField("event_id", StringType()),
-    StructField("payload", variant_struct, metadata=variant_metadata),
-    StructField("ts", LongType()),
-])
-```
-
-</TabItem>
-</Tabs>
-
-## Writing VARIANT Data
-
-<Tabs
-groupId="api-style"
-defaultValue="sql4"
-values={[
-{ label: 'Spark SQL (4.0+)', value: 'sql4', },
-{ label: 'DataFrame API', value: 'dataframe', },
-]}
->
-<TabItem value="sql4">
-
-Use `parse_json()` to convert JSON strings into VARIANT values:
-
-```sql
-INSERT INTO events VALUES
-    ('evt_001', parse_json('{"action": "click", "x": 120, "y": 450}'), 1000),
-    ('evt_002', parse_json('{"action": "purchase", "items": ["sku_a", 
"sku_b"], "total": 59.99}'), 1001),
-    ('evt_003', parse_json('"simple string value"'), 1002);
-```
-
-VARIANT accepts any valid JSON: objects, arrays, strings, numbers, booleans, 
and null.
-
-</TabItem>
-<TabItem value="dataframe">
-
-```python
-df = spark.sql("""
-    SELECT
-        'evt_001' as event_id,
-        parse_json('{"action": "click", "x": 120, "y": 450}') as payload,
-        1000 as ts
-""")
-
-df.write.format("hudi") \
-    .option("hoodie.table.name", "events") \
-    .option("hoodie.datasource.write.recordkey.field", "event_id") \
-    .option("hoodie.datasource.write.precombine.field", "ts") \
-    .mode("append") \
-    .save("/path/to/events")
-```
-
-</TabItem>
-</Tabs>
-
-## Querying VARIANT Data
-
-### Reading VARIANT as JSON
-
-Cast VARIANT to string for JSON output:
-
-```sql
-SELECT event_id, cast(payload as STRING) as payload_json
-FROM events;
-```
-
-```
-+----------+--------------------------------------------------------------+
-|  event_id|                                                  payload_json|
-+----------+--------------------------------------------------------------+
-|   evt_001|                    {"action":"click","x":120,"y":450}        |
-|   evt_002| {"action":"purchase","items":["sku_a","sku_b"],"total":59.99}|
-|   evt_003|                                      "simple string value"   |
-+----------+--------------------------------------------------------------+
-```
-
-### Full DML Support
-
-VARIANT columns support all standard DML operations:
-
-```sql
--- UPDATE
-UPDATE events SET payload = parse_json('{"action": "click", "x": 200}')
-WHERE event_id = 'evt_001';
-
--- DELETE
-DELETE FROM events WHERE event_id = 'evt_003';
-
--- MERGE
-MERGE INTO events target
-USING new_events source ON target.event_id = source.event_id
-WHEN MATCHED THEN UPDATE SET payload = source.payload, ts = source.ts
-WHEN NOT MATCHED THEN INSERT *;
-```
-
-### Table Type Support
-
-VARIANT works with both COW (Copy-on-Write) and MOR (Merge-on-Read) table 
types. MOR tables
-generate log files for VARIANT changes, which are merged during compaction.
-
-## VARIANT Shredding
-
-Shredding extracts frequently-accessed fields from VARIANT into dedicated 
typed Parquet columns. This
-gives columnar query performance for known fields while retaining the 
flexibility of VARIANT for
-the rest.
-
-### Unshredded (Default)
-
-```
-group variant {
-  required binary metadata;
-  required binary value;
-}
-```
-
-All data is in the opaque binary fields. Every query must decode the full 
VARIANT.
-
-### Shredded (Scalar)
-
-```
-group variant {
-  required binary metadata;
-  optional binary value;
-  optional int64  typed_value;   -- extracted scalar field
-}
-```
-
-The `typed_value` column stores a typed extraction that Spark can read 
directly without decoding
-the binary payload.
-
-### Shredded (Object)
-
-```
-group variant {
-  required binary metadata;
-  optional binary value;
-  optional group typed_value {
-    optional group action {
-      optional binary value;
-      optional string typed_value;    -- "click", "purchase", etc.
-    }
-    optional group total {
-      optional binary value;
-      optional double typed_value;    -- 59.99
-    }
-  }
-}
-```
-
-Each known field gets its own sub-column. Fields not present in a given row 
fall back to the
-binary `value` field.
-
-## Cross-Engine Compatibility
-
-| Engine | VARIANT Support |
-|:-------|:---------------|
-| **Spark 4.0** | Native `VariantType` — full read/write/query for COW and 
MOR; native `df.write` with `VariantType` on the V1 datasource |
-| **Spark 4.1** | Native `VariantType` — full read/write/query for COW and MOR 
|
-| **Spark 3.x** | Reads as `STRUCT<value: BINARY, metadata: BINARY>` — 
backward compatible |
-| **Flink** | Native `VARIANT` operations are not supported. Tables written by 
Spark with VARIANT columns can be read in Flink only as the underlying 
`ROW<metadata BYTES, value BYTES>` struct. |
-
-A VARIANT table written by Spark 4.0/4.1 can be read by Spark 3.x using the 
underlying binary struct, or by Flink as `ROW<metadata BYTES, value BYTES>`. 
The binary encoding is engine-independent.
-
-## Metastore Sync
-
-When syncing VARIANT column schemas to external catalogs, Hudi maps the binary 
encoding to the
-target catalog's native struct type:
-
-| Catalog | VARIANT representation |
-|:--------|:----------------------|
-| Hive | `STRUCT<metadata:BINARY, value:BINARY>` |
-| BigQuery | `STRUCT` with `metadata` and `value` fields (`BYTES` type) |
-
-Query engines that support VARIANT (Spark 4.0+, Flink 2.1+) read the table 
directly using the
-Parquet VARIANT annotation and do not go through the Hive/BigQuery metastore 
representation.
-
-## Use Cases for AI Workloads
-
-### LLM Output Storage
-
-Store raw LLM responses with varying structures:
-
-```sql
-CREATE TABLE llm_outputs (
-    request_id  STRING,
-    model       STRING,
-    response    VARIANT,
-    tokens_used INT,
-    ts          BIGINT
-) USING hudi TBLPROPERTIES (primaryKey = 'request_id', preCombineField = 'ts');
-
-INSERT INTO llm_outputs VALUES (
-    'req_001', 'claude-sonnet',
-    parse_json('{"text": "...", "stop_reason": "end_turn", "usage": {"input": 
500, "output": 200}}'),
-    700, 1000
-);
-```
-
-### Model Metadata & Experiment Tracking
-
-Store heterogeneous model configurations and metrics:
-
-```sql
-CREATE TABLE experiments (
-    run_id        STRING,
-    model_config  VARIANT,    -- hyperparameters, architecture, etc.
-    metrics       VARIANT,    -- loss, accuracy, custom metrics
-    ts            BIGINT
-) USING hudi TBLPROPERTIES (primaryKey = 'run_id', preCombineField = 'ts');
-```
-
-### Feature Dictionaries
-
-Store sparse or variable-length feature maps for ML:
-
-```sql
-CREATE TABLE user_features (
-    user_id    STRING,
-    features   VARIANT,       -- {"age": 25, "interests": [...], "embedding": 
[...]}
-    updated_at BIGINT
-) USING hudi TBLPROPERTIES (primaryKey = 'user_id', preCombineField = 
'updated_at');
-```
-
-### API Response Archival
-
-Ingest and store API responses with evolving schemas:
-
-```sql
-CREATE TABLE api_responses (
-    request_id  STRING,
-    endpoint    STRING,
-    status_code INT,
-    body        VARIANT,
-    ts          BIGINT
-) USING hudi TBLPROPERTIES (primaryKey = 'request_id', preCombineField = 'ts');
-```
-
-## Best Practices
-
-1. **Use VARIANT for evolving schemas** — When the data shape changes 
frequently or varies across
-   records, VARIANT avoids constant schema migrations.
-
-2. **Prefer typed columns for frequently-queried fields** — If you always 
filter on a specific field,
-   make it a top-level typed column rather than burying it in VARIANT. 
Combine: use typed columns for
-   stable fields and VARIANT for the rest.
-
-3. **Consider shredding for hot fields** — If certain VARIANT fields are 
queried heavily, shredding
-   extracts them into columnar storage for better performance.
-
-4. **Use `parse_json()` for ingestion** — On Spark 4.0+, `parse_json()` is the 
standard way to create
-   VARIANT values from JSON strings.
-
-5. **Use `cast(v as STRING)` for export** — Convert VARIANT back to JSON for 
downstream consumers
-   that expect text.
-
-## Limitations
-
-- Native `VARIANT` keyword in DDL requires Spark 4.0+. On Spark 3.x, use the 
struct representation.
-- VARIANT shredding configuration is determined at write time based on the 
schema definition.
-- Complex path expressions within VARIANT may require casting to STRING and 
then using JSON functions.
-- Native VARIANT operations are not supported on Flink. VARIANT columns 
surface as `ROW<metadata BYTES, value BYTES>` and can be read but not natively 
decoded or queried as a variant.
-- VARIANT columns are **not supported** on Lance-backed tables. Use Parquet as 
the base file format
-  for tables containing VARIANT columns.
diff --git a/website/docs/vector_search.md b/website/docs/vector_search.md
deleted file mode 100644
index 4e6687057a6c..000000000000
--- a/website/docs/vector_search.md
+++ /dev/null
@@ -1,274 +0,0 @@
----
-title: "Vector Search"
-keywords: [ hudi, vector, search, embeddings, similarity, cosine, nearest 
neighbor, VECTOR type]
-summary: "Store embedding vectors in Hudi tables and run vector similarity 
search using the VECTOR type and hudi_vector_search TVF"
-toc: true
-last_modified_at: 2026-05-27T00:00:00-00:00
----
-
-import Tabs from '@theme/Tabs';
-import TabItem from '@theme/TabItem';
-
-Hudi's `VECTOR` type and `hudi_vector_search` table-valued function (TVF) 
bring native similarity search
-to the data lakehouse. Store embeddings alongside your structured data and 
query them with familiar Spark SQL —
-no external vector database required.
-
-## Storage Format
-
-VECTOR columns are stored in Parquet as `FIXED_LEN_BYTE_ARRAY` — a 
fixed-length binary encoding of the
-float array. Hudi stamps `hudi_type` metadata on the column so the Spark 
reader knows to decode the
-bytes back into a typed array.
-
-On **Lance** tables, VECTOR columns are stored natively as Lance 
`FixedSizeList<Float32/Float64, dim>`,
-so embeddings are written without conversion overhead at the file-format 
layer. See
-[Lance File Format](lance_file_format.md) for details.
-
-The `VECTOR(dim[, elementType])` DDL syntax works across Spark 3.4, 3.5, 4.0, 
and 4.1. Hudi's SQL
-parser normalizes `VECTOR(128, FLOAT)` to `VECTOR(128)` (FLOAT is the default 
element type).
-Nesting VECTOR inside STRUCT, ARRAY, or MAP is not supported.
-
-## VECTOR Type
-
-The `VECTOR(dim[, elementType])` type declares a column that stores 
fixed-dimensional embedding vectors.
-Dimension metadata enables the query engine to validate inputs and optimize 
search.
-
-### Element Types
-
-| Element Type | Description | Storage | Use Case |
-|:-------------|:------------|:--------|:---------|
-| **FLOAT** (default) | 32-bit float | `ArrayType(FloatType)` | Standard 
embeddings (OpenAI, Cohere, etc.) |
-| **DOUBLE** | 64-bit double | `ArrayType(DoubleType)` | High-precision 
scientific embeddings |
-| **INT8** / **BYTE** | 8-bit signed integer | `ArrayType(ByteType)` | 
Quantized embeddings for storage efficiency |
-
-```sql
--- Default (FLOAT)
-embedding VECTOR(768)
-
--- Explicit element types
-embedding VECTOR(768, FLOAT)
-embedding VECTOR(768, DOUBLE)
-embedding VECTOR(256, INT8)
-```
-
-### Declaring VECTOR Columns
-
-<Tabs
-groupId="api-style"
-defaultValue="sql"
-values={[
-{ label: 'Spark SQL', value: 'sql', },
-{ label: 'DataFrame API', value: 'dataframe', },
-]}
->
-<TabItem value="sql">
-
-```sql
-CREATE TABLE products (
-    product_id   STRING,
-    name         STRING,
-    description  STRING,
-    embedding    VECTOR(768)
-) USING hudi
-TBLPROPERTIES (
-    primaryKey = 'product_id',
-    type = 'cow',
-    hoodie.record.merger.impls = 'org.apache.hudi.DefaultSparkRecordMerger',
-    hoodie.datasource.write.base.file.format = 'parquet'
-);
-```
-
-When using SQL DDL, Hudi's parser automatically stamps the `VECTOR(dim)` 
metadata on the column.
-
-</TabItem>
-<TabItem value="dataframe">
-
-```python
-import pyarrow as pa
-
-schema = pa.schema([
-    pa.field("product_id", pa.string()),
-    pa.field("name",       pa.string()),
-    pa.field("embedding",  pa.list_(pa.float32()),
-             metadata={b"hudi_type": b"VECTOR(768)"}),
-])
-```
-
-When using the DataFrame API, you must manually stamp `hudi_type` metadata on 
the column via PyArrow.
-This metadata is what distinguishes a `VECTOR` column from a regular array 
column.
-
-</TabItem>
-</Tabs>
-
-### Writing Vectors
-
-Vectors are written as arrays of floats. Both the DataFrame API and SQL accept 
standard array syntax:
-
-```sql
-INSERT INTO products VALUES (
-    'prod_001', 'Running Shoes', 'Lightweight trail runner',
-    ARRAY(0.123, -0.456, 0.789, ...)   -- 768 floats
-);
-```
-
-## hudi_vector_search TVF
-
-The `hudi_vector_search` table-valued function returns the `top_k` rows from a 
Hudi table whose
-VECTOR column is closest to a given query vector under a chosen distance 
metric.
-
-### Syntax
-
-```sql
-SELECT *
-FROM hudi_vector_search(
-    table_name,       -- STRING: name of the Hudi table
-    vector_column,    -- STRING: name of the VECTOR column
-    query_vector,     -- ARRAY<FLOAT>: the query embedding
-    top_k,            -- INT: number of nearest neighbors to return
-    [distance_metric], -- STRING: 'cosine' (default), 'l2', or 'dot_product'
-    [algorithm]        -- STRING: 'brute_force' (default)
-)
-```
-
-### Parameters
-
-| Parameter | Type | Default | Description |
-|:----------|:-----|:--------|:------------|
-| `table_name` | STRING | (required) | The Hudi table to search. Can be a 
registered table name or a path. |
-| `vector_column` | STRING | (required) | The name of the VECTOR column to 
search against. |
-| `query_vector` | ARRAY&lt;FLOAT&gt; | (required) | The query embedding. Must 
match the declared dimension and element type of the VECTOR column. |
-| `top_k` | INT | (required) | The number of nearest neighbors to return. Must 
be a positive integer. |
-| `distance_metric` | STRING | `'cosine'` | Distance metric: `'cosine'`, 
`'l2'`, or `'dot_product'`. |
-| `algorithm` | STRING | `'brute_force'` | Search algorithm. Currently only 
`'brute_force'` is supported. |
-
-### Return Schema
-
-The TVF returns all columns from the source table (excluding the embedding 
column) plus:
-
-| Column | Type | Description |
-|:-------|:-----|:------------|
-| `_hudi_distance` | DOUBLE | The computed distance between the query vector 
and each result. Lower values indicate greater similarity. |
-
-Results are ordered by `_hudi_distance` ascending — closest matches first.
-
-## hudi_vector_search_batch TVF
-
-For searching with multiple query vectors at once, use the batch variant:
-
-### Syntax
-
-```sql
-SELECT *
-FROM hudi_vector_search_batch(
-    corpus_table,           -- STRING: table to search
-    corpus_embedding_col,   -- STRING: VECTOR column in corpus
-    query_table,            -- STRING: table containing query vectors
-    query_embedding_col,    -- STRING: VECTOR column in query table
-    top_k,                  -- INT: neighbors per query
-    [distance_metric],      -- STRING: 'cosine' (default)
-    [algorithm]             -- STRING: 'brute_force' (default)
-)
-```
-
-### Return Schema (Batch)
-
-Returns corpus columns + query columns + distance info:
-
-| Column | Type | Description |
-|:-------|:-----|:------------|
-| `_hudi_distance` | DOUBLE | Distance between query and corpus vector |
-| `_hudi_query_index` | LONG | Index identifying which query vector produced 
this result |
-
-If corpus and query tables share column names, query columns are prefixed with 
`_hudi_query_`.
-
-### Distance Metrics
-
-| Metric | Formula | Range | Best for |
-|:-------|:--------|:------|:---------|
-| **cosine** | 1 - cos(a, b), clamped to [0, 2] | [0, 2] | Normalized 
embeddings (most common). Returns 1.0 for zero vectors. |
-| **l2** | sqrt(sum((a[i] - b[i])^2)) | [0, +inf) | Raw (unnormalized) 
embeddings |
-| **dot_product** | -(a &middot; b) | (-inf, +inf) | Maximum inner product 
search. Negated so ascending sort = most similar. |
-
-:::tip
-For best results with cosine distance, **L2-normalize your embeddings** before 
writing them to the table.
-Most embedding models (OpenAI, Cohere, sentence-transformers) output 
normalized vectors by default.
-If yours does not, normalize during ingestion:
-
-```python
-embedding = embedding / np.linalg.norm(embedding)
-```
-:::
-
-### Examples
-
-**Find similar products:**
-
-```sql
-SELECT product_id, name, _hudi_distance AS distance
-FROM hudi_vector_search(
-    'products', 'embedding',
-    ARRAY(0.12, -0.03, 0.87, ...),  -- query embedding
-    10,                               -- top 10
-    'cosine'
-)
-ORDER BY distance;
-```
-
-**RAG context retrieval:**
-
-```sql
--- Retrieve the 5 most relevant document chunks for an LLM prompt
-SELECT chunk_id, text_content, _hudi_distance
-FROM hudi_vector_search(
-    'document_chunks', 'embedding',
-    ARRAY(...),  -- embedding of the user's question
-    5, 'cosine'
-)
-WHERE _hudi_distance < 0.3;  -- optional distance threshold
-```
-
-**Cross-modal search (text-to-image):**
-
-```sql
--- Using CLIP embeddings, find images matching a text query
-SELECT image_id, caption, _hudi_distance
-FROM hudi_vector_search(
-    'image_catalog', 'clip_embedding',
-    ARRAY(...),  -- text embedding from CLIP
-    20, 'cosine'
-);
-```
-
-## Best Practices
-
-1. **Normalize embeddings** — Pre-normalize embeddings (L2 norm = 1) for 
cosine distance. This yields
-   more consistent results and slightly faster search.
-
-2. **Right-size your dimensions** — Higher dimensions capture more information 
but increase storage and
-   search cost. Many use cases work well with 384–1024 dimensions.
-
-3. **Use incremental processing** — When new data arrives, only embed and 
write the new records.
-   Hudi's incremental query capabilities make this straightforward.
-
-## Constraints
-
-- VECTOR columns must be **top-level fields** — nesting inside STRUCT, ARRAY, 
or MAP is not supported.
-- The query vector's element type must **exactly match** the corpus 
embedding's element type (no implicit casting).
-- VECTOR dimension and element type **cannot be changed** after table creation 
via schema evolution.
-- **Flink cannot read VECTOR columns.** VECTOR data is stored as Parquet 
`FIXED_LEN_BYTE_ARRAY`, which
-  Flink's Parquet reader does not decode back into a typed array. Flink can 
still read all **other**
-  columns in a table that contains a VECTOR column — only the VECTOR column 
itself is inaccessible.
-  Use Spark to query VECTOR columns.
-
-## Metastore Sync
-
-When syncing VECTOR column schemas to external catalogs, Hudi maps the binary 
encoding to the
-target catalog's native binary type, preserving the original VECTOR metadata 
in table properties:
-
-| Catalog | VECTOR representation |
-|:--------|:---------------------|
-| Hive | `BINARY` |
-| BigQuery | `BYTES` |
-
-The `VECTOR(dim, elementType)` dimension and element-type metadata is 
preserved in
-`TBLPROPERTIES`/table descriptions so the table can be correctly reconstructed 
by Spark even after
-a metastore round-trip.
diff --git a/website/docs/writing_data.md b/website/docs/writing_data.md
index dcdc5a33bd81..a0a3e1ca985c 100644
--- a/website/docs/writing_data.md
+++ b/website/docs/writing_data.md
@@ -418,6 +418,137 @@ Then any record you want to delete you can mark 
`_hoodie_is_deleted` as true:
 {"ts": 0.0, "uuid": "19tdb048-c93e-4532-adf9-f61ce6afe10", "rank": 1045, 
"partitionpath": "americas/brazil/sao_paulo", "_hoodie_is_deleted" : true}
 ```
 
+### Writing VECTOR, BLOB, and VARIANT Columns
+
+`VECTOR`, `BLOB`, and `VARIANT` columns can be written using SQL `INSERT`, see 
[SQL DML](sql_dml.md#inserting-vector-columns); the
+DataFrame API equivalents are below.
+
+#### VECTOR via DataFrame
+
+Add `hudi_type` metadata on the VECTOR column so the writer recognizes it:
+
+```python
+import pyarrow as pa
+
+schema = pa.schema([
+    pa.field("product_id", pa.string()),
+    pa.field("embedding",  pa.list_(pa.float32()),
+             metadata={b"hudi_type": b"VECTOR(768)"}),
+])
+```
+
+#### BLOB via DataFrame
+
+A BLOB column is internally a struct (see [BLOB](sql_ddl.md#blob)). Build it 
as a Spark `Row`:
+
+```python
+from pyspark.sql import Row
+
+with open("logo.png", "rb") as f:
+    raw_bytes = f.read()
+
+row = Row(
+    asset_id="asset_001",
+    file_name="logo.png",
+    mime_type="image/png",
+    file_size=len(raw_bytes),
+    content=Row(type="INLINE", data=raw_bytes, reference=None),
+)
+```
+
+For PyArrow schemas, declare the struct explicitly:
+
+```python
+import pyarrow as pa
+
+schema = pa.schema([
+    pa.field("asset_id",  pa.string()),
+    pa.field("file_name", pa.string()),
+    pa.field("mime_type", pa.string()),
+    pa.field("file_size", pa.int64()),
+    pa.field("content",   pa.struct([
+        pa.field("type",      pa.string()),
+        pa.field("data",      pa.binary()),
+        pa.field("reference", pa.struct([
+            pa.field("external_path", pa.string()),
+            pa.field("offset",        pa.int64()),
+            pa.field("length",        pa.int64()),
+            pa.field("managed",       pa.bool_()),
+        ])),
+    ]), metadata={b"hudi_type": b"BLOB"}),
+])
+```
+
+#### VARIANT via DataFrame
+
+VARIANT writes via the DataFrame API require Spark 4.0+. Use native 
`VariantType`:
+
+```python
+from pyspark.sql.types import StructType, StructField, StringType, LongType, 
VariantType
+
+schema = StructType([
+    StructField("event_id", StringType()),
+    StructField("payload",  VariantType()),
+    StructField("ts",       LongType()),
+])
+```
+
+Alternatively, declare the underlying struct and tag the outer field with 
`hudi_type=VARIANT`. The
+struct must contain exactly two **non-nullable** `BinaryType` fields named 
`metadata` and `value`,
+otherwise the writer throws `IllegalArgumentException: Invalid variant schema 
structure`. The column
+round-trips as native `VariantType` on read:
+
+```python
+from pyspark.sql.types import StructType, StructField, StringType, LongType, 
BinaryType, MetadataBuilder
+
+variant_metadata = MetadataBuilder().putString("hudi_type", "VARIANT").build()
+variant_struct = StructType([
+    StructField("metadata", BinaryType(), nullable=False),
+    StructField("value",    BinaryType(), nullable=False),
+])
+schema = StructType([
+    StructField("event_id", StringType()),
+    StructField("payload",  variant_struct, metadata=variant_metadata),
+    StructField("ts",       LongType()),
+])
+```
+
+The simplest path for constructing VARIANT values from JSON strings is to 
build the DataFrame via
+SQL and then write it:
+
+```python
+df = spark.sql("""
+    SELECT 'evt_001' AS event_id,
+           parse_json('{"action": "click", "x": 120, "y": 450}') AS payload,
+           1000 AS ts
+""")
+df.write.format("hudi") \
+    .option("hoodie.table.name", "events") \
+    .option("hoodie.datasource.write.recordkey.field", "event_id") \
+    .option("hoodie.datasource.write.precombine.field", "ts") \
+    .mode("append") \
+    .save("/path/to/table")
+```
+
+#### Lance base file format via DataFrame
+
+Set `hoodie.table.base.file.format=lance` on the write options:
+
+```python
+(df.write
+   .format("hudi")
+   .option("hoodie.table.name", "my_ai_table")
+   .option("hoodie.datasource.write.recordkey.field", "id")
+   .option("hoodie.record.merger.impls",
+           "org.apache.hudi.DefaultSparkRecordMerger")
+   .option("hoodie.table.base.file.format", "lance")
+   .mode("overwrite")
+   .save("/path/to/my_ai_table"))
+```
+
+See [Storage Layouts → Lance](storage_layouts.md#lance-base-file-format) for 
full Lance behavior
+and configs.
+
 ### Concurrency Control
 
 Following is an example of how to use `optimistic_concurrency_control` via 
Spark DataSource API.
@@ -467,6 +598,8 @@ The following advanced storage configuration options were 
added in Hudi 1.2.0:
 | Config | Default | Description |
 |---|---|---|
 | `hoodie.parquet.write.config.injector.class` | (none) | Fully-qualified 
class name of a custom `HoodieParquetConfigInjector` implementation. Use this 
to inject custom Parquet writer properties (e.g., disable dictionary encoding, 
set bloom filter sizes) without modifying the Hudi source. The implementing 
class must implement `org.apache.hudi.io.HoodieParquetConfigInjector`. |
+| `hoodie.table.base.file.format` | `parquet` | Base file format for the 
table. Accepts `parquet`, `orc`, `hfile`, or `lance`. See [Storage Layouts → 
Lance](storage_layouts.md#lance-base-file-format) for the Lance-specific 
options. |
+
 
 ## Java Client
 We can use plain java to write to hudi tables. To use Java client we can 
refere 
[here](https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-java/src/main/java/org/apache/hudi/examples/java/HoodieJavaWriteClientExample.java)
diff --git a/website/docusaurus.config.js b/website/docusaurus.config.js
index f0e054a7c1e4..48d9028f3659 100644
--- a/website/docusaurus.config.js
+++ b/website/docusaurus.config.js
@@ -144,6 +144,10 @@ module.exports = {
             from: ["/docs/next/indexing"],
             to: "/docs/next/indexes",
           },
+          {
+            from: ["/docs/next/ai-quick-start-guide"],
+            to: "/docs/next/unstructured-data-quick-start-guide",
+          },
           {
             from: ["/docs/releases", "/docs/next/releases"],
             to: "/releases/release-1.1",
diff --git a/website/sidebars.js b/website/sidebars.js
index b4e8379b571b..36cffe3aead2 100644
--- a/website/sidebars.js
+++ b/website/sidebars.js
@@ -16,23 +16,12 @@ module.exports = {
                 'quick-start-guide',
                 'flink-quick-start-guide',
                 'python-rust-quick-start-guide',
-                'ai-quick-start-guide',
+                'unstructured-data-quick-start-guide',
                 'docker_demo',
                 "notebooks",
                 'use_cases',
             ],
         },
-        {
-            type: 'category',
-            label: 'AI & Unstructured Data',
-            items: [
-                'ai_overview',
-                'vector_search',
-                'blob_unstructured_data',
-                'variant_type',
-                'lance_file_format',
-            ],
-        },
         {
             type: 'category',
             label: 'Design & Concepts',

(hudi) branch asf-site updated: docs: Restructure and revise 1.2.0 Unstructured Data pages (#18876)

Reply via email to