This is an automated email from the ASF dual-hosted git repository.
voonhous pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/master by this push:
new 95be21048d69 docs(examples): pin blob.inline.mode=CONTENT after Lance
default flip (#18823)
95be21048d69 is described below
commit 95be21048d6980bbab1806aff46d62205f5d6be1
Author: Rahil C <[email protected]>
AuthorDate: Wed Jun 3 08:12:48 2026 -0700
docs(examples): pin blob.inline.mode=CONTENT after Lance default flip
(#18823)
* docs(examples): pin blob.inline.mode=CONTENT after Lance default flip
apache/hudi#18744 flipped Lance's default for `hoodie.read.blob.inline.mode`
to DESCRIPTOR and added a BatchedBlobReader guard that raises rather than
silently returning null when `read_blob()` runs against an INLINE row under
DESCRIPTOR mode. The vector_blob_demo blob-reader script and notebook were
relying on the prior implicit-CONTENT default for their `read_blob()`
resolve-view load, which now fails on Lance.
- hudi_blob_reader_demo.py / notebooks/01_blob_reader.ipynb: scope CONTENT
explicitly on the resolve-view reader (mirrors how show_descriptors()
already scopes its own mode per-load).
- notebooks/00_main_demo.ipynb: set CONTENT once on the SparkSession so the
notebook's "flip the DDL to lance" instruction continues to work.
- README.md + create_spark() comment: explain the Parquet/Lance default
split and reference apache/hudi#18744.
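
The scoping rule described above can be modelled in plain Python, without
Spark. This is an illustrative sketch only, not Hudi's implementation:
`IMPLICIT_DEFAULTS`, `effective_inline_mode`, and `read_blob_guard` are
hypothetical names, and Python's `RuntimeError` stands in for the
`IllegalStateException` that the README change in this commit attributes to
the `BatchedBlobReader` guard.

```python
from typing import Optional

# Implicit per-format defaults for hoodie.read.blob.inline.mode, as described
# in this commit (Parquet: CONTENT; Lance: DESCRIPTOR after apache/hudi#18744).
IMPLICIT_DEFAULTS = {"parquet": "CONTENT", "lance": "DESCRIPTOR"}


def effective_inline_mode(base_file_format: str,
                          explicit_mode: Optional[str] = None) -> str:
    """Return the explicit mode if one was set, else the format's implicit default."""
    if explicit_mode is not None:
        return explicit_mode.upper()
    return IMPLICIT_DEFAULTS[base_file_format.lower()]


def read_blob_guard(storage_type: str, inline_mode: str) -> None:
    """Hypothetical stand-in for the guard: fail loudly instead of returning null."""
    if storage_type == "INLINE" and inline_mode == "DESCRIPTOR":
        raise RuntimeError("read_blob() on an INLINE row requires CONTENT mode")


# Relying on the implicit default passes on Parquet but trips the guard on Lance.
read_blob_guard("INLINE", effective_inline_mode("parquet"))
try:
    read_blob_guard("INLINE", effective_inline_mode("lance"))
    lance_implicit_ok = True
except RuntimeError:
    lance_implicit_ok = False

# Pinning CONTENT explicitly behaves the same on both formats.
for fmt in ("parquet", "lance"):
    read_blob_guard("INLINE", effective_inline_mode(fmt, "CONTENT"))
```

In the demos themselves, the explicit mode is supplied either per-load with
`.option("hoodie.read.blob.inline.mode", "CONTENT")` on the reader or once
with `.config("hoodie.read.blob.inline.mode", "CONTENT")` on the SparkSession.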
Verified with ./run_demos.sh against the 1.2.0-rc2 staging bundle
(all 10 parquet/lance × blob_reader/sql/dataframe combos green).
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
* docs(examples): switch demos from 1.2.0-rc1 staging jar to official 1.2.0
Maven Central release
Replace all references to the Apache Nexus staging URL
(orgapachehudi-1176/1.2.0-rc1) with the official Maven Central URL for
hudi-spark3.5-bundle_2.12-1.2.0.jar across all three Python scripts, four
notebooks, and both READMEs.
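
For reference, the Maven Central URL follows the standard repository layout
and can be derived from the artifact coordinates. The helper below is a
minimal sketch; `maven_central_jar_url` is a hypothetical name and is not part
of the demo scripts, which hard-code the URL in comments and error messages.

```python
def maven_central_jar_url(group_id: str, artifact_id: str, version: str) -> str:
    """Build the Maven Central download URL for a jar from its coordinates."""
    # The group id maps to a directory path: org.apache.hudi -> org/apache/hudi
    group_path = group_id.replace(".", "/")
    return (
        f"https://repo1.maven.org/maven2/{group_path}/"
        f"{artifact_id}/{version}/{artifact_id}-{version}.jar"
    )


url = maven_central_jar_url("org.apache.hudi", "hudi-spark3.5-bundle_2.12", "1.2.0")
print(url)
```

The resulting URL is the one to pass to `curl -L -o` when downloading the
release jar into `~/Downloads/`.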
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
---
.../src/test/python/vector_blob_demo/README.md | 57 +++++++++++++---------
.../vector_blob_demo/hudi_blob_reader_demo.py | 50 +++++++++++--------
.../hudi_dataframe_vector_blob_demo.py | 16 +++---
.../vector_blob_demo/hudi_sql_vector_blob_demo.py | 16 +++---
.../vector_blob_demo/notebooks/00_main_demo.ipynb | 9 +++-
.../notebooks/01_blob_reader.ipynb | 45 ++++++++++-------
.../notebooks/02_sql_vector_search.ipynb | 14 +++---
.../notebooks/03_dataframe_vector_search.ipynb | 10 ++--
.../python/vector_blob_demo/notebooks/README.md | 2 +-
9 files changed, 128 insertions(+), 91 deletions(-)
diff --git
a/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/README.md
b/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/README.md
index 11bbd11abd5c..ed1f0bdc94a5 100644
---
a/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/README.md
+++
b/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/README.md
@@ -71,28 +71,22 @@ See [`notebooks/README.md`](notebooks/README.md) for setup
details.
- Java 11
- Python **3.12** (PySpark 3.5 does NOT support Python 3.13/3.14)
-- Hudi Spark bundle (Apache 1.2.0-rc1 staging jar, or build from source)
+- Hudi Spark bundle (Apache Hudi 1.2.0 release jar, or build from source)
- Lance Spark bundle jar
## 1. Get the Hudi bundle
The scripts default `HUDI_BUNDLE_JAR` to
-`~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar`, so you can drop the
-Apache 1.2.0-rc1 staging jar there and skip exporting anything.
+`~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0.jar`, so you can drop the
+Apache Hudi 1.2.0 release jar there and skip exporting anything.
-**Option A — Download the rc1 staging jar (recommended; no build required):**
+**Option A — Download the 1.2.0 release jar from Maven Central (recommended;
no build required):**
```bash
-curl -L -o ~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar \
-
https://repository.apache.org/content/repositories/orgapachehudi-1176/org/apache/hudi/hudi-spark3.5-bundle_2.12/1.2.0-rc1/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar
+curl -L -o ~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0.jar \
+
https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.5-bundle_2.12/1.2.0/hudi-spark3.5-bundle_2.12-1.2.0.jar
```
-This is the exact jar published to Apache's Nexus staging repo for the
-1.2.0-rc1 vote — running the demo against it doubles as smoke-testing the
-release candidate. Note: the staging URL (`orgapachehudi-1176`) rolls forward
-each RC; if you're reading this after rc1 closes, find the current staging
-repo at <https://repository.apache.org/#stagingRepositories>.
-
**Option B — Build from source:**
```bash
@@ -191,7 +185,7 @@ etc.) with similarity scores in the 0.3–0.5 range at N=100,
tighter at N=1000.
| Var | Default | Purpose |
|---|---|---|
-| `HUDI_BUNDLE_JAR` | `~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar`
(Apache 1.2.0-rc1 staging jar) | Hudi spark bundle. Override to point at a
locally built `*-SNAPSHOT.jar` if you go the Option B route. |
+| `HUDI_BUNDLE_JAR` | `~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0.jar`
(Apache Hudi 1.2.0 release jar) | Hudi spark bundle. Override to point at a
locally built `*-SNAPSHOT.jar` if you go the Option B route. |
| `LANCE_BUNDLE_JAR` | `~/Downloads/lance-spark-bundle-3.5_2.12-0.4.0.jar`
(Maven Central) | Lance spark bundle. Used only when
`HUDI_BASE_FILE_FORMAT=lance`; Parquet runs skip it entirely. |
| `HUDI_BASE_FILE_FORMAT` | `lance` | Set to `parquet` to write Parquet base
files instead |
| `HUDI_BLOB_MODE` | `out_of_line` | Blob reader demo only. Set to `inline` to
embed PNG bytes directly in the Hudi table (no external container file) |
@@ -222,11 +216,21 @@ Look for:
`hoodie.read.blob.inline.mode` controls how INLINE blobs come back:
-- `CONTENT` (default) — `image_bytes.data` returns the raw bytes directly.
+- `CONTENT` — `image_bytes.data` returns the raw bytes directly.
- `DESCRIPTOR` — `image_bytes.data` is null; `image_bytes.reference.*` is
synthesized to point at the underlying base file (`.lance` for Lance
base files), and `read_blob(image_bytes)` materializes bytes lazily.
+Per-format **implicit defaults** as of Hudi 1.2.0
+([apache/hudi#18744](https://github.com/apache/hudi/pull/18744)): Parquet
+defaults to `CONTENT`, Lance defaults to `DESCRIPTOR`. The same release
+also added a strict guard in `BatchedBlobReader` that raises
+`IllegalStateException` if `read_blob()` is called on an INLINE row whose
+load is in DESCRIPTOR mode — what used to be a silent null is now a hard
+failure. Demos that target both formats therefore set the mode explicitly
+(see the SQL and DataFrame demos for the session-level pattern; the blob
+reader demo scopes it per-load).
+
The blob reader demo exposes this via `HUDI_INLINE_READ_MODE`:
```bash
@@ -238,12 +242,14 @@ HUDI_BLOB_MODE=inline HUDI_INLINE_READ_MODE=descriptor
python hudi_blob_reader_d
```
**Important wiring detail (matches
`TestLanceDataSource.testBlobInlineDescriptorMode`):**
-the `DESCRIPTOR` option is scoped to a single per-load read in
-`show_descriptors()`; `read_blob_and_save()` uses a separate default-mode
-load so `read_blob()` can actually materialize bytes. Setting
-`hoodie.read.blob.inline.mode=DESCRIPTOR` at the SparkSession level would
-make every read return `data=null`, including the read backing `read_blob()`,
-so it would also return null.
+the option is scoped per-load — `show_descriptors()` sets the user-selected
+mode on its inspection view, and `read_blob_and_save()` sets `CONTENT`
+explicitly on its own reader. The latter cannot rely on the format's
+implicit default because Lance's flipped to `DESCRIPTOR` in 1.2.0
+(see above); on that format, the new `BatchedBlobReader` guard would turn
+the silent-null behavior of older releases into a hard `IllegalStateException`.
+Setting `hoodie.read.blob.inline.mode=DESCRIPTOR` at the SparkSession level
+would similarly poison every read, including the one backing `read_blob()`.
The setting is a no-op for `HUDI_BLOB_MODE=out_of_line` — those rows are
already descriptors (no inline bytes to suppress); `read_blob()` always
@@ -462,8 +468,15 @@ Every Spark config line has a purpose:
`hoodie.read.blob.inline.mode` is intentionally **not** set on the session —
the blob reader demo scopes it per-load (see "Switching BLOB read mode"
-above) so that `read_blob()` can run against a default-mode load and
-materialize bytes.
+above): `show_descriptors()` picks up the user-chosen mode for its
+inspection view, and `read_blob_and_save()` opts into `CONTENT` explicitly
+on its own reader. Explicit is required because Lance's implicit default
+flipped to `DESCRIPTOR` in
+[apache/hudi#18744](https://github.com/apache/hudi/pull/18744) and the
+new `BatchedBlobReader` guard raises if `read_blob()` runs against a
+descriptor-mode load. The SQL and DataFrame demos take the simpler route
+of setting `CONTENT` once on the session (they don't need DESCRIPTOR
+anywhere in their flow).
### Section 2 — `load_dataset()`
diff --git
a/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/hudi_blob_reader_demo.py
b/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/hudi_blob_reader_demo.py
index 8e14296c6b4d..731098545602 100644
---
a/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/hudi_blob_reader_demo.py
+++
b/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/hudi_blob_reader_demo.py
@@ -36,7 +36,7 @@ shows INLINE blobs + vector search); this one shows
OUT_OF_LINE blobs +
`read_blob()`.
Env vars (shares the same conventions as the other demos):
- HUDI_BUNDLE_JAR (defaults to
~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar)
+ HUDI_BUNDLE_JAR (defaults to
~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0.jar)
HUDI_BASE_FILE_FORMAT (default 'lance'; set to 'parquet' to use Parquet)
LANCE_BUNDLE_JAR (defaults to
~/Downloads/lance-spark-bundle-3.5_2.12-0.4.0.jar; only used when
HUDI_BASE_FILE_FORMAT=lance)
HUDI_BLOB_MODE (default 'out_of_line'; 'inline' stores bytes in the
Hudi table)
@@ -117,11 +117,11 @@ def ensure_dir(p: Path) -> None:
def default_hudi_bundle_jar() -> str:
- # Defaults to the Apache 1.2.0-rc1 staging jar in ~/Downloads/. Grab it
with:
- # curl -L -o ~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar \
- #
https://repository.apache.org/content/repositories/orgapachehudi-1176/org/apache/hudi/hudi-spark3.5-bundle_2.12/1.2.0-rc1/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar
+ # Defaults to the Apache Hudi 1.2.0 release jar in ~/Downloads/. Grab it
with:
+ # curl -L -o ~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0.jar \
+ #
https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.5-bundle_2.12/1.2.0/hudi-spark3.5-bundle_2.12-1.2.0.jar
# Override via HUDI_BUNDLE_JAR=/abs/path/to/jar to point at a locally
built bundle.
- return str(Path.home() / "Downloads" /
"hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar")
+ return str(Path.home() / "Downloads" /
"hudi-spark3.5-bundle_2.12-1.2.0.jar")
def default_lance_bundle_jar() -> str:
@@ -137,9 +137,9 @@ def resolve_jars() -> str:
if not Path(hudi_jar).is_file():
sys.exit(
f"ERROR: HUDI_BUNDLE_JAR does not exist at {hudi_jar}\n"
- "Download the Apache 1.2.0-rc1 staging jar with:\n"
- " curl -L -o ~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar
\\\n"
- "
https://repository.apache.org/content/repositories/orgapachehudi-1176/org/apache/hudi/hudi-spark3.5-bundle_2.12/1.2.0-rc1/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar\n"
+ "Download the Apache Hudi 1.2.0 release jar with:\n"
+ " curl -L -o ~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0.jar \\\n"
+ "
https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.5-bundle_2.12/1.2.0/hudi-spark3.5-bundle_2.12-1.2.0.jar\n"
"or set HUDI_BUNDLE_JAR=/abs/path/to/locally-built.jar."
)
@@ -179,13 +179,14 @@ def create_spark() -> SparkSession:
"org.apache.spark.sql.hudi.catalog.HoodieCatalog",
)
.config("spark.sql.session.timeZone", "UTC")
- # NOTE: `hoodie.read.blob.inline.mode` is intentionally NOT set on the
SparkSession.
- # If it were, EVERY hudi load — including the one read_blob() runs
internally —
- # would suppress INLINE bytes, and read_blob() would return null.
Instead we scope
- # the DESCRIPTOR option per-load in show_descriptors() so read_blob()
in
- # read_blob_and_save() runs against a default-mode (CONTENT) load and
can
- # materialize bytes. See
TestLanceDataSource.testBlobInlineDescriptorMode for the
- # canonical pattern.
+ # NOTE: `hoodie.read.blob.inline.mode` is intentionally NOT set on the
SparkSession —
+ # session-wide DESCRIPTOR would suppress INLINE bytes on every load,
including the
+ # one read_blob() runs internally. We scope the option per-load
instead:
+ # show_descriptors() picks up the user-selected mode for its
inspection view, and
+ # read_blob_and_save() explicitly sets CONTENT on its own reader (it
cannot rely on
+ # the format's implicit default — apache/hudi#18744 flipped Lance to
default to
+ # DESCRIPTOR in 1.2.0, while Parquet stays at CONTENT). See
+ # TestLanceDataSource.testBlobInlineDescriptorMode for the canonical
pattern.
.config("spark.default.parallelism", "2")
.config("spark.sql.shuffle.partitions", "2")
)
@@ -474,12 +475,19 @@ def read_blob_and_save(spark: SparkSession):
f"(works regardless of inline_read_mode={CONFIG['inline_read_mode']}):"
)
- # IMPORTANT: register a fresh load WITHOUT the inline.mode option so the
underlying
- # read sees `data` populated (CONTENT mode). If we read from the
DESCRIPTORS_VIEW
- # registered in show_descriptors(), read_blob() would see data=null
because that
- # view was loaded in DESCRIPTOR mode — and BatchedBlobReader dispatches on
the row's
- # storage_type=INLINE before checking `reference`, so it would return null
bytes.
-
spark.read.format("hudi").load(CONFIG["table_path"]).createOrReplaceTempView(RESOLVE_VIEW)
+ # IMPORTANT: register a fresh load with
hoodie.read.blob.inline.mode=CONTENT so the
+ # underlying read sees `data` populated. Two things are at play:
+ # 1) Parquet's default for that option is CONTENT, but Lance's default
flipped to
+ # DESCRIPTOR in 1.2.0 (apache/hudi#18744) — relying on the implicit
default
+ # worked for Parquet but silently returned null bytes on Lance.
+ # 2) As of 1.2.0, BatchedBlobReader also raises IllegalStateException
when
+ # read_blob() is invoked on an INLINE row under DESCRIPTOR mode, so
the silent
+ # null is now a hard failure. Setting CONTENT explicitly here works
on both
+ # formats and survives any future default changes.
+ (spark.read.format("hudi")
+ .option("hoodie.read.blob.inline.mode", "CONTENT")
+ .load(CONFIG["table_path"])
+ .createOrReplaceTempView(RESOLVE_VIEW))
sql = f"""
SELECT image_id,
diff --git
a/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/hudi_dataframe_vector_blob_demo.py
b/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/hudi_dataframe_vector_blob_demo.py
index 5112d9caebb6..0d7ebea986c6 100644
---
a/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/hudi_dataframe_vector_blob_demo.py
+++
b/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/hudi_dataframe_vector_blob_demo.py
@@ -28,7 +28,7 @@ End-to-end flow:
5) Save the query image, top-K neighbors, and a combined panel figure.
Env vars:
- HUDI_BUNDLE_JAR (defaults to
~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar)
+ HUDI_BUNDLE_JAR (defaults to
~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0.jar)
HUDI_BASE_FILE_FORMAT (default 'lance'; set to 'parquet' to use Parquet
base files)
LANCE_BUNDLE_JAR (defaults to
~/Downloads/lance-spark-bundle-3.5_2.12-0.4.0.jar;
only used when HUDI_BASE_FILE_FORMAT=lance)
@@ -126,11 +126,11 @@ def save_png_bytes(img_bytes: bytes, path: Path) -> None:
def default_hudi_bundle_jar() -> str:
- # Defaults to the Apache 1.2.0-rc1 staging jar in ~/Downloads/. Grab it
with:
- # curl -L -o ~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar \
- #
https://repository.apache.org/content/repositories/orgapachehudi-1176/org/apache/hudi/hudi-spark3.5-bundle_2.12/1.2.0-rc1/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar
+ # Defaults to the Apache Hudi 1.2.0 release jar in ~/Downloads/. Grab it
with:
+ # curl -L -o ~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0.jar \
+ #
https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.5-bundle_2.12/1.2.0/hudi-spark3.5-bundle_2.12-1.2.0.jar
# Override via HUDI_BUNDLE_JAR=/abs/path/to/jar to point at a locally
built bundle.
- return str(Path.home() / "Downloads" /
"hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar")
+ return str(Path.home() / "Downloads" /
"hudi-spark3.5-bundle_2.12-1.2.0.jar")
def default_lance_bundle_jar() -> str:
@@ -146,9 +146,9 @@ def resolve_jars() -> str:
if not Path(hudi_jar).is_file():
sys.exit(
f"ERROR: HUDI_BUNDLE_JAR does not exist at {hudi_jar}\n"
- "Download the Apache 1.2.0-rc1 staging jar with:\n"
- " curl -L -o ~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar
\\\n"
- "
https://repository.apache.org/content/repositories/orgapachehudi-1176/org/apache/hudi/hudi-spark3.5-bundle_2.12/1.2.0-rc1/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar\n"
+ "Download the Apache Hudi 1.2.0 release jar with:\n"
+ " curl -L -o ~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0.jar \\\n"
+ "
https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.5-bundle_2.12/1.2.0/hudi-spark3.5-bundle_2.12-1.2.0.jar\n"
"or set HUDI_BUNDLE_JAR=/abs/path/to/locally-built.jar."
)
diff --git
a/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/hudi_sql_vector_blob_demo.py
b/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/hudi_sql_vector_blob_demo.py
index c36bd0cad56d..f4ff6b9316d8 100644
---
a/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/hudi_sql_vector_blob_demo.py
+++
b/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/hudi_sql_vector_blob_demo.py
@@ -29,7 +29,7 @@ Image loading (torchvision) and embedding generation (timm)
stay in Python —
those cannot be SQL. The bridge between the two is a Spark temp view.
Env vars (same as the DataFrame variant):
- HUDI_BUNDLE_JAR (defaults to
~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar)
+ HUDI_BUNDLE_JAR (defaults to
~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0.jar)
HUDI_BASE_FILE_FORMAT (default 'lance'; set to 'parquet' to use Parquet)
LANCE_BUNDLE_JAR (defaults to
~/Downloads/lance-spark-bundle-3.5_2.12-0.4.0.jar; only used when
HUDI_BASE_FILE_FORMAT=lance)
HUDI_LANCE_DEMO_N (default 1000; number of images to ingest)
@@ -112,11 +112,11 @@ def save_png_bytes(img_bytes: bytes, path: Path) -> None:
def default_hudi_bundle_jar() -> str:
- # Defaults to the Apache 1.2.0-rc1 staging jar in ~/Downloads/. Grab it
with:
- # curl -L -o ~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar \
- #
https://repository.apache.org/content/repositories/orgapachehudi-1176/org/apache/hudi/hudi-spark3.5-bundle_2.12/1.2.0-rc1/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar
+ # Defaults to the Apache Hudi 1.2.0 release jar in ~/Downloads/. Grab it
with:
+ # curl -L -o ~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0.jar \
+ #
https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.5-bundle_2.12/1.2.0/hudi-spark3.5-bundle_2.12-1.2.0.jar
# Override via HUDI_BUNDLE_JAR=/abs/path/to/jar to point at a locally
built bundle.
- return str(Path.home() / "Downloads" /
"hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar")
+ return str(Path.home() / "Downloads" /
"hudi-spark3.5-bundle_2.12-1.2.0.jar")
def default_lance_bundle_jar() -> str:
@@ -132,9 +132,9 @@ def resolve_jars() -> str:
if not Path(hudi_jar).is_file():
sys.exit(
f"ERROR: HUDI_BUNDLE_JAR does not exist at {hudi_jar}\n"
- "Download the Apache 1.2.0-rc1 staging jar with:\n"
- " curl -L -o ~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar
\\\n"
- "
https://repository.apache.org/content/repositories/orgapachehudi-1176/org/apache/hudi/hudi-spark3.5-bundle_2.12/1.2.0-rc1/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar\n"
+ "Download the Apache Hudi 1.2.0 release jar with:\n"
+ " curl -L -o ~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0.jar \\\n"
+ "
https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.5-bundle_2.12/1.2.0/hudi-spark3.5-bundle_2.12-1.2.0.jar\n"
"or set HUDI_BUNDLE_JAR=/abs/path/to/locally-built.jar."
)
diff --git
a/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/00_main_demo.ipynb
b/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/00_main_demo.ipynb
index b9e68320fb67..c2dc058d9d5b 100644
---
a/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/00_main_demo.ipynb
+++
b/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/00_main_demo.ipynb
@@ -27,7 +27,7 @@
"source": [
"## 1. Setup\n",
"\n",
- "Boots Spark with the Hudi rc1 + Lance bundles from `~/Downloads/`, wipes
`/tmp/` paths from prior runs."
+ "Boots Spark with the Hudi 1.2.0 + Lance bundles from `~/Downloads/`,
wipes `/tmp/` paths from prior runs."
]
},
{
@@ -105,7 +105,7 @@
"# === Resolve jars (defaults to ~/Downloads/) ===\n",
"def _default_jar(name): return str(Path.home() / \"Downloads\" / name)\n",
"\n",
- "HUDI_JAR = os.getenv(\"HUDI_BUNDLE_JAR\",
_default_jar(\"hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar\"))\n",
+ "HUDI_JAR = os.getenv(\"HUDI_BUNDLE_JAR\",
_default_jar(\"hudi-spark3.5-bundle_2.12-1.2.0.jar\"))\n",
"LANCE_JAR = os.getenv(\"LANCE_BUNDLE_JAR\",
_default_jar(\"lance-spark-bundle-3.5_2.12-0.4.0.jar\"))\n",
"for jar in (HUDI_JAR, LANCE_JAR):\n",
" if not Path(jar).is_file():\n",
@@ -120,6 +120,11 @@
" .config(\"spark.sql.extensions\",
\"org.apache.spark.sql.hudi.HoodieSparkSessionExtension\")\n",
" .config(\"spark.sql.catalog.spark_catalog\",
\"org.apache.spark.sql.hudi.catalog.HoodieCatalog\")\n",
" .config(\"spark.sql.session.timeZone\", \"UTC\")\n",
+ " # Lance flipped its default for hoodie.read.blob.inline.mode to
DESCRIPTOR\n",
+ " # in apache/hudi#18744 (1.2.0); Parquet still defaults to CONTENT.\n",
+ " # Pinning CONTENT session-wide keeps read_blob() and
image_bytes.data\n",
+ " # working regardless of which base file format the DDL ends up
using.\n",
+ " .config(\"hoodie.read.blob.inline.mode\", \"CONTENT\")\n",
" .config(\"spark.default.parallelism\", \"2\")\n",
" .config(\"spark.sql.shuffle.partitions\", \"2\")\n",
" .config(\"spark.ui.showConsoleProgress\", \"false\")\n",
diff --git
a/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/01_blob_reader.ipynb
b/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/01_blob_reader.ipynb
index ea550c35d78f..952911426a5e 100644
---
a/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/01_blob_reader.ipynb
+++
b/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/01_blob_reader.ipynb
@@ -226,11 +226,11 @@
"from pathlib import Path\n",
"\n",
"def default_hudi_bundle_jar() -> str:\n",
- " # Defaults to the Apache 1.2.0-rc1 staging jar in ~/Downloads/. Grab
it with:\n",
- " # curl -L -o ~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar
\\\n",
- " #
https://repository.apache.org/content/repositories/orgapachehudi-1176/org/apache/hudi/hudi-spark3.5-bundle_2.12/1.2.0-rc1/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar\n",
+ " # Defaults to the Apache Hudi 1.2.0 release jar in ~/Downloads/. Grab
it with:\n",
+ " # curl -L -o ~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0.jar \\\n",
+ " #
https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.5-bundle_2.12/1.2.0/hudi-spark3.5-bundle_2.12-1.2.0.jar\n",
" # Override via HUDI_BUNDLE_JAR env var to point at a locally built
bundle.\n",
- " return str(Path.home() / \"Downloads\" /
\"hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar\")\n",
+ " return str(Path.home() / \"Downloads\" /
\"hudi-spark3.5-bundle_2.12-1.2.0.jar\")\n",
"\n",
"def default_lance_bundle_jar() -> str:\n",
" # Defaults to the Maven Central Lance 0.4.0 jar in ~/Downloads/. Grab
it with:\n",
@@ -243,7 +243,7 @@
" if not Path(hudi_jar).is_file():\n",
" sys.exit(\n",
" f\"ERROR: HUDI_BUNDLE_JAR does not exist at
{hudi_jar}\\n\"\n",
- " \"Download the Apache 1.2.0-rc1 staging jar to ~/Downloads/
\"\n",
+ " \"Download the Apache Hudi 1.2.0 release jar to ~/Downloads/
\"\n",
" \"or set HUDI_BUNDLE_JAR=/abs/path/to/locally-built.jar \"\n",
" \"before launching jupyter.\"\n",
" )\n",
@@ -295,11 +295,15 @@
],
"source": [
"# Note: `hoodie.read.blob.inline.mode` is intentionally NOT set on the\n",
- "# SparkSession. If it were, every Hudi load — including the one\n",
- "# `read_blob()` runs internally — would suppress INLINE bytes. Instead\n",
- "# the descriptor inspection step in cell 12 sets the option per-load,\n",
- "# and the `read_blob()` step in cell 13 runs against a separate\n",
- "# default-mode (CONTENT) load. See\n",
+ "# SparkSession — session-wide DESCRIPTOR would suppress INLINE bytes
on\n",
+ "# every load, including the one `read_blob()` runs internally. The\n",
+ "# option is scoped per-load instead: the descriptor inspection step\n",
+ "# (cell 13) picks up the user-chosen mode, and the `read_blob()` step\n",
+ "# (cell 14) explicitly sets CONTENT on its own reader. The explicit\n",
+ "# CONTENT is required because Lance's implicit default flipped to\n",
+ "# DESCRIPTOR in apache/hudi#18744 (1.2.0), while Parquet still\n",
+ "# defaults to CONTENT — and BatchedBlobReader now raises if read_blob\n",
+ "# runs against a descriptor-mode load. See\n",
"# `TestLanceDataSource.testBlobInlineDescriptorMode` for the canonical\n",
"# pattern.\n",
"jars = resolve_jars(CONFIG[\"base_file_format\"])\n",
@@ -677,12 +681,16 @@
"source": [
"## 14. `read_blob(image_bytes)` — materialize bytes on demand\n",
"\n",
- "Note the **separate, default-mode** Hudi load: `read_blob()`
dispatches\n",
- "on the row's `storage_type` and only consults `reference` for\n",
- "`OUT_OF_LINE`. For INLINE rows it reads the `data` field directly — so\n",
- "if the DESCRIPTORS_VIEW above (which suppresses bytes in DESCRIPTOR\n",
- "mode) were reused here, `read_blob()` would return null. Two views\n",
- "keeps both paths working."
+ "Note the **separate, explicit-CONTENT** Hudi load: `read_blob()`\n",
+ "dispatches on the row's `storage_type` and only consults `reference`\n",
+ "for `OUT_OF_LINE`. For INLINE rows it reads the `data` field directly\n",
+ "— so if the DESCRIPTORS_VIEW above (which suppresses bytes in\n",
+ "DESCRIPTOR mode) were reused here, `read_blob()` would either return\n",
+ "null (Parquet) or raise `IllegalStateException` (Lance, as of\n",
+ "[apache/hudi#18744](https://github.com/apache/hudi/pull/18744)).\n",
+ "Setting `hoodie.read.blob.inline.mode=CONTENT` explicitly here also\n",
+ "covers Lance's new DESCRIPTOR default — we don't rely on the\n",
+ "format's implicit default."
]
},
{
@@ -710,7 +718,10 @@
],
"source": [
"RESOLVE_VIEW = \"blob_resolve_view\"\n",
-
"spark.read.format(\"hudi\").load(CONFIG[\"table_path\"]).createOrReplaceTempView(RESOLVE_VIEW)\n",
+ "(spark.read.format(\"hudi\")\n",
+ " .option(\"hoodie.read.blob.inline.mode\", \"CONTENT\")\n",
+ " .load(CONFIG[\"table_path\"])\n",
+ " .createOrReplaceTempView(RESOLVE_VIEW))\n",
"\n",
"spark.sql(f\"\"\"\n",
" SELECT image_id,\n",
diff --git
a/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/02_sql_vector_search.ipynb
b/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/02_sql_vector_search.ipynb
index c41ae58c217f..8c451c75ad64 100644
---
a/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/02_sql_vector_search.ipynb
+++
b/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/02_sql_vector_search.ipynb
@@ -257,8 +257,8 @@
"bundle and (if `BASE_FILE_FORMAT == \"lance\"`) the Lance Spark
bundle.\n",
"\n",
"**Defaults:**\n",
- "- `HUDI_BUNDLE_JAR` →
`~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar`\n",
- " (Apache 1.2.0-rc1 staging jar)\n",
+ "- `HUDI_BUNDLE_JAR` →
`~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0.jar`\n",
+ " (Apache Hudi 1.2.0 release jar)\n",
"- `LANCE_BUNDLE_JAR` →
`~/Downloads/lance-spark-bundle-3.5_2.12-0.4.0.jar`\n",
" (Maven Central)\n",
"\n",
@@ -279,11 +279,11 @@
"from pathlib import Path\n",
"\n",
"def default_hudi_bundle_jar() -> str:\n",
- " # Defaults to the Apache 1.2.0-rc1 staging jar in ~/Downloads/. Grab
it with:\n",
- " # curl -L -o ~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar
\\\n",
- " #
https://repository.apache.org/content/repositories/orgapachehudi-1176/org/apache/hudi/hudi-spark3.5-bundle_2.12/1.2.0-rc1/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar\n",
+ " # Defaults to the Apache Hudi 1.2.0 release jar in ~/Downloads/. Grab
it with:\n",
+ " # curl -L -o ~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0.jar \\\n",
+ " #
https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.5-bundle_2.12/1.2.0/hudi-spark3.5-bundle_2.12-1.2.0.jar\n",
" # Override via HUDI_BUNDLE_JAR env var to point at a locally built
bundle.\n",
- " return str(Path.home() / \"Downloads\" /
\"hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar\")\n",
+ " return str(Path.home() / \"Downloads\" /
\"hudi-spark3.5-bundle_2.12-1.2.0.jar\")\n",
"\n",
"def default_lance_bundle_jar() -> str:\n",
" # Defaults to the Maven Central Lance 0.4.0 jar in ~/Downloads/. Grab
it with:\n",
@@ -296,7 +296,7 @@
" if not Path(hudi_jar).is_file():\n",
" sys.exit(\n",
" f\"ERROR: HUDI_BUNDLE_JAR does not exist at
{hudi_jar}\\n\"\n",
- " \"Download the Apache 1.2.0-rc1 staging jar to ~/Downloads/
\"\n",
+ " \"Download the Apache Hudi 1.2.0 release jar to ~/Downloads/
\"\n",
" \"or set HUDI_BUNDLE_JAR=/abs/path/to/locally-built.jar \"\n",
" \"before launching jupyter.\"\n",
" )\n",
diff --git
a/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/03_dataframe_vector_search.ipynb
b/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/03_dataframe_vector_search.ipynb
index d4f94953a204..8d2e4046e2b0 100644
---
a/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/03_dataframe_vector_search.ipynb
+++
b/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/03_dataframe_vector_search.ipynb
@@ -197,11 +197,11 @@
"from pathlib import Path\n",
"\n",
"def default_hudi_bundle_jar() -> str:\n",
- " # Defaults to the Apache 1.2.0-rc1 staging jar in ~/Downloads/. Grab
it with:\n",
- " # curl -L -o ~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar
\\\n",
- " #
https://repository.apache.org/content/repositories/orgapachehudi-1176/org/apache/hudi/hudi-spark3.5-bundle_2.12/1.2.0-rc1/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar\n",
+ " # Defaults to the Apache Hudi 1.2.0 release jar in ~/Downloads/. Grab
it with:\n",
+ " # curl -L -o ~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0.jar \\\n",
+ " #
https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.5-bundle_2.12/1.2.0/hudi-spark3.5-bundle_2.12-1.2.0.jar\n",
" # Override via HUDI_BUNDLE_JAR env var to point at a locally built
bundle.\n",
- " return str(Path.home() / \"Downloads\" /
\"hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar\")\n",
+ " return str(Path.home() / \"Downloads\" /
\"hudi-spark3.5-bundle_2.12-1.2.0.jar\")\n",
"\n",
"def default_lance_bundle_jar() -> str:\n",
" # Defaults to the Maven Central Lance 0.4.0 jar in ~/Downloads/. Grab
it with:\n",
@@ -214,7 +214,7 @@
" if not Path(hudi_jar).is_file():\n",
" sys.exit(\n",
" f\"ERROR: HUDI_BUNDLE_JAR does not exist at
{hudi_jar}\\n\"\n",
- " \"Download the Apache 1.2.0-rc1 staging jar to ~/Downloads/
\"\n",
+ " \"Download the Apache Hudi 1.2.0 release jar to ~/Downloads/
\"\n",
" \"or set HUDI_BUNDLE_JAR=/abs/path/to/locally-built.jar \"\n",
" \"before launching jupyter.\"\n",
" )\n",
diff --git
a/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/README.md
b/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/README.md
index e6d66089eb73..e72fe7058b1b 100644
---
a/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/README.md
+++
b/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/README.md
@@ -37,7 +37,7 @@ pip install -r requirements.txt # adds jupyter + ipykernel
for this folder
```
The notebooks default `HUDI_BUNDLE_JAR` to
-`~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0-rc1.jar` and `LANCE_BUNDLE_JAR`
+`~/Downloads/hudi-spark3.5-bundle_2.12-1.2.0.jar` and `LANCE_BUNDLE_JAR`
to `~/Downloads/lance-spark-bundle-3.5_2.12-0.4.0.jar`, matching the `.py`
scripts. If you placed both jars in `~/Downloads/` per the parent
[`README.md`](../README.md) §1–2, you don't need to export anything. To