This is an automated email from the ASF dual-hosted git repository.
jongyoul pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/zeppelin.git
The following commit(s) were added to refs/heads/master by this push:
new 3e470121a0 [ZEPPELIN-6411] Semantic search for Zeppelin
3e470121a0 is described below
commit 3e470121a007616069b6a3c15ea338e450c6f54f
Author: Kalyan <[email protected]>
AuthorDate: Mon Jun 1 02:03:39 2026 -0700
[ZEPPELIN-6411] Semantic search for Zeppelin
### What is this PR for?
Added `EmbeddingSearch` — a new `SearchService` implementation that
enables natural language search across Zeppelin notebooks using ONNX-based
sentence embeddings (all-MiniLM-L6-v2).
Disabled by default, enabled with `zeppelin.search.semantic.enable = true`.
**The problem**:
Zeppelin's built-in search uses Lucene's keyword matching, which works well
for exact terms but falls short for the way analysts actually search.
A user looking for "yesterday's spending" gets zero results, even
though their notebooks contain
`SELECT sum(cost) WHERE date = current_date - interval '1' day`.
The words don't match, so Lucene can't find it.
This PR adds EmbeddingSearch, an alternative SearchService that uses
sentence embeddings (all-MiniLM-L6-v2 via ONNX Runtime) to match by meaning
instead of keywords. It runs entirely in-process with no external
services required.
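
To make "match by meaning" concrete, here is a minimal sketch of the two numeric steps named in the Todos (mean pooling and cosine similarity). It is not the code from `EmbeddingSearch.java`: the class name `EmbeddingMath`, the method names, and the toy 2-dimensional vectors are assumptions for illustration, and the ONNX inference that would produce the per-token vectors is left out.

```java
/** Illustrative only: hypothetical helper, not the code in EmbeddingSearch.java. */
public class EmbeddingMath {

  /** Averages the token vectors whose attention-mask entry is non-zero, then L2-normalizes. */
  static float[] meanPool(float[][] tokenVectors, long[] attentionMask) {
    int dim = tokenVectors[0].length;
    float[] pooled = new float[dim];
    int count = 0;
    for (int t = 0; t < tokenVectors.length; t++) {
      if (attentionMask[t] == 0) {
        continue; // padding token: contributes nothing to the average
      }
      for (int d = 0; d < dim; d++) {
        pooled[d] += tokenVectors[t][d];
      }
      count++;
    }
    double norm = 0;
    for (int d = 0; d < dim; d++) {
      pooled[d] /= Math.max(count, 1);
      norm += pooled[d] * pooled[d];
    }
    norm = Math.sqrt(norm);
    for (int d = 0; d < dim; d++) {
      pooled[d] = norm == 0 ? 0f : (float) (pooled[d] / norm);
    }
    return pooled;
  }

  /** For unit-length vectors, cosine similarity reduces to a dot product. */
  static float cosine(float[] a, float[] b) {
    float dot = 0f;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
    }
    return dot;
  }

  public static void main(String[] args) {
    // Toy data: the third token of the query is padding and is ignored.
    float[] query = meanPool(new float[][] {{1f, 0f}, {1f, 1f}, {9f, 9f}}, new long[] {1, 1, 0});
    float[] near = meanPool(new float[][] {{2f, 1f}}, new long[] {1});
    float[] far = meanPool(new float[][] {{-1f, 2f}}, new long[] {1});
    System.out.println("near=" + cosine(query, near) + " far=" + cosine(query, far));
  }
}
```

Normalizing once at index time is what lets a brute-force scan stay a plain dot product per paragraph.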
Beyond semantic matching, EmbeddingSearch addresses other gaps in
notebook search:
- Indexes paragraph output — table results and text output become
searchable, not just the code
- Extracts SQL table names — FROM/JOIN references are extracted and used
to boost related paragraphs in a two-phase ranking
- Strips interpreter prefixes — %spark.sql, %python etc. are removed so
they don't pollute search results
- Live indexing — new or updated paragraphs are searchable immediately,
no restart needed
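
The table-name step can be sketched as follows. The regex and the `0.05` boost value are taken from `TABLE_RE` and `TABLE_BOOST` in the diff; everything else (the class name `TableBoostSketch`, the method names, lower-casing the names, and the simplified scoring) is an assumption for illustration. The real two-phase ranking also applies a keyword boost and a table-weight threshold that are omitted here.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a simplified view of table extraction and boosting. */
public class TableBoostSketch {
  // Same pattern as TABLE_RE in the diff: schema-qualified names after FROM or JOIN.
  static final Pattern TABLE_RE =
      Pattern.compile("(?:FROM|JOIN)\\s+([a-zA-Z_]\\w*\\.[a-zA-Z_]\\w*)",
          Pattern.CASE_INSENSITIVE);
  // Same value as TABLE_BOOST in the diff.
  static final float TABLE_BOOST = 0.05f;

  /** Collects qualified table names; lower-casing is an assumption of this sketch. */
  static Set<String> extractTables(String sql) {
    Set<String> tables = new LinkedHashSet<>();
    Matcher m = TABLE_RE.matcher(sql);
    while (m.find()) {
      tables.add(m.group(1).toLowerCase(Locale.ROOT));
    }
    return tables;
  }

  /** Adds TABLE_BOOST once per table the paragraph shares with the relevant set. */
  static float boostedScore(float similarity, Set<String> paragraphTables,
      Set<String> relevantTables) {
    int matches = 0;
    for (String table : paragraphTables) {
      if (relevantTables.contains(table)) {
        matches++;
      }
    }
    return similarity + TABLE_BOOST * matches;
  }

  public static void main(String[] args) {
    Set<String> tables = extractTables(
        "SELECT * FROM analytics.daily_sales s join ref.users u ON s.id = u.id");
    System.out.println(tables);
    System.out.println(boostedScore(0.40f, tables, tables));
  }
}
```

Note that the pattern only matches schema-qualified names, so an unqualified `FROM users` would not contribute to the boost.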
### What type of PR is it?
Feature
### Todos
- [x] EmbeddingSearch core implementation (ONNX inference, mean pooling,
cosine similarity)
- [x] Table name extraction from SQL (FROM/JOIN regex) with two-phase
search boosting
- [x] Paragraph output indexing (TABLE, TEXT results)
- [x] Versioned binary persistence (v3 format)
- [x] Live indexing (new paragraphs searchable immediately)
- [x] Angular UI: render search results with separate code/output/tables
blocks
- [x] Classic UI: same improvements
- [x] 11 unit tests including semantic validation
- [x] Documentation
### What is the Jira issue?
- https://issues.apache.org/jira/browse/ZEPPELIN-6411
### How should this be tested?
**Automated tests:**
```bash
# Embedding search tests (requires ~86MB model download, one-time)
ZEPPELIN_EMBEDDING_TEST=true mvn test -pl zeppelin-zengine \
  -Dtest=EmbeddingSearchTest
# Verify no regressions to existing Lucene search
mvn test -pl zeppelin-zengine -Dtest=LuceneSearchTest
```
**Manual testing:**
1. Set `zeppelin.search.semantic.enable = true` in `zeppelin-site.xml`
2. Restart Zeppelin
3. Search for natural language queries like:
   - "yesterday's spending" (Lucene: 0 results → Semantic: finds spend queries)
   - "how much do drivers earn" (finds taxi tip analysis)
   - "late deliveries" (finds shipping performance queries)
   - "airport rides" (both work; a keyword match exists)
### Screenshots (if appropriate)
Semantic Search with New UI
<img width="867" height="469" alt="image"
src="https://github.com/user-attachments/assets/527d0828-0ef5-4528-b064-9c95553aa6ca"
/>
Semantic Search with Classic UI
<img width="865" height="475" alt="image"
src="https://github.com/user-attachments/assets/092fa986-46d1-41c0-b2ad-b51c04e3c583"
/>
### Questions:
- Do the license files need to be updated?
- Yes. NOTICE updated with ONNX Runtime (MIT) and DJL Tokenizers (Apache
2.0) attribution.
- Are there breaking changes for older versions?
- No. Disabled by default. Existing LuceneSearch behavior is unchanged.
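
The opt-in behaviour follows the configuration matrix in `docs/embedding-search.md`, and can be pictured as the sketch below. The class name `SearchServiceChoice`, the enum, and the `choose` method are hypothetical; the actual wiring lives in `ZeppelinServer.java` and is not reproduced here.

```java
/** Illustrative only: mirrors the documented configuration matrix, not ZeppelinServer.java. */
public class SearchServiceChoice {
  enum Kind { LUCENE_SEARCH, EMBEDDING_SEARCH, NO_SEARCH_SERVICE }

  /**
   * searchEnable   ~ zeppelin.search.enable (documented default: true)
   * semanticEnable ~ zeppelin.search.semantic.enable (default: false)
   */
  static Kind choose(boolean searchEnable, boolean semanticEnable) {
    if (!searchEnable) {
      return Kind.NO_SEARCH_SERVICE; // the semantic flag is ignored when search is off
    }
    return semanticEnable ? Kind.EMBEDDING_SEARCH : Kind.LUCENE_SEARCH;
  }

  public static void main(String[] args) {
    System.out.println(choose(true, false));
    System.out.println(choose(true, true));
    System.out.println(choose(false, true));
  }
}
```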
Closes #5218 from kkalyan/ZEPPELIN-6411-semantic-search.
Signed-off-by: Jongyoul Lee <[email protected]>
---
LICENSE | 18 +
NOTICE | 12 +
bin/install-search-model.sh | 78 ++
docs/embedding-search.md | 224 +++++
zeppelin-server/pom.xml | 22 +
.../zeppelin/conf/ZeppelinConfiguration.java | 5 +
.../zeppelin/interpreter/InterpreterFactory.java | 4 +
.../apache/zeppelin/search/EmbeddingSearch.java | 952 +++++++++++++++++++++
.../org/apache/zeppelin/search/LuceneSearch.java | 13 +-
.../org/apache/zeppelin/server/ZeppelinServer.java | 7 +-
.../zeppelin/search/EmbeddingSearchTest.java | 364 ++++++++
.../src/app/interfaces/notebook.ts | 3 +
.../result-item/result-item.component.html | 22 +-
.../result-item/result-item.component.less | 131 ++-
.../result-item/result-item.component.ts | 213 ++---
.../pages/workspace/notebook/notebook.component.ts | 12 +-
.../src/app/share/header/header.component.html | 12 +-
.../src/app/share/header/header.component.ts | 18 +
.../src/app/search/result-list.controller.js | 172 ++--
zeppelin-web/src/app/search/result-list.html | 28 +-
zeppelin-web/src/app/search/search.css | 61 ++
21 files changed, 2080 insertions(+), 291 deletions(-)
diff --git a/LICENSE b/LICENSE
index 3c3f246917..c285c1b196 100644
--- a/LICENSE
+++ b/LICENSE
@@ -277,3 +277,21 @@ Eclipse Public License - v 1.0
The following components are provided under the Eclipse Public License,
version 1.0. See file headers and project links for details.
(Eclipse Public License) pty4j - http://www.eclipse.org/legal/epl-v10.html
+
+========================================================================
+MIT License
+========================================================================
+The following components are provided under the MIT License. See file headers and project links for details.
+
+ (MIT License) ONNX Runtime (https://github.com/microsoft/onnxruntime)
+ Licensed under the MIT License.
+ https://github.com/microsoft/onnxruntime/blob/main/LICENSE
+
+========================================================================
+Apache License 2.0 (bundled dependencies)
+========================================================================
+The following components are provided under the Apache License 2.0. See file headers and project links for details.
+
+ (Apache License 2.0) DJL - Deep Java Library Tokenizers (https://github.com/deepjavalibrary/djl)
+ Licensed under the Apache License, Version 2.0.
+ https://github.com/deepjavalibrary/djl/blob/master/LICENSE
diff --git a/NOTICE b/NOTICE
index bd7844b811..e1da12ea08 100644
--- a/NOTICE
+++ b/NOTICE
@@ -12,3 +12,15 @@ Portions of this software were developed at NFLabs, Inc.
(http://www.nflabs.com)
* Pseudo terminal(PTY) implementation in Java
* (Eclipse Public License) pty4j - http://www.eclipse.org/legal/epl-v10.html
+
+2. ONNX Runtime
+
+ * Cross-platform ML inferencing and training accelerator
+ * (MIT License) onnxruntime - https://github.com/microsoft/onnxruntime
+ * Copyright (c) Microsoft Corporation
+
+3. Deep Java Library (DJL) HuggingFace Tokenizers
+
+ * Java binding for HuggingFace tokenizers
+ * (Apache License 2.0) djl-tokenizers - https://github.com/deepjavalibrary/djl
+ * Copyright (c) Amazon.com, Inc.
diff --git a/bin/install-search-model.sh b/bin/install-search-model.sh
new file mode 100755
index 0000000000..18ed47e11f
--- /dev/null
+++ b/bin/install-search-model.sh
@@ -0,0 +1,78 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Downloads the sentence-transformer model required for semantic search.
+# Run this once before starting Zeppelin with zeppelin.search.semantic.enable=true.
+#
+# Usage: bin/install-search-model.sh [INDEX_PATH]
+# INDEX_PATH defaults to /tmp/zeppelin-index (matches zeppelin.search.index.path)
+
+set -euo pipefail
+
+MODEL_NAME="all-MiniLM-L6-v2"
+MODEL_REVISION="c9745ed1d9f207416be6d2e6f8de32d1f16199bf"
+BASE_URL="https://huggingface.co/sentence-transformers/${MODEL_NAME}/resolve/${MODEL_REVISION}"
+
+# Expected SHA256 checksums for integrity verification
+MODEL_SHA256="6fd5d72fe4589f189f8ebc006442dbb529bb7ce38f8082112682524616046452"
+TOKENIZER_SHA256="be50c3628f2bf5bb5e3a7f17b1f74611b2561a3a27eeab05e5aa30f411572037"
+
+INDEX_PATH="${1:-/tmp/zeppelin-index}"
+MODEL_DIR="${INDEX_PATH}/models/${MODEL_NAME}"
+
+mkdir -p "${MODEL_DIR}"
+
+verify_sha256() {
+ local file="$1" expected="$2"
+ local actual
+ if command -v sha256sum >/dev/null 2>&1; then
+ actual=$(sha256sum "${file}" | cut -d' ' -f1)
+ elif command -v shasum >/dev/null 2>&1; then
+ actual=$(shasum -a 256 "${file}" | cut -d' ' -f1)
+ else
+ echo "WARNING: Neither sha256sum nor shasum found, skipping integrity check for ${file}"
+ return 0
+ fi
+ if [ "${actual}" != "${expected}" ]; then
+ echo "ERROR: SHA256 mismatch for ${file}"
+ echo " Expected: ${expected}"
+ echo " Actual: ${actual}"
+ rm -f "${file}"
+ return 1
+ fi
+ echo "SHA256 verified: ${file}"
+}
+
+download() {
+ local url="$1" dest="$2" expected_sha="$3"
+ if [ -f "${dest}" ]; then
+ if verify_sha256 "${dest}" "${expected_sha}"; then
+ echo "Already exists and verified: ${dest}"
+ return
+ fi
+ echo "Existing file failed verification, re-downloading..."
+ fi
+ echo "Downloading ${url} ..."
+ curl -fSL --connect-timeout 30 --max-time 300 -o "${dest}.tmp" "${url}"
+ mv "${dest}.tmp" "${dest}"
+ verify_sha256 "${dest}" "${expected_sha}"
+ echo "Saved: ${dest}"
+}
+
+download "${BASE_URL}/onnx/model.onnx" "${MODEL_DIR}/model.onnx" "${MODEL_SHA256}"
+download "${BASE_URL}/tokenizer.json" "${MODEL_DIR}/tokenizer.json" "${TOKENIZER_SHA256}"
+
+echo "Model installed to ${MODEL_DIR}"
diff --git a/docs/embedding-search.md b/docs/embedding-search.md
new file mode 100644
index 0000000000..5dac212ed2
--- /dev/null
+++ b/docs/embedding-search.md
@@ -0,0 +1,224 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# ZEPPELIN-6411: Semantic Search for Notebooks using Sentence Embeddings
+
+## Summary
+
+Add `EmbeddingSearch` — a new `SearchService` implementation that enables
natural language
+search across Zeppelin notebooks using ONNX-based sentence embeddings. This is
a drop-in
+replacement for `LuceneSearch` that understands meaning, not just keywords.
+
+**Example**: Searching "yesterday's spending" finds paragraphs containing
+`SELECT sum(cost) FROM analytics.daily_sales WHERE date = current_date -
interval '1' day`
+— something keyword search cannot do (returns 0 results with LuceneSearch).
+
+## Motivation
+
+Zeppelin's current search (`LuceneSearch`) uses keyword-based full-text search
with
+Lucene's `StandardAnalyzer`. This has several limitations for notebook search:
+
+1. **No semantic understanding** — "yesterday's spend" won't find
`current_date - 1`
+2. **Poor SQL tokenization** — `StandardAnalyzer` breaks on underscores and
dots in
+ table names like `analytics_db.daily_sales`
+3. **No output indexing** — query results (table data, text output) are not
searchable
+4. **Exact match only** — users must guess the exact terms used in notebooks
+
+For teams with hundreds or thousands of notebooks (common in data/analytics
teams),
+finding the right query becomes a significant productivity bottleneck.
+
+## Architecture
+
+```
+ SearchService (abstract)
+ ├── LuceneSearch (existing, keyword-based)
+ ├── EmbeddingSearch (new, semantic)
+ └── NoSearchService (existing, no-op)
+
+┌─────────────────────────────────────────────────────────────┐
+│ EmbeddingSearch │
+│ │
+│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
+│ │ HuggingFace │ │ ONNX Runtime │ │ In-Memory Index │ │
+│ │ Tokenizer │→ │ Inference │→ │ float[][] + meta │ │
+│ │ (DJL) │ │ (CPU) │ │ ConcurrentHashMap│ │
+│ └──────────────┘ └──────────────┘ └────────┬─────────┘ │
+│ │ │
+│ Two-phase query: │ │
+│ 1. Embed query → cosine sim → find tables │ │
+│ 2. Re-rank with table boost → top-20 │ │
+│ ▼ │
+│ Index: text + title + output + tables embedding_index.bin│
+│ (persisted to disk, versioned) │
+└─────────────────────────────────────────────────────────────┘
+```
+
+### Model
+
+- **all-MiniLM-L6-v2**: 384-dimensional sentence embeddings
+- 86MB ONNX model (quantized version available at 22MB)
+- Downloaded on first use to `zeppelin.search.index.path/models/`
+- Runs on CPU via ONNX Runtime (~5ms per paragraph)
+
+### Index
+
+- In-memory `ConcurrentHashMap<String, IndexEntry>` with `ReadWriteLock`
+- Each entry stores: embedding (384 floats), notebook name, paragraph text,
+ title, extracted SQL table names, and paragraph output
+- 10K paragraphs ≈ 15MB RAM, 50K paragraphs ≈ 75MB RAM
+- Persisted as versioned binary file (`embedding_index.bin`, currently v3)
+- Brute-force cosine similarity: < 50ms for 50K paragraphs
+
+### What gets indexed (vs. LuceneSearch)
+
+| Content | LuceneSearch | EmbeddingSearch |
+|---------|:---:|:---:|
+| Paragraph text | ✓ | ✓ |
+| Paragraph title | ✓ | ✓ |
+| Notebook name | ✓ | ✓ (in embedding context) |
+| Paragraph output (TABLE, TEXT) | ✗ | ✓ |
+| SQL table names (FROM/JOIN) | ✗ | ✓ (extracted + boosted) |
+| Interpreter prefix stripped | ✗ | ✓ |
+
+### Two-Phase Search
+
+1. **Phase 1 — Table Discovery**: Run cosine similarity, collect SQL table
names
+ from top-20 results weighted by rank
+2. **Phase 2 — Table Boost**: Re-score results, boosting paragraphs that
reference
+ the discovered tables (+0.05 per matching table)
+
+This helps queries like "click funnel analysis" surface all paragraphs that
query
+the same tables, even if their SQL text is very different.
+
+## Configuration
+
+Disabled by default. Enable with a single property:
+
+```xml
+<!-- In zeppelin-site.xml -->
+<property>
+ <name>zeppelin.search.semantic.enable</name>
+ <value>true</value>
+</property>
+```
+
+Requires `zeppelin.search.enable = true` (already the default).
+
+### Configuration matrix
+
+| `search.enable` | `search.semantic.enable` | Result |
+|:---:|:---:|---|
+| true | false (default) | LuceneSearch (existing behavior) |
+| true | true | EmbeddingSearch (semantic) |
+| false | any | NoSearchService |
+
+## Changes
+
+### New files
+- `zeppelin-zengine/.../search/EmbeddingSearch.java` — Core implementation
(~700 lines)
+- `zeppelin-zengine/.../search/EmbeddingSearchTest.java` — 11 tests including
semantic validation
+- `docs/embedding-search.md` — This document
+
+### Modified files — Backend
+- `zeppelin-zengine/pom.xml` — Add `onnxruntime` and `djl-tokenizers`
dependencies
+- `zeppelin-zengine/.../conf/ZeppelinConfiguration.java` — Add
`ZEPPELIN_SEARCH_SEMANTIC_ENABLE`
+- `zeppelin-server/.../server/ZeppelinServer.java` — Wire `EmbeddingSearch`
based on config
+- `NOTICE` — Attribution for ONNX Runtime and DJL
+
+### Modified files — Frontend
+- `zeppelin-web-angular/.../result-item/` — Render search results with separate
+ code block, output block, and table name display (replaces Monaco editor)
+- `zeppelin-web/src/app/search/` — Same improvements for Classic UI
+
+### Dependencies added
+- `com.microsoft.onnxruntime:onnxruntime:1.18.0` (~50MB, Apache 2.0 compatible)
+- `ai.djl.huggingface:tokenizers:0.28.0` (~2MB, Apache 2.0, JNA excluded to
+ avoid version conflict with Zeppelin's existing JNA 4.1.0)
+
+## Search Result Response Contract
+
+Both `LuceneSearch` and `EmbeddingSearch` return `List<Map<String, String>>` with
+these keys:
+
+| Key | LuceneSearch | EmbeddingSearch |
+|-----|-------------|-----------------|
+| `id` | `noteId` or `noteId/paragraph/paragraphId` | Same |
+| `name` | Notebook title | Notebook title |
+| `snippet` | Highlighted paragraph text (`<B>` tags) | Paragraph text (no highlighting) |
+| `text` | Full paragraph text | Full paragraph text |
+| `header` | Highlighted paragraph title (`<B>` tags) | Paragraph title (plain) |
+| `title` | Same as `header` | Paragraph title (plain) |
+| `tables` | `""` (empty) | Space-separated SQL table names |
+| `output` | `""` (empty) | Paragraph output (truncated to 300 chars) |
+
+The `title`, `tables`, and `output` fields are dedicated structured fields. The
+`header` field preserves backward compatibility — for `LuceneSearch` it
contains
+the highlighted paragraph title, for `EmbeddingSearch` it contains the plain
title.
+
+### Frontend Display
+
+Both Angular and Classic UIs render search results with:
+- **Code block**: SQL/Python code with syntax-appropriate styling
+- **Output block**: Paragraph execution results (from `output` field)
+- **Table names**: Extracted SQL table names (from `tables` field)
+- **Language badge**: `sql`, `python`, `md`, etc.
+
+## Design Decisions
+
+### Why ONNX Runtime instead of a Java ML library?
+
+ONNX Runtime is the standard inference engine for transformer models. It
supports
+the exact same model files used by Python (HuggingFace, ChromaDB, etc.),
ensuring
+embedding compatibility.
+
+### Why brute-force instead of HNSW/ANN?
+
+For Zeppelin's scale (typically < 50K paragraphs), brute-force cosine
similarity
+on normalized vectors is fast enough (< 50ms), exact (no approximation error),
+and adds zero complexity.
+
+### Why download model on first use instead of bundling?
+
+The ONNX model is 86MB. Bundling it would bloat the Zeppelin distribution.
+Downloading on first use keeps the distribution lean and allows users to swap
models.
+
+### Why not use Lucene's vector search (since 9.0)?
+
+Zeppelin uses Lucene 8.7.0. Upgrading to 9.x is a separate, larger effort.
+
+## Testing
+
+```bash
+# Run embedding search tests (requires model download, ~86MB first time)
+ZEPPELIN_EMBEDDING_TEST=true mvn test -pl zeppelin-zengine \
+ -Dtest=EmbeddingSearchTest
+
+# Run existing Lucene tests (should still pass, no changes)
+mvn test -pl zeppelin-zengine -Dtest=LuceneSearchTest
+```
+
+### Key tests
+
+- `semanticSearchFindsRelatedConcepts` — validates that "yesterday's spending"
+ ranks a SQL spend query above an unrelated user count query
+- `newParagraphIsLiveIndexed` — validates that newly added paragraphs are
+ immediately searchable without restart
+
+## Future Work
+
+- [ ] Quantized model support (22MB INT8 vs 86MB FP32)
+- [ ] Hybrid search: combine embedding similarity with keyword matching
+- [ ] Configurable model URL for air-gapped environments
+- [ ] Batch embedding during initial index rebuild
+- [ ] Similarity score display in search results
diff --git a/zeppelin-server/pom.xml b/zeppelin-server/pom.xml
index 6c6b9a0b26..8dbdb57567 100644
--- a/zeppelin-server/pom.xml
+++ b/zeppelin-server/pom.xml
@@ -42,6 +42,8 @@
<kerberos.version>2.0.0-M15</kerberos.version>
<guava.version>32.0.0-jre</guava.version>
<lucene.version>8.7.0</lucene.version>
+ <onnxruntime.version>1.18.0</onnxruntime.version>
+ <djl.version>0.28.0</djl.version>
<commons.vfs2.version>2.10.0</commons.vfs2.version>
<eclipse.jgit.version>4.5.4.201711221230-r</eclipse.jgit.version>
<eirslett.version>1.6</eirslett.version>
@@ -176,6 +178,26 @@
<version>${lucene.version}</version>
</dependency>
+ <!-- Semantic search: ONNX Runtime for model inference -->
+ <dependency>
+ <groupId>com.microsoft.onnxruntime</groupId>
+ <artifactId>onnxruntime</artifactId>
+ <version>${onnxruntime.version}</version>
+ </dependency>
+
+ <!-- Semantic search: DJL HuggingFace tokenizer -->
+ <dependency>
+ <groupId>ai.djl.huggingface</groupId>
+ <artifactId>tokenizers</artifactId>
+ <version>${djl.version}</version>
+ <exclusions>
+ <exclusion>
+ <groupId>net.java.dev.jna</groupId>
+ <artifactId>jna</artifactId>
+ </exclusion>
+ </exclusions>
+ </dependency>
+
<dependency>
<groupId>com.github.eirslett</groupId>
<artifactId>frontend-plugin-core</artifactId>
diff --git a/zeppelin-server/src/main/java/org/apache/zeppelin/conf/ZeppelinConfiguration.java b/zeppelin-server/src/main/java/org/apache/zeppelin/conf/ZeppelinConfiguration.java
index 958e4a5edd..252fd51a9b 100644
--- a/zeppelin-server/src/main/java/org/apache/zeppelin/conf/ZeppelinConfiguration.java
+++ b/zeppelin-server/src/main/java/org/apache/zeppelin/conf/ZeppelinConfiguration.java
@@ -840,6 +840,10 @@ public class ZeppelinConfiguration {
return getAbsoluteDir(ConfVars.ZEPPELIN_SEARCH_INDEX_PATH);
}
+ public boolean isZeppelinSearchSemanticEnable() {
+ return getBoolean(ConfVars.ZEPPELIN_SEARCH_SEMANTIC_ENABLE);
+ }
+
public boolean isOnlyYarnCluster() {
return getBoolean(ConfVars.ZEPPELIN_SPARK_ONLY_YARN_CLUSTER);
}
@@ -1131,6 +1135,7 @@ public class ZeppelinConfiguration {
ZEPPELIN_SEARCH_INDEX_REBUILD("zeppelin.search.index.rebuild", false),
ZEPPELIN_SEARCH_USE_DISK("zeppelin.search.use.disk", true),
ZEPPELIN_SEARCH_INDEX_PATH("zeppelin.search.index.path",
"/tmp/zeppelin-index"),
+ ZEPPELIN_SEARCH_SEMANTIC_ENABLE("zeppelin.search.semantic.enable", false),
ZEPPELIN_JOBMANAGER_ENABLE("zeppelin.jobmanager.enable", false),
ZEPPELIN_SPARK_ONLY_YARN_CLUSTER("zeppelin.spark.only_yarn_cluster",
false),
ZEPPELIN_SESSION_CHECK_INTERVAL("zeppelin.session.check_interval", 60 * 10
* 1000),
diff --git a/zeppelin-server/src/main/java/org/apache/zeppelin/interpreter/InterpreterFactory.java b/zeppelin-server/src/main/java/org/apache/zeppelin/interpreter/InterpreterFactory.java
index 95dbce1e81..bb614c29f8 100644
--- a/zeppelin-server/src/main/java/org/apache/zeppelin/interpreter/InterpreterFactory.java
+++ b/zeppelin-server/src/main/java/org/apache/zeppelin/interpreter/InterpreterFactory.java
@@ -44,6 +44,10 @@ public class InterpreterFactory implements
InterpreterFactoryInterface {
// Get the default interpreter of the defaultInterpreterSetting
InterpreterSetting defaultSetting =
interpreterSettingManager.getByName(executionContext.getDefaultInterpreterGroup());
+ if (defaultSetting == null) {
+ throw new InterpreterNotFoundException("No interpreter found for group: "
+ + executionContext.getDefaultInterpreterGroup());
+ }
return defaultSetting.getDefaultInterpreter(executionContext);
}
diff --git a/zeppelin-server/src/main/java/org/apache/zeppelin/search/EmbeddingSearch.java b/zeppelin-server/src/main/java/org/apache/zeppelin/search/EmbeddingSearch.java
new file mode 100644
index 0000000000..2d60d3fc28
--- /dev/null
+++ b/zeppelin-server/src/main/java/org/apache/zeppelin/search/EmbeddingSearch.java
@@ -0,0 +1,952 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.zeppelin.search;
+
+import ai.djl.huggingface.tokenizers.Encoding;
+import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
+import com.google.common.collect.ImmutableMap;
+import ai.onnxruntime.OnnxTensor;
+import ai.onnxruntime.OrtEnvironment;
+import ai.onnxruntime.OrtException;
+import ai.onnxruntime.OrtSession;
+
+import java.io.ByteArrayOutputStream;
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.nio.LongBuffer;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+import java.nio.file.attribute.PosixFilePermissions;
+import java.security.MessageDigest;
+import java.security.NoSuchAlgorithmException;
+import java.util.ArrayList;
+import java.util.Locale;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.concurrent.locks.ReadWriteLock;
+import java.util.concurrent.locks.ReentrantReadWriteLock;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import javax.annotation.PreDestroy;
+import jakarta.inject.Inject;
+
+import org.apache.commons.lang3.StringUtils;
+import org.apache.zeppelin.conf.ZeppelinConfiguration;
+import org.apache.zeppelin.interpreter.InterpreterResult;
+import org.apache.zeppelin.interpreter.InterpreterResultMessage;
+import org.apache.zeppelin.notebook.Note;
+import org.apache.zeppelin.notebook.Notebook;
+import org.apache.zeppelin.notebook.Paragraph;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Semantic search for Zeppelin notebooks using ONNX-based sentence embeddings.
+ *
+ * <p>Uses the all-MiniLM-L6-v2 model to generate 384-dimensional embeddings
for each
+ * paragraph's text, title, and output. Queries are embedded with the same
model and
+ * matched via cosine similarity, enabling natural language search like
+ * "yesterday's spend query" to find {@code WHERE date = current_date - 1}.
+ *
+ * <p>The embedding index is held in memory (float[][] + metadata) and
persisted to a
+ * single binary file on disk. For typical Zeppelin deployments (< 50K
paragraphs),
+ * brute-force cosine similarity completes in under 50ms.
+ *
+ * <p>Model files are downloaded on first use to {@code
zeppelin.search.index.path}
+ * and cached for subsequent starts.
+ */
+public class EmbeddingSearch extends SearchService {
+ private static final Logger LOGGER =
LoggerFactory.getLogger(EmbeddingSearch.class);
+
+ private static final String MODEL_NAME = "all-MiniLM-L6-v2";
+ private static final int EMBEDDING_DIM = 384;
+ private static final int MAX_SEQ_LENGTH = 256;
+ /** Maximum number of candidates returned from {@link #query(String)}. */
+ private static final int MAX_RESULTS = 20;
+ /**
+ * Cosine similarity floor for a candidate to be considered a match.
+ * Tuned empirically against all-MiniLM-L6-v2: values below this are
effectively noise
+ * for short-query / long-paragraph comparisons. See embedding-search.md for
details.
+ */
+ private static final float MIN_SIMILARITY = 0.25f;
+ private static final int MAX_TEXT_LENGTH = 1500;
+
+ static final String ID_FIELD = "id";
+ private static final String PARAGRAPH = "paragraph";
+ /** Regex to extract qualified table names from SQL (e.g. schema.table). */
+ private static final Pattern TABLE_RE =
+ Pattern.compile("(?:FROM|JOIN)\\s+([a-zA-Z_]\\w*\\.[a-zA-Z_]\\w*)",
Pattern.CASE_INSENSITIVE);
+ /**
+ * Additive score boost applied to a candidate for each relevant table it
references.
+ * Chosen small enough that it only breaks ties among already-similar
candidates
+ * and cannot promote semantically unrelated results past {@link
#MIN_SIMILARITY}.
+ */
+ private static final float TABLE_BOOST = 0.05f;
+ /**
+ * Additive score boost when the query string appears literally in the
indexed text.
+ * Ensures exact keyword matches surface even when the embedding similarity
is low
+ * (e.g. searching "TETRIS" in SQL containing TETRIS_VIDEO_SINGLE_MEDIA).
+ */
+ private static final float KEYWORD_BOOST = 0.30f;
+ /**
+ * Fraction of the top table's weight used as the cutoff for "relevant"
tables in Phase 1
+ * of {@link #query(String)}. Tables below this share are dropped from the
boost set
+ * to avoid amplifying incidental mentions.
+ */
+ private static final float TABLE_WEIGHT_THRESHOLD_RATIO = 0.2f;
+ private static final long FLUSH_INTERVAL_SECONDS = 5;
+ /**
+ * Hard upper bound on deserialized entry count to protect against a
corrupted/tampered
+ * index file causing unbounded allocation on startup. 10M paragraphs is
well beyond any
+ * plausible deployment (~18 GB of vectors alone at 384 floats/entry).
+ */
+ private static final int MAX_INDEX_ENTRIES = 10_000_000;
+ private static final String INDEX_FILE_NAME = "embedding_index.bin";
+ /** Binary format version written by {@link #saveIndex()} and required by
{@link #loadIndex()}. */
+ private static final int INDEX_VERSION = 3;
+ private static final String EXPECTED_MODEL_SHA256 =
+ "6fd5d72fe4589f189f8ebc006442dbb529bb7ce38f8082112682524616046452";
+
+ private final Notebook notebook;
+ private final Path indexPath;
+
+ // ONNX inference
+ private OrtEnvironment ortEnv;
+ private OrtSession ortSession;
+ private HuggingFaceTokenizer tokenizer;
+
+ // In-memory vector index: docId -> (embedding, metadata)
+ private final ConcurrentHashMap<String, IndexEntry> index = new
ConcurrentHashMap<>();
+ private final ReadWriteLock indexLock = new ReentrantReadWriteLock();
+ private final AtomicBoolean indexDirty = new AtomicBoolean(false);
+ private final ScheduledExecutorService flushScheduler =
+ Executors.newSingleThreadScheduledExecutor(r -> {
+ Thread t = new Thread(r, "EmbeddingSearch-flush");
+ t.setDaemon(true);
+ return t;
+ });
+
+ /** A single indexed document (paragraph or note name). */
+ // TODO(ZEPPELIN-6413): Reduce in-memory duplication by keeping only {embedding, docId} here
+ // and rehydrating text/title/output from Notebook.processNote() at query time. Needs a perf
+ // comparison against the current in-memory path and a consistency story on LRU eviction.
+ private static class IndexEntry {
+ final float[] embedding;
+ final String noteName;
+ final String text;
+ final String title;
+ final String tables;
+ final String output;
+
+ IndexEntry(float[] embedding, String noteName, String text, String title,
+ String tables, String output) {
+ this.embedding = embedding;
+ this.noteName = noteName;
+ this.text = text;
+ this.title = title;
+ this.tables = tables;
+ this.output = output;
+ }
+ }
+
+ @Inject
+ public EmbeddingSearch(ZeppelinConfiguration zConf, Notebook notebook)
throws IOException {
+ super("EmbeddingSearch");
+ this.notebook = notebook;
+ this.indexPath = Paths.get(zConf.getZeppelinSearchIndexPath());
+ Files.createDirectories(indexPath);
+ restrictPermissions(indexPath);
+
+ try {
+ initModel();
+ } catch (Exception e) {
+ throw new IOException("Failed to initialize embedding model", e);
+ }
+
+ boolean indexLoaded = loadIndex();
+ if (shouldBootstrapIndex(zConf, indexLoaded)) {
+ notebook.addInitConsumer(this::addNoteIndex);
+ }
+ flushScheduler.scheduleWithFixedDelay(this::flushIfDirty,
+ FLUSH_INTERVAL_SECONDS, FLUSH_INTERVAL_SECONDS, TimeUnit.SECONDS);
+ this.notebook.addNotebookEventListener(this);
+ }
+
+ /** Package-private constructor for testing without DI. */
+ EmbeddingSearch(ZeppelinConfiguration zConf, Notebook notebook, boolean
skipModel)
+ throws IOException {
+ super("EmbeddingSearch");
+ this.notebook = notebook;
+ this.indexPath = Paths.get(zConf.getZeppelinSearchIndexPath());
+ Files.createDirectories(indexPath);
+ restrictPermissions(indexPath);
+ if (!skipModel) {
+ try {
+ initModel();
+ } catch (Exception e) {
+ throw new IOException("Failed to initialize embedding model", e);
+ }
+ }
+ boolean indexLoaded = loadIndex();
+ if (shouldBootstrapIndex(zConf, indexLoaded)) {
+ notebook.addInitConsumer(this::addNoteIndex);
+ }
+ flushScheduler.scheduleWithFixedDelay(this::flushIfDirty,
+ FLUSH_INTERVAL_SECONDS, FLUSH_INTERVAL_SECONDS, TimeUnit.SECONDS);
+ this.notebook.addNotebookEventListener(this);
+ }
+
+ private static void restrictPermissions(Path dir) {
+ try {
+ if (Files.getFileStore(dir).supportsFileAttributeView("posix")) {
+ Files.setPosixFilePermissions(dir,
+ PosixFilePermissions.fromString("rwx------"));
+ }
+ } catch (IOException e) {
+ LOGGER.warn("Could not restrict permissions on {}", dir, e);
+ }
+ if (dir.toAbsolutePath().startsWith("/tmp")) {
+ LOGGER.warn("zeppelin.search.index.path is under /tmp ({}); "
+ + "paragraph text and output will be readable by other local users. "
+ + "Consider setting it to a private directory.", dir);
+ }
+ }
+
+ // ---- Model initialization ----
+
+ private void initModel() throws OrtException, IOException {
+ Path modelDir = indexPath.resolve("models").resolve(MODEL_NAME);
+ Files.createDirectories(modelDir);
+
+ Path modelFile = modelDir.resolve("model.onnx");
+ Path tokenizerFile = modelDir.resolve("tokenizer.json");
+
+ if (!Files.exists(modelFile) || !Files.exists(tokenizerFile)) {
+ throw new IOException(
+ "Embedding model not found at " + modelDir + ". "
+ + "Run bin/install-search-model.sh before enabling semantic search.");
+ }
+
+ verifyModelSha256(modelFile);
+
+ ortEnv = OrtEnvironment.getEnvironment();
+ OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
+ opts.setIntraOpNumThreads(Runtime.getRuntime().availableProcessors());
+ ortSession = ortEnv.createSession(modelFile.toString(), opts);
+ tokenizer = HuggingFaceTokenizer.newInstance(tokenizerFile);
+ LOGGER.info("Embedding model loaded: {}, dim={}", MODEL_NAME, EMBEDDING_DIM);
+ }
+
+ private static void verifyModelSha256(Path modelFile) throws IOException {
+ try {
+ MessageDigest digest = MessageDigest.getInstance("SHA-256");
+ byte[] fileBytes = Files.readAllBytes(modelFile);
+ byte[] hash = digest.digest(fileBytes);
+ StringBuilder sb = new StringBuilder();
+ for (byte b : hash) {
+ sb.append(String.format("%02x", b));
+ }
+ String actual = sb.toString();
+ if (!EXPECTED_MODEL_SHA256.equals(actual)) {
+ throw new IOException("model.onnx SHA256 mismatch — expected "
+ + EXPECTED_MODEL_SHA256 + " but got " + actual
+ + ". Re-run bin/install-search-model.sh");
+ }
+ LOGGER.info("Model SHA256 verified: {}", modelFile);
+ } catch (NoSuchAlgorithmException e) {
+ LOGGER.warn("SHA-256 not available, skipping model integrity check", e);
+ }
+ }
+
+ // ---- Embedding computation ----
+
+ /**
+ * Compute a normalized embedding for the given text.
+ * Uses mean pooling over token embeddings with attention mask.
+ */
+ float[] embed(String text) {
+ if (ortSession == null || tokenizer == null) {
+ return new float[EMBEDDING_DIM];
+ }
+ try {
+ Encoding encoding = tokenizer.encode(text, true, true);
+ long[] inputIds = encoding.getIds();
+ long[] attentionMask = encoding.getAttentionMask();
+
+ // Truncate to max sequence length
+ int seqLen = Math.min(inputIds.length, MAX_SEQ_LENGTH);
+ long[] ids = new long[seqLen];
+ long[] mask = new long[seqLen];
+ long[] tokenTypeIds = new long[seqLen];
+ System.arraycopy(inputIds, 0, ids, 0, seqLen);
+ System.arraycopy(attentionMask, 0, mask, 0, seqLen);
+
+ long[] shape = {1, seqLen};
+ OnnxTensor idsTensor = null;
+ OnnxTensor maskTensor = null;
+ OnnxTensor typeTensor = null;
+ try {
+ idsTensor = OnnxTensor.createTensor(ortEnv, LongBuffer.wrap(ids), shape);
+ maskTensor = OnnxTensor.createTensor(ortEnv, LongBuffer.wrap(mask), shape);
+ typeTensor = OnnxTensor.createTensor(ortEnv, LongBuffer.wrap(tokenTypeIds), shape);
+
+ Map<String, OnnxTensor> inputs = new HashMap<>();
+ inputs.put("input_ids", idsTensor);
+ inputs.put("attention_mask", maskTensor);
+ inputs.put("token_type_ids", typeTensor);
+
+ try (OrtSession.Result result = ortSession.run(inputs)) {
+ // Output shape: [1, seqLen, 384] — mean pool over sequence dim
+ float[][][] output = (float[][][]) result.get(0).getValue();
+ float[] pooled = meanPool(output[0], mask, seqLen);
+ normalize(pooled);
+ return pooled;
+ }
+ } finally {
+ if (idsTensor != null) {
+ idsTensor.close();
+ }
+ if (maskTensor != null) {
+ maskTensor.close();
+ }
+ if (typeTensor != null) {
+ typeTensor.close();
+ }
+ }
+ } catch (OrtException e) {
+ LOGGER.error("Embedding failed for text length {}", text.length(), e);
+ return new float[EMBEDDING_DIM];
+ }
+ }
+
+ /** Mean pooling: average token embeddings weighted by attention mask. */
+ private static float[] meanPool(float[][] tokenEmbeddings, long[] mask, int seqLen) {
+ float[] result = new float[EMBEDDING_DIM];
+ float maskSum = 0;
+ for (int i = 0; i < seqLen; i++) {
+ if (mask[i] == 1) {
+ maskSum++;
+ for (int j = 0; j < EMBEDDING_DIM; j++) {
+ result[j] += tokenEmbeddings[i][j];
+ }
+ }
+ }
+ if (maskSum > 0) {
+ for (int j = 0; j < EMBEDDING_DIM; j++) {
+ result[j] /= maskSum;
+ }
+ }
+ return result;
+ }
+
+ /** L2-normalize in place. */
+ private static void normalize(float[] vec) {
+ float norm = 0;
+ for (float v : vec) {
+ norm += v * v;
+ }
+ norm = (float) Math.sqrt(norm);
+ if (norm > 0) {
+ for (int i = 0; i < vec.length; i++) {
+ vec[i] /= norm;
+ }
+ }
+ }
+
+ /** Cosine similarity between two normalized vectors (= dot product). */
+ private static float cosineSimilarity(float[] a, float[] b) {
+ float dot = 0;
+ for (int i = 0; i < a.length; i++) {
+ dot += a[i] * b[i];
+ }
+ return dot;
+ }
+
+ /**
+ * Wrap occurrences of each query word in {@code <B>} tags (case-insensitive)
+ * to match Lucene's highlighting convention.
+ */
+ static String highlightTerms(String text, String queryStr) {
+ if (StringUtils.isBlank(text) || StringUtils.isBlank(queryStr)) {
+ return text;
+ }
+ String[] words = queryStr.split("\\s+");
+ for (String word : words) {
+ if (word.isEmpty()) {
+ continue;
+ }
+ String escaped = Pattern.quote(word);
+ text = text.replaceAll("(?i)(" + escaped + ")", "<B>$1</B>");
+ }
+ return text;
+ }
+
+ // ---- Text extraction ----
+
+ /**
+ * Strip interpreter prefix like {@code %spark.sql}, {@code %athena} from paragraph text.
+ * Handles both {@code %name\ncode} and {@code %name code} formats.
+ */
+ static String stripInterpreterPrefix(String text) {
+ if (text == null || !text.startsWith("%")) {
+ return text;
+ }
+ // Find end of interpreter directive: first newline or first space after %word
+ int newlineIdx = text.indexOf('\n');
+ if (newlineIdx >= 0) {
+ return text.substring(newlineIdx + 1);
+ }
+ // Single-line: "%interpreter some code" — strip up to first space
+ int spaceIdx = text.indexOf(' ');
+ if (spaceIdx >= 0) {
+ return text.substring(spaceIdx + 1);
+ }
+ // Just "%interpreter" with no content
+ return "";
+ }
+
+ /**
+ * Extract qualified table names (schema.table) from SQL text.
+ */
+ static String extractTables(String text) {
+ if (text == null) {
+ return "";
+ }
+ Set<String> tables = new HashSet<>();
+ Matcher m = TABLE_RE.matcher(text);
+ while (m.find()) {
+ tables.add(m.group(1).toLowerCase());
+ }
+ return String.join(" ", tables);
+ }
+
+ /**
+ * Extract searchable output text from paragraph results (TABLE headers, TEXT).
+ */
+ static String extractOutput(Paragraph p) {
+ InterpreterResult result = p.getReturn();
+ if (result == null) {
+ return "";
+ }
+ StringBuilder sb = new StringBuilder();
+ for (InterpreterResultMessage msg : result.message()) {
+ if (msg.getType() == InterpreterResult.Type.TEXT
+ || msg.getType() == InterpreterResult.Type.TABLE) {
+ String data = msg.getData();
+ if (StringUtils.isNotBlank(data)) {
+ sb.append(data, 0, Math.min(data.length(), 500));
+ sb.append("\n");
+ }
+ }
+ }
+ return sb.toString().trim();
+ }
+
+ /**
+ * Build a rich text representation of a paragraph for embedding.
+ * Includes code/text, title, table names, and output (table headers, text results).
+ */
+ private String buildParagraphText(String noteName, Paragraph p) {
+ StringBuilder sb = new StringBuilder();
+ if (StringUtils.isNotBlank(noteName)) {
+ sb.append("Notebook: ").append(noteName).append("\n");
+ }
+ if (StringUtils.isNotBlank(p.getTitle())) {
+ sb.append(p.getTitle()).append("\n");
+ }
+ if (StringUtils.isNotBlank(p.getText())) {
+ String text = p.getText();
+ // Strip interpreter prefix (e.g. "%spark.sql", "%athena\n")
+ text = stripInterpreterPrefix(text);
+ // Include extracted table names for better semantic matching
+ String tables = extractTables(text);
+ if (StringUtils.isNotBlank(tables)) {
+ sb.append("Tables: ").append(tables).append("\n");
+ }
+ sb.append(text, 0, Math.min(text.length(), MAX_TEXT_LENGTH));
+ }
+ // Include output for richer semantic matching
+ InterpreterResult result = p.getReturn();
+ if (result != null) {
+ for (InterpreterResultMessage msg : result.message()) {
+ if (msg.getType() == InterpreterResult.Type.TEXT
+ || msg.getType() == InterpreterResult.Type.TABLE) {
+ String data = msg.getData();
+ if (StringUtils.isNotBlank(data)) {
+ sb.append("\n").append(data, 0, Math.min(data.length(), 500));
+ }
+ }
+ }
+ }
+ return sb.toString();
+ }
+
+ // ---- SearchService implementation ----
+
+ @Override
+ // TODO(ZEPPELIN-6414): Accept user/roles (or a readability Predicate) and apply the auth
+ // filter before Phase-1 table collection and before the top-K cutoff. Currently the REST
+ // layer filters after truncation, which can hide results the caller is authorized for and
+ // lets inaccessible notes contaminate the table-boost ranking. Requires a SearchService
+ // interface change that also affects LuceneSearch.
+ public List<Map<String, String>> query(String queryStr) {
+ if (StringUtils.isBlank(queryStr) || index.isEmpty()) {
+ return Collections.emptyList();
+ }
+
+ float[] queryEmbedding = embed(queryStr);
+ String queryLower = queryStr.toLowerCase(Locale.ROOT);
+
+ // Phase 1: find top-N results and discover relevant tables
+ List<Map.Entry<String, Float>> scored = new ArrayList<>();
+ indexLock.readLock().lock();
+ try {
+ for (Map.Entry<String, IndexEntry> entry : index.entrySet()) {
+ float sim = cosineSimilarity(queryEmbedding, entry.getValue().embedding);
+ IndexEntry ie = entry.getValue();
+ if (ie.text != null && ie.text.toLowerCase(Locale.ROOT).contains(queryLower)) {
+ sim += KEYWORD_BOOST;
+ }
+ scored.add(Map.entry(entry.getKey(), sim));
+ }
+ } finally {
+ indexLock.readLock().unlock();
+ }
+ scored.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
+
+ // Collect tables from the top candidates, weighted by rank
+ Map<String, Float> tableWeights = new HashMap<>();
+ for (int i = 0; i < Math.min(scored.size(), MAX_RESULTS); i++) {
+ IndexEntry entry = index.get(scored.get(i).getKey());
+ if (entry != null && StringUtils.isNotBlank(entry.tables)) {
+ float weight = 1.0f / (i + 1);
+ for (String t : entry.tables.split(" ")) {
+ tableWeights.merge(t, weight, Float::sum);
+ }
+ }
+ }
+ // Keep tables with weight >= TABLE_WEIGHT_THRESHOLD_RATIO of top table's weight
+ Set<String> relevantTables = new HashSet<>();
+ if (!tableWeights.isEmpty()) {
+ float maxWeight = Collections.max(tableWeights.values());
+ float threshold = maxWeight * TABLE_WEIGHT_THRESHOLD_RATIO;
+ tableWeights.forEach((t, w) -> {
+ if (w >= threshold) {
+ relevantTables.add(t);
+ }
+ });
+ }
+
+ // Phase 2: re-score with table boost, collect candidates with boosted scores
+ List<Map.Entry<Map<String, String>, Float>> candidates = new ArrayList<>();
+ for (int i = 0; i < scored.size() && candidates.size() < MAX_RESULTS; i++) {
+ float sim = scored.get(i).getValue();
+ if (sim < MIN_SIMILARITY) {
+ break;
+ }
+ String docId = scored.get(i).getKey();
+ IndexEntry entry = index.get(docId);
+ if (entry == null || StringUtils.isBlank(entry.text)) {
+ continue;
+ }
+ if (!relevantTables.isEmpty() && StringUtils.isNotBlank(entry.tables)) {
+ for (String t : entry.tables.split(" ")) {
+ if (relevantTables.contains(t)) {
+ sim += TABLE_BOOST;
+ }
+ }
+ }
+ String title = entry.title != null ? entry.title : "";
+ String tables = entry.tables != null ? entry.tables : "";
+ String output = "";
+ if (StringUtils.isNotBlank(entry.output)) {
+ output = entry.output;
+ if (output.length() > 300) {
+ output = output.substring(0, 300);
+ }
+ }
+ String snippet = highlightTerms(entry.text, queryStr);
+ String highlightedTitle = highlightTerms(title, queryStr);
+ candidates.add(Map.entry(ImmutableMap.<String, String>builder()
+ .put("id", docId)
+ .put("name", entry.noteName != null ? entry.noteName : "")
+ .put("snippet", snippet)
+ .put("text", entry.text)
+ .put("header", highlightedTitle)
+ .put("title", highlightedTitle)
+ .put("tables", tables)
+ .put("output", output)
+ .build(), sim));
+ }
+ // Re-sort by boosted score
+ candidates.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
+ List<Map<String, String>> results = new ArrayList<>();
+ for (Map.Entry<Map<String, String>, Float> c : candidates) {
+ results.add(c.getKey());
+ }
+ return results;
+ }
+
+ @Override
+ public void addNoteIndex(String noteId) {
+ try {
+ notebook.processNote(noteId, note -> {
+ if (note != null) {
+ indexNote(note);
+ }
+ return null;
+ });
+ markDirty();
+ } catch (IOException e) {
+ LOGGER.error("Failed to add note {} to index", noteId, e);
+ }
+ }
+
+ @Override
+ public void addParagraphIndex(String noteId, String paragraphId) {
+ try {
+ notebook.processNote(noteId, note -> {
+ if (note != null) {
+ Paragraph p = note.getParagraph(paragraphId);
+ if (p != null) {
+ indexParagraph(note.getId(), note.getName(), p);
+ }
+ }
+ return null;
+ });
+ markDirty();
+ } catch (IOException e) {
+ LOGGER.error("Failed to add paragraph {} of note {}", paragraphId, noteId, e);
+ }
+ }
+
+ @Override
+ public void updateNoteIndex(String noteId) {
+ // Mirror LuceneSearch.updateNoteIndex: this event path is invoked for note-metadata
+ // changes (rename, cron config, etc.); paragraph edits come through the
+ // add/updateParagraphIndex path. Re-embedding every paragraph here was pure waste for
+ // cron changes and heavy even for renames. Just refresh the noteName field on existing
+ // entries; the embedding slightly drifts (note name contributes to buildParagraphText)
+ // but self-heals on the next paragraph touch.
+ if (noteId == null) {
+ return;
+ }
+ try {
+ notebook.processNote(noteId, note -> {
+ if (note == null) {
+ return null;
+ }
+ String newName = note.getName();
+ if (newName == null) {
+ return null;
+ }
+ indexLock.writeLock().lock();
+ try {
+ boolean mutated = false;
+ String notePrefix = noteId + "/";
+ for (Map.Entry<String, IndexEntry> e : index.entrySet()) {
+ String docId = e.getKey();
+ if (!docId.equals(noteId) && !docId.startsWith(notePrefix)) {
+ continue;
+ }
+ IndexEntry old = e.getValue();
+ if (newName.equals(old.noteName)) {
+ continue;
+ }
+ e.setValue(new IndexEntry(old.embedding, newName, old.text, old.title,
+ old.tables, old.output));
+ mutated = true;
+ }
+ if (mutated) {
+ markDirty();
+ }
+ } finally {
+ indexLock.writeLock().unlock();
+ }
+ return null;
+ });
+ } catch (IOException e) {
+ LOGGER.error("Failed to update note index {}", noteId, e);
+ }
+ }
+
+ @Override
+ public void updateParagraphIndex(String noteId, String paragraphId) {
+ try {
+ notebook.processNote(noteId, note -> {
+ if (note != null) {
+ Paragraph p = note.getParagraph(paragraphId);
+ if (p != null) {
+ indexParagraph(noteId, note.getName(), p);
+ }
+ }
+ return null;
+ });
+ markDirty();
+ } catch (IOException e) {
+ LOGGER.error("Failed to update paragraph {} of note {}", paragraphId, noteId, e);
+ }
+ }
+
+ @Override
+ public void deleteNoteIndex(String noteId) {
+ if (noteId == null) {
+ return;
+ }
+ indexLock.writeLock().lock();
+ try {
+ index.entrySet().removeIf(e ->
+ e.getKey().equals(noteId) || e.getKey().startsWith(noteId + "/"));
+ } finally {
+ indexLock.writeLock().unlock();
+ }
+ markDirty();
+ }
+
+ @Override
+ public void deleteParagraphIndex(String noteId, String paragraphId) {
+ if (noteId == null) {
+ return;
+ }
+ String docId = paragraphId != null
+ ? String.join("/", noteId, PARAGRAPH, paragraphId)
+ : noteId;
+ index.remove(docId);
+ markDirty();
+ }
+
+ @Override
+ @PreDestroy
+ public void close() {
+ super.close();
+ flushScheduler.shutdown();
+ flushIfDirty();
+ try {
+ if (ortSession != null) {
+ ortSession.close();
+ }
+ if (tokenizer != null) {
+ tokenizer.close();
+ }
+ } catch (OrtException e) {
+ LOGGER.error("Failed to close ONNX session", e);
+ }
+ }
+
+ private void markDirty() {
+ indexDirty.set(true);
+ }
+
+ /**
+ * Decide whether to register the initial-indexing consumer.
+ *
+ * @param zConf Zeppelin configuration (for {@code isIndexRebuild})
+ * @param loaded whether {@link #loadIndex()} completed successfully
+ * @return {@code true} if the index needs to be (re)built from notebooks. Triggers when
+ * config requests rebuild, the index file is missing, or it was present but
+ * failed to load (corrupt/partial). A failed load also deletes the bad file so
+ * the rebuilt index is written fresh.
+ */
+ private boolean shouldBootstrapIndex(ZeppelinConfiguration zConf, boolean loaded) {
+ Path indexFile = indexPath.resolve(INDEX_FILE_NAME);
+ boolean fileMissing = !Files.exists(indexFile);
+ boolean corrupt = !loaded;
+ if (corrupt && !fileMissing) {
+ try {
+ Files.deleteIfExists(indexFile);
+ LOGGER.warn("Deleted corrupt embedding index file {}; will rebuild", indexFile);
+ } catch (IOException e) {
+ LOGGER.warn("Failed to delete corrupt embedding index file {}; will rebuild anyway",
+ indexFile, e);
+ }
+ }
+ return zConf.isIndexRebuild() || fileMissing || corrupt;
+ }
+
+ private void flushIfDirty() {
+ if (indexDirty.compareAndSet(true, false)) {
+ try {
+ saveIndex();
+ } catch (IOException e) {
+ // Re-set dirty so the next scheduled tick retries the flush
+ // instead of silently dropping the failed write until the next mutation.
+ indexDirty.set(true);
+ LOGGER.error("Failed to flush embedding index to disk; will retry on next tick", e);
+ }
+ }
+ }
+
+ // ---- Internal indexing ----
+
+ private void indexNote(Note note) {
+ String noteName = note.getName();
+ // Index each paragraph (note name is included in paragraph embedding text)
+ for (Paragraph p : note.getParagraphs()) {
+ indexParagraph(note.getId(), noteName, p);
+ }
+ }
+
+ private void indexParagraph(String noteId, String noteName, Paragraph p) {
+ String text = buildParagraphText(noteName, p);
+ if (StringUtils.isBlank(text)) {
+ return;
+ }
+ float[] emb = embed(text);
+ String docId = String.join("/", noteId, PARAGRAPH, p.getId());
+ String title = p.getTitle() != null ? p.getTitle() : "";
+ String pText = p.getText() != null ? stripInterpreterPrefix(p.getText()) : "";
+ String tables = extractTables(pText);
+ String output = extractOutput(p);
+
+ indexLock.writeLock().lock();
+ try {
+ index.put(docId, new IndexEntry(emb, noteName, pText, title, tables, output));
+ } finally {
+ indexLock.writeLock().unlock();
+ }
+ }
+
+ static String formatId(String noteId, Paragraph p) {
+ if (p != null) {
+ return String.join("/", noteId, PARAGRAPH, p.getId());
+ }
+ return noteId;
+ }
+
+ // ---- Persistence ----
+
+ /**
+ * Save index to a binary file.
+ * Format: [int:version=INDEX_VERSION][int:count] then for each entry:
+ * [utf:docId] [utf:noteName] [utf:text] [utf:title] [utf:tables] [utf:output] [float[384]:embedding]
+ */
+ // TODO(ZEPPELIN-6412): Shard persistence by note (e.g. index/notes/<noteId>.bin) so a single
+ // paragraph edit only rewrites that note's file instead of the full index. Needs a per-note
+ // lock strategy, a manifest for load, and a compaction path for deletes; may also revisit
+ // append-only log + periodic compaction as the persistence model.
+ private void saveIndex() throws IOException {
+ Path file = indexPath.resolve(INDEX_FILE_NAME);
+ Path tmpFile = indexPath.resolve(INDEX_FILE_NAME + ".tmp");
+
+ // Serialize to buffer under lock
+ byte[] data;
+ indexLock.readLock().lock();
+ try {
+ ByteArrayOutputStream baos = new ByteArrayOutputStream();
+ try (DataOutputStream out = new DataOutputStream(baos)) {
+ out.writeInt(INDEX_VERSION);
+ out.writeInt(index.size());
+ for (Map.Entry<String, IndexEntry> e : index.entrySet()) {
+ out.writeUTF(e.getKey());
+ out.writeUTF(e.getValue().noteName != null ? e.getValue().noteName : "");
+ String text = e.getValue().text != null ? e.getValue().text : "";
+ if (text.length() > 2000) {
+ text = text.substring(0, 2000);
+ }
+ out.writeUTF(text);
+ out.writeUTF(e.getValue().title != null ? e.getValue().title : "");
+ out.writeUTF(e.getValue().tables != null ? e.getValue().tables : "");
+ String output = e.getValue().output != null ? e.getValue().output : "";
+ if (output.length() > 1000) {
+ output = output.substring(0, 1000);
+ }
+ out.writeUTF(output);
+ for (float v : e.getValue().embedding) {
+ out.writeFloat(v);
+ }
+ }
+ }
+ data = baos.toByteArray();
+ } finally {
+ indexLock.readLock().unlock();
+ }
+
+ // Write to disk outside lock
+ Files.write(tmpFile, data);
+ Files.move(tmpFile, file, java.nio.file.StandardCopyOption.REPLACE_EXISTING,
+ java.nio.file.StandardCopyOption.ATOMIC_MOVE);
+ // Restrict file permissions
+ try {
+ if (Files.getFileStore(file).supportsFileAttributeView("posix")) {
+ Files.setPosixFilePermissions(file,
+ PosixFilePermissions.fromString("rw-------"));
+ }
+ } catch (IOException e) {
+ LOGGER.warn("Could not restrict permissions on {}", file, e);
+ }
+ }
+
+ /**
+ * Load the index from disk.
+ *
+ * @return {@code true} if the index loaded successfully (or file was absent);
+ * {@code false} if the file was present but failed to load or was corrupt,
+ * signalling the caller to trigger a bootstrap rebuild.
+ */
+ private boolean loadIndex() {
+ Path file = indexPath.resolve(INDEX_FILE_NAME);
+ if (!Files.exists(file)) {
+ return true;
+ }
+ try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
+ int version = in.readInt();
+ if (version != INDEX_VERSION) {
+ LOGGER.warn("Index file version {} does not match expected {}; treating as corrupt "
+ + "and rebuilding", version, INDEX_VERSION);
+ return false;
+ }
+ int count = in.readInt();
+ LOGGER.info("Loading {} embedding index entries (v{}) from {}", count, version, file);
+ if (count < 0 || count > MAX_INDEX_ENTRIES) {
+ LOGGER.error("Index entry count {} exceeds sanity bound ({}), treating as corrupt",
+ count, MAX_INDEX_ENTRIES);
+ return false;
+ }
+ for (int i = 0; i < count; i++) {
+ String docId = in.readUTF();
+ String noteName = in.readUTF();
+ String text = in.readUTF();
+ String title = in.readUTF();
+ String tables = in.readUTF();
+ String output = in.readUTF();
+ float[] emb = new float[EMBEDDING_DIM];
+ for (int j = 0; j < EMBEDDING_DIM; j++) {
+ emb[j] = in.readFloat();
+ }
+ index.put(docId, new IndexEntry(emb, noteName, text, title, tables, output));
+ }
+ LOGGER.info("Loaded {} entries into embedding index", index.size());
+ return true;
+ } catch (IOException e) {
+ LOGGER.warn("Failed to load embedding index from {}; will rebuild on init", file, e);
+ // Clear any partially-loaded state so we start from a clean slate on rebuild.
+ index.clear();
+ return false;
+ }
+ }
+}
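
Reviewer note (illustrative, not part of the patch): the embedding path above mean-pools the token vectors under the attention mask, L2-normalizes the result, and then scores with a plain dot product, which equals cosine similarity once both vectors have unit length. The sketch below reproduces that arithmetic on a toy 2-dimensional example. The class name `PoolingSketch`, the 2-dimensional vectors, and the sample values are assumptions made for illustration; the real code works on 384-dimensional ONNX output.

```java
import java.util.Arrays;

/** Toy re-statement of the pooling math; hypothetical class, not from the commit. */
class PoolingSketch {

    /** Average only the token vectors whose attention-mask entry is 1. */
    static float[] meanPool(float[][] tokens, long[] mask) {
        int dim = tokens[0].length;
        float[] out = new float[dim];
        int kept = 0;
        for (int i = 0; i < tokens.length; i++) {
            if (mask[i] != 1) {
                continue; // padding token: contributes nothing
            }
            kept++;
            for (int j = 0; j < dim; j++) {
                out[j] += tokens[i][j];
            }
        }
        if (kept > 0) {
            for (int j = 0; j < dim; j++) {
                out[j] /= kept;
            }
        }
        return out;
    }

    /** Scale the vector to unit length in place; a zero vector is left unchanged. */
    static void normalize(float[] v) {
        float sumSquares = 0;
        for (float x : v) {
            sumSquares += x * x;
        }
        float norm = (float) Math.sqrt(sumSquares);
        if (norm > 0) {
            for (int i = 0; i < v.length; i++) {
                v[i] /= norm;
            }
        }
    }

    /** Dot product; equals cosine similarity when both inputs are unit length. */
    static float dot(float[] a, float[] b) {
        float sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // Two real tokens and one padding token that the mask excludes.
        float[][] tokens = {{3f, 0f}, {0f, 4f}, {100f, 100f}};
        long[] mask = {1, 1, 0};
        float[] doc = meanPool(tokens, mask);
        normalize(doc);
        float[] query = {1f, 0f}; // already unit length
        System.out.println(Arrays.toString(doc) + " score=" + dot(query, doc));
    }
}
```

Because the padding row is masked out, its large values never reach the pooled vector. This is also why `embed()` returning an all-zero vector on failure yields a similarity of 0 rather than an error.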
diff --git a/zeppelin-server/src/main/java/org/apache/zeppelin/search/LuceneSearch.java b/zeppelin-server/src/main/java/org/apache/zeppelin/search/LuceneSearch.java
index 3f28f8eb65..904069fb33 100644
--- a/zeppelin-server/src/main/java/org/apache/zeppelin/search/LuceneSearch.java
+++ b/zeppelin-server/src/main/java/org/apache/zeppelin/search/LuceneSearch.java
@@ -190,9 +190,16 @@ public class LuceneSearch extends SearchService {
header = "";
}
matchingParagraphs.add(
- ImmutableMap.of(
- "id", path, // <noteId>/paragraph/<paragraphId>
- "name", title, "snippet", fragment, "text", text, "header", header));
+ ImmutableMap.<String, String>builder()
+ .put("id", path)
+ .put("name", title)
+ .put("snippet", fragment)
+ .put("text", text)
+ .put("header", header)
+ .put("title", header)
+ .put("tables", "")
+ .put("output", "")
+ .build());
} else {
LOGGER.info("{}. No {} for this document", i + 1, ID_FIELD);
}
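
Reviewer note (illustrative, not part of the patch): LuceneSearch returns empty `tables` and `output` values because only EmbeddingSearch computes them. In `EmbeddingSearch.query`, the `tables` field drives the second ranking phase: each table seen in a top hit earns weight 1/(rank+1), and tables whose summed weight reaches a fraction of the best table's weight are treated as relevant and boosted. A minimal sketch of that weighting follows. The class name `TableBoostSketch`, the `ratio` parameter values, and the sample table names are assumptions; the actual `TABLE_WEIGHT_THRESHOLD_RATIO` constant is not shown in this part of the diff.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy re-statement of the rank-weighted table discovery; hypothetical class. */
class TableBoostSketch {

    /**
     * @param tablesByRank space-separated table names of each hit, best hit first
     * @param ratio        fraction of the top table's weight a table must reach
     */
    static Set<String> relevantTables(List<String> tablesByRank, float ratio) {
        Map<String, Float> weights = new HashMap<>();
        for (int i = 0; i < tablesByRank.size(); i++) {
            String tables = tablesByRank.get(i);
            if (tables == null || tables.trim().isEmpty()) {
                continue; // hit references no tables
            }
            float weight = 1.0f / (i + 1); // earlier hits count more
            for (String t : tables.split(" ")) {
                weights.merge(t, weight, Float::sum);
            }
        }
        Set<String> keep = new TreeSet<>();
        if (weights.isEmpty()) {
            return keep;
        }
        float threshold = Collections.max(weights.values()) * ratio;
        weights.forEach((t, w) -> {
            if (w >= threshold) {
                keep.add(t);
            }
        });
        return keep;
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList(
            "analytics.daily_sales", "analytics.daily_sales sessions", "", "sessions");
        System.out.println(relevantTables(hits, 0.6f));
    }
}
```

In this sample `analytics.daily_sales` accumulates 1.0 + 0.5 and `sessions` accumulates 0.5 + 0.25, so a stricter ratio keeps only the first table while a looser one keeps both.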
diff --git a/zeppelin-server/src/main/java/org/apache/zeppelin/server/ZeppelinServer.java b/zeppelin-server/src/main/java/org/apache/zeppelin/server/ZeppelinServer.java
index eca789e38b..b3f78816ae 100644
--- a/zeppelin-server/src/main/java/org/apache/zeppelin/server/ZeppelinServer.java
+++ b/zeppelin-server/src/main/java/org/apache/zeppelin/server/ZeppelinServer.java
@@ -87,6 +87,7 @@ import org.apache.zeppelin.notebook.scheduler.NoSchedulerService;
import org.apache.zeppelin.notebook.scheduler.QuartzSchedulerService;
import org.apache.zeppelin.notebook.scheduler.SchedulerService;
import org.apache.zeppelin.plugin.PluginManager;
+import org.apache.zeppelin.search.EmbeddingSearch;
import org.apache.zeppelin.search.LuceneSearch;
import org.apache.zeppelin.search.NoSearchService;
import org.apache.zeppelin.search.SearchService;
@@ -210,7 +211,11 @@ public class ZeppelinServer implements AutoCloseable {
bind(NoSchedulerService.class).to(SchedulerService.class).in(Singleton.class);
}
if (zConf.getBoolean(ConfVars.ZEPPELIN_SEARCH_ENABLE)) {
- bind(LuceneSearch.class).to(SearchService.class).in(Singleton.class);
+ if (zConf.isZeppelinSearchSemanticEnable()) {
+ bind(EmbeddingSearch.class).to(SearchService.class).in(Singleton.class);
+ } else {
+ bind(LuceneSearch.class).to(SearchService.class).in(Singleton.class);
+ }
} else {
bind(NoSearchService.class).to(SearchService.class).in(Singleton.class);
}
diff --git a/zeppelin-server/src/test/java/org/apache/zeppelin/search/EmbeddingSearchTest.java b/zeppelin-server/src/test/java/org/apache/zeppelin/search/EmbeddingSearchTest.java
new file mode 100644
index 0000000000..2eb9d4be7b
--- /dev/null
+++ b/zeppelin-server/src/test/java/org/apache/zeppelin/search/EmbeddingSearchTest.java
@@ -0,0 +1,364 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.zeppelin.search;
+
+import static org.apache.zeppelin.search.EmbeddingSearch.formatId;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.mockito.Mockito.mock;
+import static org.mockito.Mockito.when;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Files;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.commons.io.FileUtils;
+import org.apache.zeppelin.conf.ZeppelinConfiguration;
+import org.apache.zeppelin.interpreter.InterpreterFactory;
+import org.apache.zeppelin.interpreter.InterpreterSetting;
+import org.apache.zeppelin.interpreter.InterpreterSettingManager;
+import org.apache.zeppelin.notebook.AuthorizationService;
+import org.apache.zeppelin.notebook.Note;
+import org.apache.zeppelin.notebook.NoteManager;
+import org.apache.zeppelin.notebook.Notebook;
+import org.apache.zeppelin.notebook.Paragraph;
+import org.apache.zeppelin.notebook.repo.InMemoryNotebookRepo;
+import org.apache.zeppelin.notebook.repo.NotebookRepo;
+import org.apache.zeppelin.user.AuthenticationInfo;
+import org.apache.zeppelin.user.Credentials;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.condition.EnabledIfEnvironmentVariable;
+
+/**
+ * Tests for {@link EmbeddingSearch}.
+ *
+ * <p>These tests require the ONNX model to be downloaded, so they are gated behind
+ * the {@code ZEPPELIN_EMBEDDING_TEST} environment variable. To run:
+ * <pre>
+ * ZEPPELIN_EMBEDDING_TEST=true mvn test -pl zeppelin-server \
+ * -Dtest=EmbeddingSearchTest
+ * </pre>
+ *
+ * <p>The model (~86MB) is downloaded once to a temp directory and cached for the
+ * duration of the test run.
+ */
+@EnabledIfEnvironmentVariable(named = "ZEPPELIN_EMBEDDING_TEST", matches = "true")
+class EmbeddingSearchTest {
+
+ /** Shared model directory: avoids re-downloading 86MB model per test method. */
+ private static File sharedModelDir;
+
+ private Notebook notebook;
+ private InterpreterSettingManager interpreterSettingManager;
+ private NoteManager noteManager;
+ private EmbeddingSearch searchService;
+ private File indexDir;
+
+ @BeforeEach
+ public void startUp() throws IOException {
+ if (sharedModelDir == null) {
+ // Look for model in the default install location first
+ File defaultModelDir = new File("/tmp/zeppelin-index/models");
+ if (defaultModelDir.exists()
+ && new File(defaultModelDir, "all-MiniLM-L6-v2/model.onnx").exists()) {
+ sharedModelDir = defaultModelDir;
+ } else {
+ sharedModelDir = Files.createTempDirectory("EmbeddingSearchTest-models").toFile();
+ }
+ }
+ indexDir = Files.createTempDirectory(this.getClass().getSimpleName()).toFile();
+ // Symlink models dir so model is cached across tests
+ File modelsLink = new File(indexDir, "models");
+ Files.createSymbolicLink(modelsLink.toPath(), sharedModelDir.toPath());
+ ZeppelinConfiguration zConf = ZeppelinConfiguration.load();
+ zConf.setProperty(ZeppelinConfiguration.ConfVars.ZEPPELIN_SEARCH_INDEX_PATH.getVarName(),
+ indexDir.getAbsolutePath());
+
+ noteManager = new NoteManager(new InMemoryNotebookRepo(), zConf);
+ interpreterSettingManager = mock(InterpreterSettingManager.class);
+ InterpreterSetting defaultInterpreterSetting = mock(InterpreterSetting.class);
+ when(defaultInterpreterSetting.getName()).thenReturn("test");
+ when(interpreterSettingManager.getDefaultInterpreterSetting())
+ .thenReturn(defaultInterpreterSetting);
+ notebook = new Notebook(zConf, mock(AuthorizationService.class),
+ mock(NotebookRepo.class), noteManager,
+ mock(InterpreterFactory.class), interpreterSettingManager,
+ mock(Credentials.class), null);
+ searchService = new EmbeddingSearch(zConf, notebook);
+ }
+
+ @AfterEach
+ public void shutDown() throws IOException {
+ searchService.close();
+ FileUtils.deleteDirectory(indexDir);
+ }
+
+ private void drainSearchEvents() throws InterruptedException {
+ while (!searchService.isEventQueueEmpty()) {
+ Thread.sleep(500);
+ }
+ Thread.sleep(500);
+ }
+
+ @Test
+ void canIndexAndQuery() throws IOException, InterruptedException {
+ // given
+ newNoteWithParagraph("Notebook1", "test");
+ String note2Id = newNoteWithParagraphs("Notebook2", "not test", "not test at all");
+ drainSearchEvents();
+
+ // when — semantic search for a meaningful phrase
+ List<Map<String, String>> results = searchService.query("testing something");
+
+ // then
+ assertFalse(results.isEmpty());
+ boolean foundTest = results.stream()
+ .anyMatch(r -> r.get("text").contains("test"));
+ assertTrue(foundTest, "Should find paragraph containing 'test'");
+ }
+
+ @Test
+ void canIndexAndQueryByNotebookName() throws IOException, InterruptedException {
+ // given
+ newNoteWithParagraph("Notebook1", "test");
+ newNoteWithParagraphs("Notebook2", "not test", "not test at all");
+ drainSearchEvents();
+
+ // when
+ List<Map<String, String>> results = searchService.query("Notebook1");
+
+ // then
+ assertFalse(results.isEmpty());
+ assertTrue(results.get(0).get("name").contains("Notebook1"));
+ }
+
+ @Test
+ void canIndexAndQueryByParagraphTitle() throws IOException,
InterruptedException {
+ // given
+ newNoteWithParagraph("Notebook1", "test", "testingTitleSearch");
+ newNoteWithParagraph("Notebook2", "not test", "notTestingTitleSearch");
+ drainSearchEvents();
+
+ // when
+ List<Map<String, String>> results =
searchService.query("testingTitleSearch");
+
+ // then
+ assertFalse(results.isEmpty());
+ boolean foundTitle = results.stream()
+ .anyMatch(r -> r.get("header").contains("testingTitleSearch"));
+ assertTrue(foundTitle);
+ }
+
+ @Test
+ void semanticSearchFindsRelatedConcepts() throws IOException,
InterruptedException {
+ // given — this is the key test that differentiates from Lucene
+ newNoteWithParagraph("SpendAnalysis",
+ "SELECT sum(cost) FROM analytics.daily_sales WHERE date = current_date - interval '1' day");
+ newNoteWithParagraph("UserCounts",
+ "SELECT count(distinct user_id) FROM sessions WHERE region = 'us'");
+ drainSearchEvents();
+
+ // when — natural language query, no exact keyword match
+ List<Map<String, String>> results = searchService.query("yesterday's spending");
+
+ // then — should rank the spend query higher than the user count query
+ assertFalse(results.isEmpty());
+ assertEquals("SpendAnalysis", results.get(0).get("name"),
+ "Semantic search should rank spend-related paragraph first");
+ }
+
+ @Test
+ void indexKeyContract() throws IOException, InterruptedException {
+ // given
+ String note1Id = newNoteWithParagraph("Notebook1", "test");
+ drainSearchEvents();
+
+ // when
+ List<Map<String, String>> results = searchService.query("test");
+ assertFalse(results.isEmpty());
+
+ // then — find the paragraph result (not the note-name result)
+ String id = results.stream()
+ .filter(r -> r.get("id").contains("paragraph"))
+ .findFirst()
+ .map(r -> r.get("id"))
+ .orElse("");
+
+ notebook.processNote(note1Id, note1 -> {
+ String expected = formatId(note1.getId(), note1.getLastParagraph());
+ assertEquals(expected, id, "Key should be <noteId>/paragraph/<paragraphId>");
+ return null;
+ });
+ }
+
+ @Test
+ void canNotSearchBeforeIndexing() {
+ // given NO indexing was done
+ // when
+ List<Map<String, String>> result = searchService.query("anything");
+ // then
+ assertTrue(result.isEmpty());
+ }
+
+ @Test
+ void canIndexAndReIndex() throws IOException, InterruptedException {
+ // given
+ newNoteWithParagraph("Notebook1", "test");
+ String note2Id = newNoteWithParagraphs("Notebook2", "not test", "not test at all");
+ drainSearchEvents();
+
+ // when
+ notebook.processNote(note2Id, note2 -> {
+ Paragraph p2 = note2.getLastParagraph();
+ p2.setText("updated paragraph with unique content about reindexing");
+ searchService.updateParagraphIndex(note2Id, p2.getId());
+ return null;
+ });
+
+ // then — updated content should now be findable
+ List<Map<String, String>> results = searchService.query("reindexing updated content");
+ assertFalse(results.isEmpty());
+ }
+
+ @Test
+ void canDeleteNull() {
+ // should not throw
+ searchService.deleteNoteIndex(null);
+ }
+
+ @Test
+ void canDeleteFromIndex() throws IOException, InterruptedException {
+ // given
+ newNoteWithParagraph("Notebook1", "test");
+ String note2Id = newNoteWithParagraphs("Notebook2", "not test", "not test at all");
+ drainSearchEvents();
+
+ assertFalse(searchService.query("Notebook2").isEmpty());
+
+ // when
+ searchService.deleteNoteIndex(note2Id);
+
+ // then — no results should reference the deleted note's ID
+ boolean foundNote2After = searchService.query("not test at all").stream()
+ .anyMatch(r -> r.get("id").startsWith(note2Id));
+ assertFalse(foundNote2After, "Note2 should be removed from index after deletion");
+ assertFalse(searchService.query("Notebook1").isEmpty());
+ }
+
+ @Test
+ void indexParagraphUpdatedOnNoteSave() throws IOException,
InterruptedException {
+ // given
+ String note1Id = newNoteWithParagraph("Notebook1", "test");
+ newNoteWithParagraphs("Notebook2", "not test", "not test at all");
+ drainSearchEvents();
+
+ // when
+ notebook.processNote(note1Id, note1 -> {
+ Paragraph p1 = note1.getLastParagraph();
+ p1.setText("no no no");
+ notebook.saveNote(note1, AuthenticationInfo.ANONYMOUS);
+ p1.getNote().fireParagraphUpdateEvent(p1);
+ return null;
+ });
+ drainSearchEvents();
+
+ // then — "Notebook1" note name should still be findable
+ assertFalse(searchService.query("Notebook1").isEmpty());
+ }
+
+ @Test
+ void newParagraphIsLiveIndexed() throws IOException, InterruptedException {
+ // given — one notebook exists
+ String noteId = newNoteWithParagraph("Analytics", "SELECT 1");
+ drainSearchEvents();
+
+ // when — add a new paragraph with unique content
+ notebook.processNote(noteId, note -> {
+ Paragraph p = note.addNewParagraph(AuthenticationInfo.ANONYMOUS);
+ p.setText("SELECT customer_id, SUM(amount) as lifetime_value FROM orders GROUP BY 1");
+ notebook.saveNote(note, AuthenticationInfo.ANONYMOUS);
+ note.fireParagraphUpdateEvent(p);
+ return null;
+ });
+ drainSearchEvents();
+
+ // then — the new paragraph should be findable by semantic query
+ List<Map<String, String>> results = searchService.query("lifetime value");
+ assertFalse(results.isEmpty(), "Newly added paragraph should be searchable");
+ boolean found = results.stream()
+ .anyMatch(r -> r.get("text").contains("lifetime_value"));
+ assertTrue(found, "Should find the paragraph with lifetime_value");
+ }
+
+ // ---- Helper methods (same as LuceneSearchTest) ----
+
+ private String newNoteWithParagraph(String noteName, String parText) throws
IOException {
+ String noteId = newNote(noteName);
+ notebook.processNote(noteId, note -> {
+ addParagraphWithText(note, parText);
+ return null;
+ });
+ // Re-index after paragraphs are added (createNote event may fire before paragraphs exist)
+ searchService.updateNoteIndex(noteId);
+ return noteId;
+ }
+
+ private String newNoteWithParagraph(String noteName, String parText, String
title)
+ throws IOException {
+ String noteId = newNote(noteName);
+ notebook.processNote(noteId, note -> {
+ addParagraphWithTextAndTitle(note, parText, title);
+ return null;
+ });
+ searchService.updateNoteIndex(noteId);
+ return noteId;
+ }
+
+ private String newNoteWithParagraphs(String noteName, String... parTexts)
throws IOException {
+ String noteId = newNote(noteName);
+ notebook.processNote(noteId, note -> {
+ for (String parText : parTexts) {
+ addParagraphWithText(note, parText);
+ }
+ return null;
+ });
+ searchService.updateNoteIndex(noteId);
+ return noteId;
+ }
+
+ private Paragraph addParagraphWithText(Note note, String text) {
+ Paragraph p = note.addNewParagraph(AuthenticationInfo.ANONYMOUS);
+ p.setText(text);
+ return p;
+ }
+
+ private Paragraph addParagraphWithTextAndTitle(Note note, String text,
String title) {
+ Paragraph p = note.addNewParagraph(AuthenticationInfo.ANONYMOUS);
+ p.setText(text);
+ p.setTitle(title);
+ return p;
+ }
+
+ private String newNote(String name) throws IOException {
+ return notebook.createNote(name, AuthenticationInfo.ANONYMOUS);
+ }
+}
diff --git a/zeppelin-web-angular/src/app/interfaces/notebook.ts
b/zeppelin-web-angular/src/app/interfaces/notebook.ts
index c6c591524b..08db8f3acf 100644
--- a/zeppelin-web-angular/src/app/interfaces/notebook.ts
+++ b/zeppelin-web-angular/src/app/interfaces/notebook.ts
@@ -16,6 +16,9 @@ export interface NotebookSearchResultItem {
snippet: string;
text: string;
header: string;
+ title?: string;
+ tables?: string;
+ output?: string;
}
export interface NotebookCapabilities {
diff --git
a/zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.html
b/zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.html
index 19e3ccb6ba..5e393de912 100644
---
a/zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.html
+++
b/zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.html
@@ -10,13 +10,17 @@
~ limitations under the License.
-->
-<nz-card [nzTitle]="titleTemplateRef">
- <ng-template #titleTemplateRef>
- <a [routerLink]="routerLink" [queryParams]="queryParams">{{ displayName
}}</a>
- </ng-template>
- <zeppelin-code-editor
- [style.height.px]="height"
- [nzEditorOption]="editorOption"
- (nzEditorInitialized)="initializedEditor($event)"
- ></zeppelin-code-editor>
+<nz-card class="result-card" (click)="navigateToResult()">
+ <div class="result-header">
+ <a [routerLink]="routerLink" [queryParams]="queryParams"
(click)="$event.stopPropagation()">{{ displayName }}</a>
+ <span *ngIf="interpreter" class="badge" [ngClass]="interpreter">{{
interpreter }}</span>
+ </div>
+ <div *ngIf="titleHtml" class="title-block" [innerHTML]="titleHtml"></div>
+ <div *ngIf="codeHtml" class="code-block">
+ <pre [innerHTML]="codeHtml"></pre>
+ </div>
+ <div *ngIf="outputText" class="output-block">
+ <pre>{{ outputText }}</pre>
+ </div>
+ <div *ngIf="tablesText" class="tables-block">Tables: {{ tablesText }}</div>
</nz-card>
diff --git
a/zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.less
b/zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.less
index cb24d4e47b..96caf1916d 100644
---
a/zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.less
+++
b/zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.less
@@ -10,10 +10,131 @@
* limitations under the License.
*/
-::ng-deep {
- .monaco-editor {
- .mark {
- background: #fdf733;
- }
+@import 'theme-mixin';
+
+:host {
+ display: block;
+ margin-bottom: 12px;
+}
+
+.result-card {
+ cursor: pointer;
+ user-select: text;
+}
+
+.result-header {
+ display: flex;
+ align-items: center;
+ gap: 8px;
+ margin-bottom: 8px;
+}
+
+.badge {
+ font-size: 11px;
+ padding: 1px 8px;
+ border-radius: 10px;
+}
+
+.code-block {
+ border-radius: 6px;
+ padding: 10px 12px;
+ margin-bottom: 8px;
+ overflow-x: auto;
+
+ pre {
+ margin: 0;
+ font-size: 12px;
+ line-height: 1.5;
+ white-space: pre-wrap;
+ word-break: break-word;
+ max-height: 200px;
+ overflow-y: auto;
}
}
+
+.output-block {
+ border-radius: 0 4px 4px 0;
+ padding: 8px 12px;
+ margin-bottom: 8px;
+ overflow-x: auto;
+
+ pre {
+ margin: 0;
+ font-size: 11px;
+ line-height: 1.4;
+ white-space: pre-wrap;
+ word-break: break-word;
+ max-height: 120px;
+ overflow-y: auto;
+ }
+}
+
+.title-block {
+ font-size: 12px;
+ padding: 4px 0;
+ margin-bottom: 4px;
+}
+
+.tables-block {
+ font-size: 12px;
+ padding: 4px 0;
+}
+
+mark {
+ padding: 0 1px;
+ border-radius: 2px;
+}
+
+.themeMixin({
+ .badge {
+ background: @background-color-base;
+ color: @text-color-secondary;
+ }
+
+ .badge.sql {
+ background: @green-1;
+ color: @green-7;
+ }
+
+ .badge.python, .badge.pyspark {
+ background: @gold-1;
+ color: @gold-7;
+ }
+
+ .badge.md {
+ background: @blue-1;
+ color: @blue-6;
+ }
+
+ .code-block {
+ background: @background-color-light;
+ border: 1px solid @border-color-split;
+
+ pre {
+ font-family: @code-family;
+ color: @text-color;
+ }
+ }
+
+ .output-block {
+ background: @background-color-light;
+ border-left: 3px solid @border-color-base;
+
+ pre {
+ font-family: @code-family;
+ color: @text-color-secondary;
+ }
+ }
+
+ .title-block {
+ color: @text-color-secondary;
+ }
+
+ .tables-block {
+ color: @green-7;
+ }
+
+ mark {
+ background-color: @gold-1;
+ }
+});
diff --git
a/zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.ts
b/zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.ts
index 046a83c7c7..1c8e10545e 100644
---
a/zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.ts
+++
b/zeppelin-web-angular/src/app/pages/workspace/notebook-search/result-item/result-item.component.ts
@@ -10,23 +10,9 @@
* limitations under the License.
*/
-import {
- ChangeDetectionStrategy,
- ChangeDetectorRef,
- Component,
- Input,
- NgZone,
- OnChanges,
- OnDestroy,
- SimpleChanges
-} from '@angular/core';
-import { ActivatedRoute } from '@angular/router';
+import { ChangeDetectionStrategy, Component, Input, OnChanges, SimpleChanges }
from '@angular/core';
+import { ActivatedRoute, Router } from '@angular/router';
import { NotebookSearchResultItem } from '@zeppelin/interfaces';
-import { JoinedEditorOptions } from '@zeppelin/share';
-import { getKeywordPositions, KeywordPosition } from '@zeppelin/utility';
-import { editor, Range } from 'monaco-editor';
-import IEditor = editor.IEditor;
-import IStandaloneCodeEditor = editor.IStandaloneCodeEditor;
@Component({
selector: 'zeppelin-notebook-search-result-item',
@@ -34,40 +20,39 @@ import IStandaloneCodeEditor = editor.IStandaloneCodeEditor;
styleUrls: ['./result-item.component.less'],
changeDetection: ChangeDetectionStrategy.OnPush
})
-export class NotebookSearchResultItemComponent implements OnChanges, OnDestroy
{
+export class NotebookSearchResultItemComponent implements OnChanges {
@Input() result!: NotebookSearchResultItem;
queryParams = {};
displayName = '';
routerLink: string[] = [];
- mergedStr?: string;
- keywords: string[] = [];
- highlightPositions: KeywordPosition[] = [];
- editor?: IStandaloneCodeEditor;
- height = 0;
- decorations: string[] = [];
- editorOption = {
- readOnly: true,
- fontSize: 12,
- renderLineHighlight: 'none',
- minimap: { enabled: false },
- lineNumbers: 'off',
- glyphMargin: false,
- scrollBeyondLastLine: false,
- contextmenu: false,
- scrollbar: {
- handleMouseWheel: false,
- alwaysConsumeMouseWheel: false
- }
- } as JoinedEditorOptions;
+ codeText = '';
+ codeHtml = '';
+ outputText = '';
+ tablesText = '';
+ titleHtml = '';
+ interpreter = '';
constructor(
- private ngZone: NgZone,
- private cdr: ChangeDetectorRef,
- private router: ActivatedRoute
+ private route: ActivatedRoute,
+ private router: Router
) {}
- setDisplayNameAndRouterLink(): void {
- const term = this.router.snapshot.params.queryStr;
+ ngOnChanges(changes: SimpleChanges): void {
+ if (changes.result) {
+ this.parseResult();
+ }
+ }
+
+ navigateToResult(): void {
+ const selection = window.getSelection();
+ if (selection && selection.toString().length > 0) {
+ return;
+ }
+ this.router.navigate(this.routerLink, { queryParams: this.queryParams });
+ }
+
+ private parseResult(): void {
+ const term = this.route.snapshot.params.queryStr;
const listOfId = this.result.id.split('/');
const [noteId, hasParagraph, paragraph] = listOfId;
if (!hasParagraph) {
@@ -75,110 +60,68 @@ export class NotebookSearchResultItemComponent implements
OnChanges, OnDestroy {
this.queryParams = {};
} else {
this.routerLink = ['/', 'notebook', noteId];
- this.queryParams = {
- paragraph,
- term
- };
+ this.queryParams = { paragraph, term };
}
this.displayName = this.result.name ? this.result.name : `Note ${noteId}`;
- }
- setHighlightKeyword(): void {
- let mergedStr = this.result.header ?
`${this.result.header}\n\n${this.result.snippet}` : this.result.snippet;
+ const snippet = this.result.snippet || '';
+ // HTML-escape first so raw '<' in code (e.g. WHERE id < 100) is not parsed
+ // as a DOM tag, then promote only the Lucene <B> markers to <mark>.
+ this.codeHtml = this.highlightToMark(snippet);
+ this.codeText = snippet.replace(/<\/?B>/gi, '');
+ this.interpreter = this.detectInterpreter(this.codeText);
- const regexp = /<B>(.+?)<\/B>/g;
- const matches = [];
- let match = regexp.exec(mergedStr);
+ const title = this.result.title || '';
+ this.titleHtml = this.highlightToMark(title);
- while (match !== null) {
- if (match[1]) {
- matches.push(match[1].toLocaleLowerCase());
- }
- match = regexp.exec(mergedStr);
- }
-
- mergedStr = mergedStr.replace(regexp, '$1');
- this.mergedStr = mergedStr;
- const keywords = [...new Set(matches)];
- this.highlightPositions = getKeywordPositions(keywords, mergedStr);
+ const tables = this.result.tables || '';
+ this.tablesText = tables
+ .trim()
+ .split(/\s+/)
+ .filter(t => t)
+ .join(', ');
+ this.outputText = this.result.output || '';
}
- applyHighlight() {
- if (this.editor) {
- this.decorations = this.editor.deltaDecorations(
- this.decorations,
- this.highlightPositions.map(highlight => {
- const line = highlight.line + 1;
- const character = highlight.character + 1;
- return {
- range: new Range(line, character, line, character +
highlight.length),
- options: {
- className: 'mark',
- stickiness: 1
- }
- };
- })
- );
- this.cdr.markForCheck();
- }
+ private highlightToMark(text: string): string {
+ // Escape HTML so raw '<' in source (e.g. WHERE id < 100) is not parsed as
+ // a DOM tag, then convert the Lucene <B>/<\/B> markers back to <mark>.
+ return text
+ .replace(/&/g, '&amp;')
+ .replace(/</g, '&lt;')
+ .replace(/>/g, '&gt;')
+ .replace(/&lt;B&gt;/gi, '<mark>')
+ .replace(/&lt;\/B&gt;/gi, '</mark>');
}
- setLanguage() {
- const model = this.editor?.getModel();
- if (!model) {
- throw new Error('Editor model is not defined.');
+ private detectInterpreter(text: string): string {
+ if (!text) {
+ return '';
}
- const editorModes = {
- scala: /^%(\w*\.)?(spark|flink)/,
- python: /^%(\w*\.)?(pyspark|python)/,
- html: /^%(\w*\.)?(angular|ng)/,
- r: /^%(\w*\.)?(r|sparkr|knitr)/,
- sql: /^%(\w*\.)?\wql/,
- yaml: /^%(\w*\.)?\wconf/,
- markdown: /^%md/,
- shell: /^%sh/
- };
- let mode = 'text';
- for (const [modeOption, regex] of Object.entries(editorModes)) {
- if (regex.test(this.result.snippet)) {
- mode = modeOption;
- break;
- }
+ // Check interpreter prefix first — this is reliable
+ if (/^%(\w*\.)?sql/i.test(text)) {
+ return 'sql';
}
- editor.setModelLanguage(model, mode);
- }
-
- autoAdjustEditorHeight() {
- this.ngZone.run(() => {
- setTimeout(() => {
- const model = this.editor?.getModel();
- if (model) {
- this.height =
this.editor!.getOption(monaco.editor.EditorOption.lineHeight) *
(model.getLineCount() + 2);
- this.editor!.layout();
- this.cdr.markForCheck();
- }
- });
- });
- }
-
- initializedEditor(editorInstance: IEditor) {
- this.editor = editorInstance as IStandaloneCodeEditor;
- this.editor.setValue(this.mergedStr ?? '');
- this.setLanguage();
- this.autoAdjustEditorHeight();
- this.applyHighlight();
- }
-
- ngOnChanges(changes: SimpleChanges): void {
- if (changes.result) {
- this.setDisplayNameAndRouterLink();
- this.setHighlightKeyword();
- this.autoAdjustEditorHeight();
- this.applyHighlight();
+ if (/^%(\w*\.)?py/i.test(text)) {
+ return 'python';
}
- }
-
- ngOnDestroy(): void {
- this.editor?.dispose();
+ if (/^%md/i.test(text)) {
+ return 'md';
+ }
+ if (/^%sh/i.test(text)) {
+ return 'sh';
+ }
+ // Fall back to conservative heuristics only if no prefix present.
+ // Require SELECT ... FROM pattern to avoid false positives from Python
+ // "from ... import" or markdown containing words like "create".
+ if (!text.startsWith('%')) {
+ if (/\bSELECT\b/i.test(text) && /\bFROM\b/i.test(text)) {
+ return 'sql';
+ }
+ if (/^(import |from \w+ import |def |class )/m.test(text)) {
+ return 'python';
+ }
+ }
+ return '';
}
}
diff --git
a/zeppelin-web-angular/src/app/pages/workspace/notebook/notebook.component.ts
b/zeppelin-web-angular/src/app/pages/workspace/notebook/notebook.component.ts
index ff73912d18..6656945188 100644
---
a/zeppelin-web-angular/src/app/pages/workspace/notebook/notebook.component.ts
+++
b/zeppelin-web-angular/src/app/pages/workspace/notebook/notebook.component.ts
@@ -23,7 +23,7 @@ import { Title } from '@angular/platform-browser';
import { ActivatedRoute, Router } from '@angular/router';
import { isNil } from 'lodash';
import { Subject } from 'rxjs';
-import { distinctUntilKeyChanged, map, startWith, takeUntil } from
'rxjs/operators';
+import { distinctUntilKeyChanged, startWith, takeUntil } from 'rxjs/operators';
import { NzResizeEvent } from 'ng-zorro-antd/resizable';
@@ -422,14 +422,12 @@ export class NotebookComponent extends
MessageListenersManager implements OnInit
ngOnInit() {
this.activatedRoute.queryParamMap
- .pipe(
- startWith(this.activatedRoute.snapshot.queryParamMap),
- takeUntil(this.destroy$),
- map(data => data.get('paragraph'))
- )
- .subscribe(id => {
+ .pipe(startWith(this.activatedRoute.snapshot.queryParamMap),
takeUntil(this.destroy$))
+ .subscribe(params => {
+ const id = params.get('paragraph');
this.onParagraphSelect(id);
this.onParagraphScrolled(id);
+ this.onParagraphSearch(params.get('term') || '');
});
this.activatedRoute.params.pipe(takeUntil(this.destroy$),
distinctUntilKeyChanged('noteId')).subscribe(() => {
this.noteVarShareService.clear();
diff --git a/zeppelin-web-angular/src/app/share/header/header.component.html
b/zeppelin-web-angular/src/app/share/header/header.component.html
index d77aa12df7..c9d3246ee8 100644
--- a/zeppelin-web-angular/src/app/share/header/header.component.html
+++ b/zeppelin-web-angular/src/app/share/header/header.component.html
@@ -78,8 +78,18 @@
</div>
<div class="search">
<nz-input-group [nzPrefixIcon]="'search'">
- <input type="text" nz-input placeholder="Search"
(keyup.enter)="onSearch()" [(ngModel)]="queryStr" />
+ <input
+ type="text"
+ nz-input
+ placeholder="Search"
+ list="search-history"
+ (keyup.enter)="onSearch()"
+ [(ngModel)]="queryStr"
+ />
</nz-input-group>
+ <datalist id="search-history">
+ <option *ngFor="let item of searchHistory" [value]="item"></option>
+ </datalist>
</div>
<zeppelin-theme-toggle class="theme-toggle"></zeppelin-theme-toggle>
</div>
diff --git a/zeppelin-web-angular/src/app/share/header/header.component.ts
b/zeppelin-web-angular/src/app/share/header/header.component.ts
index 5dbb14f9b3..2692cb648d 100644
--- a/zeppelin-web-angular/src/app/share/header/header.component.ts
+++ b/zeppelin-web-angular/src/app/share/header/header.component.ts
@@ -34,6 +34,9 @@ export class HeaderComponent extends MessageListenersManager
implements OnInit,
noteListVisible = false;
queryStr: string | null = null;
classicUiHref: string;
+ searchHistory: string[] = [];
+ private static readonly HISTORY_KEY = 'zeppelin.search.history';
+ private static readonly MAX_HISTORY = 20;
about() {
this.nzModalService.create({
@@ -54,10 +57,20 @@ export class HeaderComponent extends
MessageListenersManager implements OnInit,
}
this.queryStr = this.queryStr.trim();
if (this.queryStr) {
+ this.addToHistory(this.queryStr);
this.router.navigate(['/search', this.queryStr]);
}
}
+ private addToHistory(term: string): void {
+ this.searchHistory = this.searchHistory.filter(h => h !== term);
+ this.searchHistory.unshift(term);
+ if (this.searchHistory.length > HeaderComponent.MAX_HISTORY) {
+ this.searchHistory = this.searchHistory.slice(0,
HeaderComponent.MAX_HISTORY);
+ }
+ localStorage.setItem(HeaderComponent.HISTORY_KEY,
JSON.stringify(this.searchHistory));
+ }
+
@MessageListener(OP.CONFIGURATIONS_INFO)
getConfiguration(data: MessageReceiveDataTypeMap[OP.CONFIGURATIONS_INFO]) {
this.ticketService.setConfiguration(data);
@@ -76,6 +89,11 @@ export class HeaderComponent extends MessageListenersManager
implements OnInit,
}
ngOnInit() {
+ try {
+ this.searchHistory =
JSON.parse(localStorage.getItem(HeaderComponent.HISTORY_KEY) || '[]');
+ } catch {
+ this.searchHistory = [];
+ }
this.messageService.listConfigurations();
this.messageService.connectedStatus$.pipe(takeUntil(this.destroy$)).subscribe(status
=> {
this.connectStatus = status ? 'success' : 'error';
diff --git a/zeppelin-web/src/app/search/result-list.controller.js
b/zeppelin-web/src/app/search/result-list.controller.js
index 65c10b1f7b..ea830936e0 100644
--- a/zeppelin-web/src/app/search/result-list.controller.js
+++ b/zeppelin-web/src/app/search/result-list.controller.js
@@ -21,24 +21,73 @@ function SearchResultCtrl($scope, $routeParams,
searchService) {
$scope.searchTerm = $routeParams.searchTerm;
let results = searchService.search({'q': $routeParams.searchTerm}).query();
+ function detectLang(text) {
+ if (!text) {
+ return '';
+ }
+ // Check interpreter prefix first — this is reliable
+ if (/^%(\w*\.)?sql/i.test(text)) {
+ return 'sql';
+ }
+ if (/^%(\w*\.)?py/i.test(text)) {
+ return 'python';
+ }
+ if (/^%md/i.test(text)) {
+ return 'md';
+ }
+ if (/^%sh/i.test(text)) {
+ return 'sh';
+ }
+ // Fall back to conservative heuristics only if no prefix present.
+ // Require SELECT ... FROM pattern to avoid false positives from Python
+ // "from ... import" or markdown containing words like "create".
+ if (!text.startsWith('%')) {
+ if (/\bSELECT\b/i.test(text) && /\bFROM\b/i.test(text)) {
+ return 'sql';
+ }
+ if (/^(import |from \w+ import |def |class )/m.test(text)) {
+ return 'python';
+ }
+ }
+ return '';
+ }
+
+ // HTML-escape raw text so '<' in source (e.g. WHERE id < 100) is not parsed
+ // as a DOM tag, then promote only the Lucene <B>/<\/B> markers to <mark>.
+ function highlightToMark(text) {
+ return text
+ .replace(/&/g, '&amp;')
+ .replace(/</g, '&lt;')
+ .replace(/>/g, '&gt;')
+ .replace(/&lt;B&gt;/gi, '<mark>')
+ .replace(/&lt;\/B&gt;/gi, '</mark>');
+ }
+
results.$promise.then(function(result) {
$scope.notes = result.body.map(function(note) {
- // redirect to notebook when search result is a notebook itself,
- // not a paragraph
if (!/\/paragraph\//.test(note.id)) {
return note;
}
-
note.id = note.id.replace('paragraph/', '?paragraph=') +
'&term=' + $routeParams.searchTerm;
+ let code = (note.snippet || '').replace(/<B>/g, '').replace(/<\/B>/g,
'');
+
+ let tables = (note.tables || '').trim().split(/\s+/).filter(function(t) {
+ return t;
+ }).join(', ');
+
+ note.codeText = code;
+ note.codeHtml = highlightToMark(note.snippet || '');
+ note.titleHtml = highlightToMark(note.title || '');
+ note.outputText = note.output || '';
+ note.tablesText = tables;
+ note.langBadge = detectLang(code);
+
return note;
});
- if ($scope.notes.length === 0) {
- $scope.isResult = false;
- } else {
- $scope.isResult = true;
- }
+
+ $scope.isResult = $scope.notes.length > 0;
$scope.$on('$routeChangeStart', function(event, next, current) {
if (next.originalPath !== '/search/:searchTerm') {
@@ -46,111 +95,4 @@ function SearchResultCtrl($scope, $routeParams,
searchService) {
}
});
});
-
- $scope.page = 0;
- $scope.allResults = false;
-
- $scope.highlightSearchResults = function(note) {
- return function(_editor) {
- function getEditorMode(text) {
- let editorModes = {
- 'ace/mode/scala': /^%(\w*\.)?spark/,
- 'ace/mode/python': /^%(\w*\.)?(pyspark|python)/,
- 'ace/mode/r': /^%(\w*\.)?(r|sparkr|knitr)/,
- 'ace/mode/sql': /^%(\w*\.)?\wql/,
- 'ace/mode/markdown': /^%md/,
- 'ace/mode/sh': /^%sh/,
- };
-
- return Object.keys(editorModes).reduce(function(res, mode) {
- return editorModes[mode].test(text) ? mode : res;
- }, 'ace/mode/scala');
- }
-
- let Range = ace.require('ace/range').Range;
-
- _editor.setOption('highlightActiveLine', false);
- _editor.$blockScrolling = Infinity;
- _editor.setReadOnly(true);
- _editor.renderer.setShowGutter(false);
- _editor.setTheme('ace/theme/chrome');
- _editor.getSession().setMode(getEditorMode(note.text));
-
- function getIndeces(term) {
- return function(str) {
- let indeces = [];
- let i = -1;
- while ((i = str.indexOf(term, i + 1)) >= 0) {
- indeces.push(i);
- }
- return indeces;
- };
- }
-
- let result = '';
- if (note.header !== '') {
- result = note.header + '\n\n' + note.snippet;
- } else {
- result = note.snippet;
- }
-
- let lines = result
- .split('\n')
- .map(function(line, row) {
- let match = line.match(/<B>(.+?)<\/B>/);
-
- // return early if nothing to highlight
- if (!match) {
- return line;
- }
-
- let term = match[1];
- let __line = line
- .replace(/<B>/g, '')
- .replace(/<\/B>/g, '');
-
- let indeces = getIndeces(term)(__line);
-
- indeces.forEach(function(start) {
- let end = start + term.length;
- if (note.header !== '' && row === 0) {
- _editor
- .getSession()
- .addMarker(
- new Range(row, 0, row, line.length),
- 'search-results-highlight-header',
- 'background'
- );
- _editor
- .getSession()
- .addMarker(
- new Range(row, start, row, end),
- 'search-results-highlight',
- 'line'
- );
- } else {
- _editor
- .getSession()
- .addMarker(
- new Range(row, start, row, end),
- 'search-results-highlight',
- 'line'
- );
- }
- });
- return __line;
- });
-
- // resize editor based on content length
- _editor.setOption(
- 'maxLines',
- lines.reduce(function(len, line) {
- return len + line.length;
- }, 0)
- );
-
- _editor.getSession().setValue(lines.join('\n'));
- note.searchResult = lines;
- };
- };
}
diff --git a/zeppelin-web/src/app/search/result-list.html
b/zeppelin-web/src/app/search/result-list.html
index 804fc16724..7f617a08f3 100644
--- a/zeppelin-web/src/app/search/result-list.html
+++ b/zeppelin-web/src/app/search/result-list.html
@@ -13,34 +13,32 @@ limitations under the License.
-->
<div class="searchResults">
<div ng-if="isResult" class="row">
- <div class="col-sm-8" style="margin: 0 auto; float: none">
- <ul class="search-results">
+ <div class="col-sm-8 search-result-container">
+ <ul class="search-results search-result-list">
<li class="panel panel-default" ng-repeat="note in notes">
<div class="panel-heading">
- <h4>
- <i style="font-size: 10px;" class="icon-doc"></i>
+ <h4 class="search-result-heading">
+ <i class="icon-doc"></i>
<a class="search-results-header"
href="#/notebook/{{note.id}}">
{{note.name.trim()==='' && 'Note ' + note.id.split('/',2)[0]
|| note.name}}
</a>
+ <span ng-if="note.langBadge" class="label"
+ ng-class="{'label-success': note.langBadge==='sql',
'label-warning': note.langBadge==='python'||note.langBadge==='pyspark',
'label-info': note.langBadge==='md', 'label-default':
note.langBadge!=='sql'&&note.langBadge!=='python'&&note.langBadge!=='pyspark'&&note.langBadge!=='md'}"
+ style="font-size:11px">{{note.langBadge}}</span>
</h4>
</div>
- <div class="panel-body">
- <div
- class="search-result"
- ui-ace="{
- onLoad: highlightSearchResults(note),
- require: ['ace/ext/language_tools']
- }"
- ng-model="note.searchResult"
- >
- </div>
+ <div class="panel-body search-result-body">
+ <div ng-if="note.titleHtml" ng-bind-html="note.titleHtml"
class="search-result-title"></div>
+ <pre ng-if="note.codeHtml" ng-bind-html="note.codeHtml"
class="search-result-code"></pre>
+ <pre ng-if="note.outputText"
class="search-result-output">{{note.outputText}}</pre>
+ <div ng-if="note.tablesText" class="search-result-tables">Tables:
{{note.tablesText}}</div>
</div>
</li>
</ul>
</div>
</div>
<div ng-if="!isResult" class="search-no-result-found">
- <span class="glyphicon glyphicon-search"></span> We couldn’t find any
notebook matching <b>'{{searchTerm}}' </b>
+ <span class="glyphicon glyphicon-search"></span> We couldn't find any
notebook matching <b>'{{searchTerm}}' </b>
</div>
</div>
diff --git a/zeppelin-web/src/app/search/search.css
b/zeppelin-web/src/app/search/search.css
index 90a7a3f41d..9b5ceb1cf5 100644
--- a/zeppelin-web/src/app/search/search.css
+++ b/zeppelin-web/src/app/search/search.css
@@ -49,3 +49,64 @@
text-align: center;
background-color: #f4f6f8;
}
+
+.search-result-container {
+ margin: 0 auto;
+ float: none;
+}
+
+.search-result-list {
+ list-style: none;
+ padding: 0;
+}
+
+.search-result-heading {
+ margin: 0;
+ display: flex;
+ align-items: center;
+ gap: 8px;
+}
+
+.search-result-heading .icon-doc {
+ font-size: 10px;
+}
+
+.search-result-body {
+ padding: 0;
+}
+
+.search-result-title {
+ padding: 4px 12px;
+ font-size: 12px;
+ color: #777;
+}
+
+.search-result-code {
+ margin: 0;
+ border: none;
+ border-radius: 0;
+ background: #f8f9fa;
+ border-bottom: 1px solid #eee;
+ max-height: 200px;
+ overflow: auto;
+ font-size: 12px;
+}
+
+.search-result-output {
+ margin: 0;
+ border: none;
+ border-radius: 0;
+ background: #fafbfc;
+ border-left: 3px solid #ddd;
+ color: #666;
+ max-height: 120px;
+ overflow: auto;
+ font-size: 11px;
+ padding: 8px 12px;
+}
+
+.search-result-tables {
+ padding: 6px 12px;
+ font-size: 12px;
+ color: #3c763d;
+}
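
### Note on the highlight escaping

Both UIs in this diff escape the snippet before binding it as HTML, then turn only the Lucene `<B>` markers back into `<mark>`. The sketch below is a standalone illustration of that escape-then-promote order, not the committed code. It assumes the snippet contains `<B>`/`</B>` as its only intended markup, and the sample input is made up for illustration.

```typescript
// Minimal sketch, assuming <B>...</B> is the only markup the backend emits.
function highlightToMark(text: string): string {
  return text
    // 1. Escape everything, so a raw '<' in source code stays literal text.
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    // 2. Promote only the (now escaped) highlight markers to real tags.
    .replace(/&lt;B&gt;/gi, '<mark>')
    .replace(/&lt;\/B&gt;/gi, '</mark>');
}

// Hypothetical snippet: a SQL comparison plus one highlighted term.
const sample = 'WHERE id < 100 AND <B>cost</B> > 1';
console.log(highlightToMark(sample));
```

The order matters: promoting the markers before escaping would escape the new `<mark>` tags as well, and skipping the escape would let any tag in the indexed text reach the DOM.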