This is an automated email from the ASF dual-hosted git repository.

ArafatKhan2198 pushed a commit to branch HDDS-15619
in repository https://gitbox.apache.org/repos/asf/ozone-site.git

commit 0c26d34a12c0b61e68031c99f4c1ab7d143a5555
Author: arafat <[email protected]>
AuthorDate: Fri Jun 19 13:12:53 2026 +0530

    HDDS-15619. Add user documentation for Recon AI Assistant.
---
 .../02-recon/03-recon-ai-assistant.mdx             | 484 +++++++++++++++++++++
 1 file changed, 484 insertions(+)

diff --git a/docs/05-administrator-guide/03-operations/09-observability/02-recon/03-recon-ai-assistant.mdx b/docs/05-administrator-guide/03-operations/09-observability/02-recon/03-recon-ai-assistant.mdx
new file mode 100644
index 000000000..1204c9031
--- /dev/null
+++ b/docs/05-administrator-guide/03-operations/09-observability/02-recon/03-recon-ai-assistant.mdx
@@ -0,0 +1,484 @@
+---
+sidebar_label: Recon AI Assistant
+---
+
+# Recon AI Assistant
+
+The **Recon AI Assistant** lets you ask questions about your Apache Ozone 
cluster in plain English
+and get answers assembled from the data Recon already collects. It is an 
optional, **disabled by
+default**, experimental feature of the Recon service.
+
+> **Note:** This page is for operators (who enable, secure, configure and run 
the assistant) and end
+> users (who ask it questions). It is not a code walkthrough; contributors can 
find the internal flow
+> in `CODE_FLOW.md` next to the chatbot source.
+
+## 1. Overview
+
+Recon continuously derives a large amount of cluster metadata — container 
health and replica state,
+namespace and usage rollups, open and pending-delete keys, datanode and 
pipeline status, background
+task and sync state — and exposes it across many REST endpoints and UI 
screens. In practice most of
+that information is never seen or correlated, because you have to know which 
endpoint or screen holds
+the answer.
+
+The assistant closes that gap: you ask a question, and it decides which Recon 
view(s) answer it, runs
+those reads, and writes back a readable summary.
+
+**What it is not:**
+
+- It is **not** a computing or analytics engine — it reports what Recon's 
endpoints return and does
+  not perform ad-hoc aggregations, joins, or math across the cluster.
+- It is **read-only** — it never mutates the cluster.
+- Its results are **bounded** (at most 1000 records per read — see 
[Limits](#11-limits--boundary-conditions)).
+- Its answers reflect Recon's **last metadata sync**, not the live cluster 
state.
+
+> **Important:** The assistant calls an **external LLM provider**, so cluster metadata leaves your
+> network when it is used. Read [Data sent to third-party LLM providers](#5-data-sent-to-third-party-llm-providers)
+> before enabling it. The feature is marked unstable and may change between releases.
+
+## 2. Architecture at a glance
+
+At a high level a question flows through three steps:
+
+1. **Tool selection** — the assistant asks the LLM which Recon view(s) can 
answer the question.
+2. **In-process execution** — Recon runs those reads inside the Recon JVM (no 
HTTP loopback), with
+   hard safety limits applied.
+3. **Summarization** — the raw results are sent back to the LLM, which writes 
the final answer.
+
+The assistant is **provider-agnostic**: OpenAI, Google Gemini, and Anthropic are all reachable behind
+one interface (see [Supported providers & models](#3-supported-providers--models)).
+
+## 3. Supported providers & models
+
+The assistant supports **three** LLM providers. You configure one (or more) by 
supplying an API key;
+a provider with no key is simply unavailable.
+
+| Provider | Reached via | Notes |
+|---|---|---|
+| **OpenAI** | Native OpenAI API (`https://api.openai.com`) | Standard chat-completions API. |
+| **Google Gemini** | Google's **OpenAI-compatible** endpoint (`https://generativelanguage.googleapis.com/v1beta/openai/`) | Used instead of the native Gemini client for reliable timeout handling. |
+| **Anthropic (Claude)** | Native Anthropic API | Sends a beta header for the 1M-token context window (`anthropic.beta.header`). |
+
+**Default model lists** (configurable without a code change; surfaced by `GET /api/v1/chatbot/models`):
+
+| Provider | Config key | Default models |
+|---|---|---|
+| OpenAI | `ozone.recon.chatbot.openai.models` | `gpt-4.1, gpt-4.1-mini, gpt-4.1-nano` |
+| Gemini | `ozone.recon.chatbot.gemini.models` | `gemini-2.5-pro, gemini-2.5-flash, gemini-3-flash-preview, gemini-3.1-pro-preview` |
+| Anthropic | `ozone.recon.chatbot.anthropic.models` | `claude-opus-4-6, claude-sonnet-4-6` |
+
+The **default selection** is provider `gemini`, model `gemini-2.5-flash`.
+
+> **Tip — reasoning vs. fast models:** "Reasoning" models (for example 
`gemini-2.5-pro`) spend output
+> tokens on internal thinking and are slower and more token-hungry; "fast" 
models (for example
+> `gemini-2.5-flash`) return snappier answers. For an interactive assistant, 
prefer a fast model as
+> the default and reserve reasoning models for harder questions.
+
+### Provider & model routing (fallback behavior)
+
+A request may name a `provider` and/or `model`, but the assistant resolves 
them against what is
+actually configured. This explains why an answer can come from a different 
model than requested:
+
+- A requested **provider** is honored only if it is **configured** (has an API 
key). Otherwise the
+  provider is inferred from the requested model; if that fails, the **default 
provider** is used.
+- A requested **model** is used only if it appears in a configured model list. 
Otherwise the
+  **default model** is used.
+- If the resolved model is not valid for the resolved provider, **both reset 
to the defaults**.
+
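The three routing rules above can be read as a small pure function. The following is a minimal Python sketch of that reading, not Recon's actual implementation: the function name `resolve` and the `configured` mapping (each provider that has an API key, mapped to its model list) are hypothetical, and the sketch assumes the default provider itself is configured.

```python
from typing import Dict, List, Optional, Tuple

# Defaults taken from the text above.
DEFAULT_PROVIDER = "gemini"
DEFAULT_MODEL = "gemini-2.5-flash"


def resolve(
    requested_provider: Optional[str],
    requested_model: Optional[str],
    configured: Dict[str, List[str]],
) -> Tuple[str, str]:
    """Return the (provider, model) pair a request resolves to.

    `configured` maps each provider that has an API key to its model list
    (a hypothetical structure used only for this illustration).
    """

    def provider_of(model: Optional[str]) -> Optional[str]:
        # Find the configured provider whose model list contains `model`.
        for provider, models in configured.items():
            if model in models:
                return provider
        return None

    # Rule 1: honor the requested provider only if it is configured;
    # otherwise infer it from the model, then fall back to the default.
    if requested_provider in configured:
        provider = requested_provider
    else:
        provider = provider_of(requested_model) or DEFAULT_PROVIDER

    # Rule 2: use the requested model only if a configured list contains it.
    model = requested_model if provider_of(requested_model) else DEFAULT_MODEL

    # Rule 3: a model that is not valid for the provider resets both.
    if model not in configured.get(provider, []):
        provider, model = DEFAULT_PROVIDER, DEFAULT_MODEL
    return provider, model
```

With only Gemini and OpenAI configured, a request for provider `anthropic` and model `gpt-4.1` resolves to `("openai", "gpt-4.1")` under this reading, while a request for provider `openai` and model `gemini-2.5-pro` resets to the defaults.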
+## 4. Prerequisites & network egress
+
+Before enabling the assistant:
+
+- Recon is deployed and running.
+- You have an account and API key for at least one supported provider.
+- The **Recon server** can make **outbound HTTPS** calls to the provider 
endpoint(s) listed in
+  [Supported providers](#3-supported-providers--models). Only the Recon server 
needs egress — end
+  users' browsers do not.
+
+> **Note:** In firewalled or proxied environments, allowlist the provider 
hostnames on HTTPS (443),
+> or route through your outbound proxy. In air-gapped environments, either 
leave the feature disabled
+> or point the relevant `*.base.url` at an in-VPC, OpenAI-compatible gateway 
(see
+> [Configuration](#8-configuration-reference)).
+
+Each concurrent query holds one Recon worker thread for its full duration (up 
to two LLM calls plus
+up to five Recon reads), so size the thread pool to your expected concurrency 
(see
+[Configuration](#8-configuration-reference)).
+
+## 5. Data sent to third-party LLM providers
+
+Because the assistant calls an external provider, you should understand 
exactly what leaves your
+cluster before enabling it.
+
+**Transmitted to the provider:**
+
+- The user's **question text**.
+- The **system prompts** (the catalog of Recon tools and the semantic guide 
describing them).
+- The **raw JSON results** of the Recon reads used to answer — this is cluster 
**metadata** such as
+  volume / bucket / key names, paths, container and pipeline IDs, sizes, 
counts, and health states.
+- A second round-trip containing those results for **summarization**.
+
+**Not transmitted:**
+
+- Ozone object **data** (file contents) — only metadata is ever read.
+- Any credential beyond the provider's own API authentication.
+
+> **Warning:** Object **names and paths are themselves potentially sensitive** 
— real volume, bucket,
+> and key names can reveal business or data structure. The 1000-record cap 
bounds the *volume* of
+> data sent, not its sensitivity.
+
+**Controls and mitigations:**
+
+- Keep the feature **disabled** where data-egress policy forbids sending 
metadata off-cluster.
+- Encourage **scoped** queries (a specific volume/bucket/path) so less data is 
read and sent.
+- Point `*.base.url` at a **self-hosted or in-VPC** OpenAI-compatible endpoint 
to avoid public egress.
+- Review each provider's **data-retention and training** policy.
+- **Restrict access** to the Recon chat endpoint, since all users share the 
server-configured key.
+
+## 6. Managing API keys (secure vs. insecure storage)
+
+API keys are resolved **server-side only** — they are never accepted per 
request, and every chat user
+shares the single admin-configured key. (This is why you should gate who can 
reach the endpoint.)
+
+**Resolution order** (handled by Recon's `CredentialHelper`):
+
+1. The Hadoop Credential Provider (JCEKS), if 
`hadoop.security.credential.provider.path` is set.
+2. A plaintext value in `ozone-site.xml` (backward-compatible fallback).
+
+### Insecure: plaintext in `ozone-site.xml` (dev/test only)
+
+```xml
+<property>
+  <name>ozone.recon.chatbot.gemini.api.key</name>
+  <value>YOUR_API_KEY</value>
+</property>
+```
+
+> **Warning:** Plaintext keys are readable by anyone who can read 
`ozone-site.xml`. Use this only for
+> local development or testing, never in production.
+
+### Secure: Hadoop Credential Provider (JCEKS) — recommended
+
+The credential **alias must equal the config key name** (for example
+`ozone.recon.chatbot.gemini.api.key`).
+
+1. Create the keystore and add each secret:
+
+   ```bash
+   hadoop credential create ozone.recon.chatbot.gemini.api.key \
+     -provider localjceks://file/etc/security/recon-keys.jceks
+   ```
+
+   Repeat for `ozone.recon.chatbot.openai.api.key` and
+   `ozone.recon.chatbot.anthropic.api.key` as needed. The command prompts for 
the secret value.
+
+2. Point Recon at the keystore in `ozone-site.xml`:
+
+   ```xml
+   <property>
+     <name>hadoop.security.credential.provider.path</name>
+     <value>localjceks://file/etc/security/recon-keys.jceks</value>
+   </property>
+   ```
+
+3. Protect the keystore. Restrict file permissions (for example `chmod 600`, 
owned by the Recon
+   service user) and supply the keystore password out-of-band — for example via
+   `HADOOP_CREDSTORE_PASSWORD` or a password file — rather than relying on the 
default.
+
+4. Restart Recon and verify with `GET /api/v1/chatbot/health` 
(`llmClientAvailable` should be `true`).
+
+**Rotation / removal:** to rotate a key, delete the alias with `hadoop credential delete` and re-create it
+with `hadoop credential create` (creating an alias that already exists fails); to remove a key, delete the
+alias. Restart Recon afterwards. If a key is missing, that provider is simply unavailable; the
+feature still works through any other configured provider.
+
+| Environment | Recommended storage |
+|---|---|
+| Local dev / testing | `ozone-site.xml` (plaintext) |
+| Production | Hadoop Credential Provider (JCEKS) |
+
+## 7. Getting started
+
+1. Enable the feature:
+
+   ```xml
+   <property>
+     <name>ozone.recon.chatbot.enabled</name>
+     <value>true</value>
+   </property>
+   ```
+
+2. Choose a provider and model (defaults are `gemini` / `gemini-2.5-flash`).
+3. Supply an API key (see [Managing API keys](#6-managing-api-keys-secure-vs-insecure-storage)).
+4. Restart Recon and verify:
+   - `GET /api/v1/chatbot/health`
+   - `GET /api/v1/chatbot/models`
+   - open the assistant panel in the Recon UI.
+
+When the feature is disabled, none of its components are wired in and it 
cannot affect Recon.
+
+## 8. Configuration reference
+
+All keys are under the prefix `ozone.recon.chatbot.`.
+
+### Feature toggle
+
+| Key | Default | Description |
+|---|---|---|
+| `enabled` | `false` | Master switch for the assistant. Off by default. |
+
+### Provider & model
+
+| Key | Default | Description |
+|---|---|---|
+| `provider` | `gemini` | Default provider: `openai`, `gemini`, or `anthropic`. |
+| `default.model` | `gemini-2.5-flash` | Default model when none is requested or the requested one is unavailable. |
+
+### API keys (see Section 6)
+
+| Key | Default | Description |
+|---|---|---|
+| `openai.api.key` | _(none)_ | OpenAI API key. Prefer JCEKS storage. |
+| `gemini.api.key` | _(none)_ | Gemini API key. Prefer JCEKS storage. |
+| `anthropic.api.key` | _(none)_ | Anthropic API key. Prefer JCEKS storage. |
+
+### Base URL overrides
+
+| Key | Default | Description |
+|---|---|---|
+| `openai.base.url` | `https://api.openai.com` | Override to target an OpenAI-compatible gateway. |
+| `gemini.base.url` | `https://generativelanguage.googleapis.com/v1beta/openai/` | Gemini's OpenAI-compatible endpoint. |
+
+### Execution policy
+
+| Key | Default | Description |
+|---|---|---|
+| `exec.require.safe.scope` | `true` | Require a bucket-scoped prefix for key listings. Keep enabled in production (see [Limits](#11-limits--boundary-conditions)). |
+| `max.tool.calls` | `5` | Maximum number of Recon reads a single question may trigger. |
+
+### Concurrency & timeouts
+
+| Key | Default | Description |
+|---|---|---|
+| `thread.pool.size` | `5` | Worker threads for chatbot requests. Size to expected concurrent users. |
+| `max.queue.size` | `10` | Requests that may wait when all threads are busy; beyond this, clients get HTTP 503. |
+| `timeout.ms` | `120000` | Timeout for a single provider call (ms). |
+| `request.timeout.ms` | `180000` | Overall per-request wall-clock timeout (ms); exceeding it returns HTTP 504. Default is 3 minutes. |
+
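To make the interaction between `thread.pool.size` and `max.queue.size` concrete, here is a small Python sketch of the admission rule the table describes. The function name `admission` and the `in_flight` counter (requests already running or waiting) are hypothetical names for illustration; the sketch only restates the documented defaults and does not model how Recon schedules work internally.

```python
def admission(in_flight: int, pool_size: int = 5, queue_size: int = 10) -> str:
    """Classify a new request given how many requests are already in flight.

    `in_flight` counts requests that are running or waiting in the queue.
    """
    if in_flight < pool_size:
        return "run"      # a worker thread is free
    if in_flight < pool_size + queue_size:
        return "queue"    # all threads are busy, but the queue has room
    return "reject"       # pool and queue are full: the client gets HTTP 503
```

With the defaults, up to 15 requests can be in flight at once (5 running, 10 waiting), and the next arrival is rejected.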
+### Model lists (UI dropdown)
+
+| Key | Default |
+|---|---|
+| `openai.models` | `gpt-4.1, gpt-4.1-mini, gpt-4.1-nano` |
+| `gemini.models` | `gemini-2.5-pro, gemini-2.5-flash, gemini-3-flash-preview, gemini-3.1-pro-preview` |
+| `anthropic.models` | `claude-opus-4-6, claude-sonnet-4-6` |
+
+### Anthropic header
+
+| Key | Default | Description |
+|---|---|---|
+| `anthropic.beta.header` | `context-1m-2025-08-07` | Anthropic beta header (enables the 1M-token context window). Set empty to disable. |
+
+## 9. Using the assistant — what you can ask
+
+Ask by **intent**; the assistant maps your question to the right Recon view. 
It can answer questions
+about:
+
+- **Cluster & capacity** — overall health, storage used/available.
+- **Datanodes** — inventory, health, dead/stale nodes.
+- **Pipelines** — inventory, leaders, members, state.
+- **Containers** — inventory and health: unhealthy, missing, deleted, OM/SCM 
mismatch, quasi-closed.
+- **Keys** — committed key listings, open/uncommitted keys, pending-delete 
keys, multipart uploads.
+- **Volumes & buckets** — inventory, ownership, layout, quotas.
+- **Namespace** — disk usage, object counts, quota usage, file-size 
distribution for a path.
+- **Tasks** — Recon background task and sync status.
+
+**Example questions:** "How much storage is used?", "Are any containers 
under-replicated?", "Show
+open keys in `/vol1/bucket1`", "List buckets in volume `sales`", "What is the 
disk usage of
+`/vol1/bucket1`?", "Did any Recon task fail?".
+
+**Conceptual questions** (for example "What is an FSO bucket?") are answered 
directly, without reading
+cluster data.
+
+**What it cannot do** (it will decline and suggest the nearest supported view 
rather than guess):
+per-container replica timelines, raw block-to-key mapping, any mutation, and 
arbitrary computation.
+
+> **Tip:** Name the volume and bucket, and say "open" when you mean 
uncommitted keys. FSO/OBS is a
+> bucket *layout*, not a key *state* — "FSO keys" means committed keys in an 
FSO bucket, while "open
+> FSO keys" means uncommitted keys.
+
+## 10. Tool (endpoint) reference
+
+These are the Recon views the assistant can call. This list mirrors the 
in-code catalog (see
+[Extending](#14-extending-the-assistant-for-new-recon-features)).
+
+| Group | Tool | Answers |
+|---|---|---|
+| Cluster | `api_v1_clusterState` | Overall cluster snapshot (capacity, counts, health). |
+| Cluster | `api_v1_datanodes` | Datanode inventory and health. |
+| Cluster | `api_v1_pipelines` | Pipeline inventory, leaders, members, state. |
+| Containers | `api_v1_containers` | General container inventory. |
+| Containers | `api_v1_containers_missing` | Missing / lost containers. |
+| Containers | `api_v1_containers_unhealthy` | All unhealthy containers (aggregate). |
+| Containers | `api_v1_containers_unhealthy_state` | Unhealthy containers filtered to one state. |
+| Containers | `api_v1_containers_deleted` | Containers deleted in SCM. |
+| Containers | `api_v1_containers_mismatch` | OM/SCM existence mismatches. |
+| Containers | `api_v1_containers_mismatch_deleted` | Deleted in SCM but still present in OM. |
+| Containers | `api_v1_containers_quasiClosed` | Quasi-closed containers. |
+| Containers | `api_v1_containers_unhealthy_export` | Export jobs for unhealthy-container data. |
+| Keys | `api_v1_keys_open` | Open / uncommitted keys (detailed). |
+| Keys | `api_v1_keys_open_summary` | Open-key totals. |
+| Keys | `api_v1_keys_open_mpu_summary` | Open multipart-upload totals. |
+| Keys | `api_v1_keys_deletePending` | Keys pending deletion (detailed). |
+| Keys | `api_v1_keys_deletePending_summary` | Pending-delete key totals. |
+| Keys | `api_v1_keys_deletePending_dirs` | Directories pending deletion. |
+| Keys | `api_v1_keys_deletePending_dirs_summary` | Pending-delete directory totals. |
+| Keys | `api_v1_keys_listKeys` | Committed key/file listing and filtering. |
+| Namespace | `api_v1_volumes` | Volume inventory. |
+| Namespace | `api_v1_buckets` | Bucket inventory (optionally by volume). |
+| Namespace | `api_v1_namespace_summary` | Object counts under a path. |
+| Namespace | `api_v1_namespace_usage` | Disk usage for a path. |
+| Namespace | `api_v1_namespace_quota` | Quota limit vs. usage for a path. |
+| Namespace | `api_v1_namespace_dist` | File-size distribution under a path. |
+| Utilization | `api_v1_utilization_fileCount` | File-count distribution by size tier. |
+| Utilization | `api_v1_utilization_containerCount` | Container-count distribution by size tier. |
+| Tasks | `api_v1_task_status` | Recon background task and sync status. |
+
+## 11. Limits & boundary conditions
+
+The assistant is a bounded, read-only summarizer — not a query engine. Keep 
these in mind when
+interpreting answers:
+
+- **At most 1000 records per read, no pagination.** Answers are a **sample / 
first page**, not the
+  full dataset. Narrow the scope (path prefix, filters) to see more.
+- **Not randomized.** A request for a "random sample" returns the first page 
and is presented as a
+  sample, not a true random draw.
+- **Not a computing engine.** It reports what endpoints return; it does not 
run ad-hoc aggregations,
+  joins, or math across the cluster.
+- **Safe-scope for key listings.** When `exec.require.safe.scope` is enabled 
(default), listing keys
+  requires a bucket-scoped prefix (`/<volume>/<bucket>` or deeper), preventing 
full-cluster scans.
+- **Sync freshness.** Answers reflect Recon's **last successful OM/SCM 
metadata sync**, not the live
+  cluster. Recon syncs on a configurable interval, so very recent changes may 
not appear yet; ask
+  about task/sync status (`api_v1_task_status`) to gauge freshness.
+- **Bounded concurrency and time.** Requests beyond pool + queue capacity get 
HTTP 503; requests
+  exceeding `request.timeout.ms` get HTTP 504.
+- **Honest answers.** Truncation, empty results, and sampling are called out 
in the response text.
+
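The safe-scope rule above amounts to a check on the key-listing prefix. A minimal Python sketch of such a check follows; `is_bucket_scoped` is a hypothetical helper, and requiring a leading `/` is an assumption drawn from the `/<volume>/<bucket>` notation rather than from Recon's code.

```python
def is_bucket_scoped(prefix: str) -> bool:
    """Return True if `prefix` names at least /<volume>/<bucket>."""
    # Assumption: prefixes are absolute, as in the /<volume>/<bucket> notation.
    if not prefix.startswith("/"):
        return False
    # Drop empty segments so trailing or doubled slashes do not count.
    segments = [s for s in prefix.split("/") if s]
    return len(segments) >= 2
```

Under this sketch `/vol1/bucket1` and `/vol1/bucket1/logs/` pass, while `/vol1` and `/` are rejected because they would widen the listing to a whole volume or the whole cluster.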
+## 12. REST API endpoints
+
+The assistant is exposed under `/api/v1/chatbot`.
+
+### `POST /api/v1/chatbot/chat`
+
+Request (`model`, `provider`, and `userId` are optional):
+
+```json
+{
+  "query": "How many datanodes are healthy?",
+  "model": "gemini-2.5-flash",
+  "provider": "gemini",
+  "userId": "alice"
+}
+```
+
+Response:
+
+```json
+{ "response": "...", "success": true }
+```
+
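For scripted access, the request above can be built with Python's standard library. This is a sketch under stated assumptions: the host name and port in `RECON_URL` are placeholders for your Recon HTTP address, the helper names `build_chat_request` and `ask` are hypothetical, the JSON `Content-Type` header is assumed, and any authentication your deployment places in front of Recon is not shown.

```python
import json
import urllib.request
from typing import Optional

# Assumption: placeholder Recon HTTP address; adjust to your deployment.
RECON_URL = "http://recon.example.com:9888"


def build_chat_request(
    query: str,
    model: Optional[str] = None,
    provider: Optional[str] = None,
    user_id: Optional[str] = None,
    base_url: str = RECON_URL,
) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/v1/chatbot/chat."""
    body = {"query": query}
    # The optional fields are left out of the body when not supplied.
    for key, value in (("model", model), ("provider", provider), ("userId", user_id)):
        if value is not None:
            body[key] = value
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/chatbot/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(request: urllib.request.Request, timeout_s: float = 180.0) -> str:
    """Send the request and return the `response` text from the JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.load(resp)["response"]
```

A call such as `ask(build_chat_request("How many datanodes are healthy?"))` relies on the server-side defaults for provider and model. The client timeout of 180 seconds mirrors the default `request.timeout.ms`.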
+### `GET /api/v1/chatbot/health`
+
+Always returns HTTP 200 with the current state. `llmClientAvailable` is `true` 
only when the feature
+is enabled **and** at least one provider has a usable API key:
+
+```json
+{ "enabled": true, "llmClientAvailable": true }
+```
+
+### `GET /api/v1/chatbot/models`
+
+Returns the model lists for the configured (key-present) providers — exactly 
what the UI dropdown
+should offer. The list is empty when no provider is configured:
+
+```json
+{ "models": ["gemini-2.5-pro", "gemini-2.5-flash"] }
+```
+
+### Status codes
+
+| Code | Meaning |
+|---|---|
+| 200 | Success. |
+| 400 | Empty/blank query. |
+| 500 | Internal error (details are logged, not returned). |
+| 503 | Feature disabled, or the request queue is full (overloaded). |
+| 504 | Request exceeded `request.timeout.ms`. |
+
+The `userId` is masked in logs so identities are not leaked.
+
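A client can turn the status codes above into a simple reaction table. The Python sketch below is illustrative only: `next_step` is a hypothetical helper, and the suggested reactions are one reasonable reading of the table, not behavior that Recon prescribes.

```python
def next_step(status: int) -> str:
    """Suggest a client-side reaction to a chat response status code."""
    if status == 200:
        return "show the answer"
    if status == 400:
        return "send a non-empty query"
    if status == 503:
        # Disabled and overloaded share this code; the health endpoint's
        # `enabled` field tells them apart.
        return "check /api/v1/chatbot/health, then retry later if enabled"
    if status == 504:
        return "narrow the question or raise request.timeout.ms"
    # 500 and anything unexpected: the response carries no details.
    return "report the failure; details are in the Recon logs"
```

Because HTTP 503 covers both a disabled feature and a full queue, checking the health endpoint before retrying avoids polling a Recon instance where the assistant is switched off.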
+## 13. Security model
+
+Defenses are layered so that even a fully prompt-injected model cannot make 
Recon do anything unsafe:
+
+- **Prompt-level** — the model is told the user message is untrusted and to 
ignore embedded
+  instructions.
+- **Allowlist** — only the registered Recon tools can ever execute.
+- **Safe-scope** — key listings require a bucket-scoped prefix (default).
+- **Record cap** — every read is capped at 1000 records.
+- **Credential isolation** — API keys are resolved server-side (see
+  [Managing API keys](#6-managing-api-keys-secure-vs-insecure-storage)); never 
per request.
+- **Resource bounds** — a bounded thread pool, queue, and per-request timeout.
+- **Read-only** — by construction, the assistant only reads Recon metadata.
+
+See also [Data sent to third-party LLM providers](#5-data-sent-to-third-party-llm-providers) for the
+data-egress considerations.
+
+## 14. Extending the assistant for new Recon features
+
+The assistant is built to grow with Recon:
+
+- **Tunable semantics live in resources** (the prompt files) and can be edited 
without recompiling.
+- **The tool catalog lives in code** as a small, reviewed set.
+
+To expose a **new** Recon endpoint to the assistant:
+
+1. Add it to the in-code catalog — a tool spec (name, description, 
parameters), an allowlist entry,
+   and a router case that calls the Recon bean.
+2. Document its semantics in `recon-tool-semantics.md` so the model knows when 
to choose it.
+
+An automated consistency test keeps the catalog, allowlist, and router in sync 
— adding a tool in
+only one place fails the build.
+
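The idea behind the consistency test can be illustrated in a few lines of Python. This is not the project's test, which lives with the Recon source; `check_in_sync` and its three arguments are hypothetical names for the tool names found in the catalog, the allowlist, and the router.

```python
from typing import Iterable, Set


def check_in_sync(
    catalog: Iterable[str],
    allowlist: Iterable[str],
    router: Iterable[str],
) -> Set[str]:
    """Return tool names that are missing from at least one of the three places."""
    sets = [set(catalog), set(allowlist), set(router)]
    everywhere = set.intersection(*sets)  # present in all three
    anywhere = set.union(*sets)           # present in at least one
    return anywhere - everywhere
```

An empty result means the three places agree; any returned name identifies a tool that was added or removed in only some of them, which is the condition described above as failing the build.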
+> **Note:** Do not hand-edit the tuned prompt wording. The shipped prompts are 
tuned for Recon;
+> extend the semantic guide for new tools, but otherwise leave the prompts as 
they are.
+
+## 15. Prompt & resource files
+
+The editable prompt resources live in 
`hadoop-ozone/recon/src/main/resources/chatbot/`:
+
+| File | Role |
+|---|---|
+| `recon-tool-selection-prompt-preamble.txt` | Tool-selection rules and prompt-injection defense. |
+| `recon-tool-semantics.md` | The per-tool semantic guide. **Extend this when adding a tool.** |
+| `recon-summarization-prompt.txt` | Rules for formatting the final answer. |
+| `recon-fallback-prompt-template.txt` | Reply used when no tool fits / off-topic questions. |
+
+The shipped versions are tuned for Recon — change them deliberately.
+
+## 16. Troubleshooting & operations
+
+| Symptom | Likely cause / fix |
+|---|---|
+| Empty answer from a reasoning model (e.g. `gemini-2.5-pro`) | The model spent its token budget "thinking". Prefer a fast model (flash), or raise token limits. |
+| Answered by an unexpected model/provider | Routing fallback: the requested provider/model was not configured. See [routing](#provider--model-routing-fallback-behavior). |
+| "No API key configured" | Check the provider, the key, and `hadoop.security.credential.provider.path`. |
+| HTTP 504 (timeout) / HTTP 503 (overloaded) | Tune `thread.pool.size`, `max.queue.size`, `request.timeout.ms`. |
+| Stale answers | Recon sync lag: answers reflect the last sync. Check `api_v1_task_status`. |
+| Egress / connection failures | Firewall, proxy, or `*.base.url`. See [Prerequisites & egress](#4-prerequisites--network-egress). |
+
+Logs record request lifecycle and token counts but **not** the query text or 
any secrets.
+
+## 17. References
+
+- `CODE_FLOW.md` (internal design, for contributors).
+- Hadoop Credential Provider API and Ozone security documentation.
+- Provider documentation (OpenAI / Gemini / Anthropic), including their 
data-retention and training
+  policies.

