This is an automated email from the ASF dual-hosted git repository.
morningman pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/doris-skills.git
The following commit(s) were added to refs/heads/main by this push:
new 790efae chore: inline doriscli/Apache Doris, drop placeholder render
layer
790efae is described below
commit 790efae7fd14b78bbd6f375754252cbeecb7a742
Author: Mingyu Chen (Rayner) <[email protected]>
AuthorDate: Fri May 29 15:23:50 2026 +0800
chore: inline doriscli/Apache Doris, drop placeholder render layer
The skills now pair directly with doriscli (the Apache Doris CLI) instead of
rendering per-distribution via a placeholder/values mechanism.
- Inline the 5 placeholder tokens to their Apache Doris values
(doriscli, Apache Doris, DORIS_CLI_PATH, Apache Doris cloud mode).
- Neutralize cloud content: drop the nonexistent `cloud cluster get` command
(use `doriscli auth status` backends instead) and the CLOUD_OPS_SKILL
refs.
- Remove scaffolding: PLACEHOLDERS.md, scripts/ (render.sh + lint), and the
redundant hand-maintained AGENTS.md.
- Add CLI-CONTRACT.md documenting the doriscli commands/fields the skills
rely on.
- Rewrite README to "pairs with doriscli".
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
---
CLI-CONTRACT.md | 48 +
PLACEHOLDERS.md | 37 -
README.md | 47 +-
scripts/lint-no-brand.sh | 22 -
scripts/render.sh | 87 --
skills/doris-architecture-advisor/SKILL.md | 10 +-
.../references/decision-ingestion-strategy.md | 6 +-
.../references/decision-sizing-matrix.md | 12 +-
.../references/decision-workload-classification.md | 4 +-
.../references/example-adtech-marketing.md | 8 +-
.../references/example-gaming.md | 10 +-
.../references/example-iot-sensor-platform.md | 6 +-
.../references/example-logistics-courier.md | 6 +-
.../references/example-payment-fintech.md | 6 +-
.../references/example-retail-fashion.md | 4 +-
.../references/example-securities-analytics.md | 4 +-
.../references/example-web3-exchange.md | 2 +-
skills/doris-best-practices/AGENTS.md | 1167 --------------------
skills/doris-best-practices/SKILL.md | 50 +-
.../references/cli-investigation.md | 62 +-
.../references/start-self-hosted.md | 12 +-
21 files changed, 183 insertions(+), 1427 deletions(-)
diff --git a/CLI-CONTRACT.md b/CLI-CONTRACT.md
new file mode 100644
index 0000000..58e84ab
--- /dev/null
+++ b/CLI-CONTRACT.md
@@ -0,0 +1,48 @@
+# doriscli contract
+
+These skills are written directly against **`doriscli`** (the Apache Doris
CLI, shipped
+in the companion `doris-cli` repository). The runtime-diagnosis logic in
+`doris-best-practices` hard-codes the command names and JSON field names below.
+
+**If doriscli renames a command or output field, update both repositories
together.**
+There is no render/adapter layer to absorb the change, and a renamed field
silently
+degrades to `null` in the agent's reasoning instead of raising an error.
+
+doriscli is **optional**: when it is not installed the skills fall back to the
+MySQL-protocol + FE HTTP path (see
+`skills/doris-best-practices/references/cli-investigation.md` → "Native SQL +
HTTP
+path"). The commands and fields here are what the skills rely on when doriscli
*is*
+present.
+
+## Commands
+
+| Command | Used for |
+|---|---|
+| `doriscli sql "<q>"` (`--profile`, `--no-cache`, `--set k=v`, `-f file`) |
run SQL / `EXPLAIN` / `SHOW CREATE TABLE`; profiled execution |
+| `doriscli profile get <qid>` (`--full`, `--raw`, `-f file`) | fetch + parse
one profile |
+| `doriscli profile list` (`--active`, `--limit N`) | recent /
currently-running queries |
+| `doriscli profile diff <slow> <fast>` | operator-level regression |
+| `doriscli profile history "<pat>"` (`--days N`, `--limit N`) | p50/p99 trend
from `audit_log` |
+| `doriscli tablet <db.t>` (`--detail`, `--partition p`) | model / bucket /
sort key + tablet health |
+| `doriscli auth status` | MySQL + HTTP connectivity, backends, version |
+| `doriscli use <name>` | switch the active environment |
+| global `--format json`, `--env`, `--socks5`, `--init-sql` | machine-readable
output / routing |
+
+## JSON fields
+
+**`profile get`**
+- `summary.total_time_ms`
+- `query_stats.{total_scan_rows, spilled_operators, blocked_operators}`
+- `time_breakdown.plan`
+- `operators[].{selectivity, spilled, shuffle_bytes, join_type,
peak_mem_bytes, blocked_on_upstream, cache_hit_pct, runtime_filters}`
+- `scanned_tables.<table>.{ddl, tablet_skew, total_rows}` (object keyed by
table name)
+- `served_by`, `fetch_attempts` (present on fetch failure)
+
+**`tablet`**
+- `model`, `bucket_key`, `bucket_count`, `sort_key`, `total_rows`
+- `health.tablet_skew`
+- `columns[].ndv` (column cardinality — inlined, so a separate `SHOW COLUMN
STATS` is usually unnecessary)
+
+**`auth status`**
+- `mysql_status`, `http_status`, `http_probe`
+- `backends[].alive` (empty / all-not-alive ⇒ compute suspended or unavailable)
diff --git a/PLACEHOLDERS.md b/PLACEHOLDERS.md
deleted file mode 100644
index 75cf644..0000000
--- a/PLACEHOLDERS.md
+++ /dev/null
@@ -1,37 +0,0 @@
-# Placeholders
-
-The shared skills under `skills/` contain 5 placeholder tokens. Brand
repositories supply concrete values via `brand.values.yaml` and use
`scripts/render.sh` to produce final skill content.
-
-## Tokens
-
-| Placeholder | Type | Example (VeloDB) | Example (SelectDB) | Used in |
-|---|---|---|---|---|
-| `{{CLI}}` | string | `velocli` | `sdbcli` | All shared skills that mention
CLI commands |
-| `{{CLI_PATH_ENV}}` | string | `VELOCLI_PATH` | `SDBCLI_PATH` | CLI binary
detection in `doris-best-practices/SKILL.md` and `cli-investigation.md` |
-| `{{PRODUCT_NAME}}` | string | `VeloDB` | `SelectDB` | Product-name
references in narrative text and customer-story examples |
-| `{{CLOUD_PRODUCT_NAME}}` | string | `VeloDB Cloud` | `SelectDB Cloud` |
Cloud-product-specific behavior (e.g., "managed by … automatically") |
-| `{{CLOUD_OPS_SKILL}}` | string | `velodb-cloud` | `selectdb-cloud` |
Skill-name cross-references to the brand's Cloud operations skill |
-
-## Versioning
-
-This file is the contract between the shared repo and brand repos. If a
placeholder is added, renamed, or removed:
-
-1. Bump shared repo to a new git tag (semver MINOR for additions, MAJOR for
renames/removals)
-2. Update this file's version history below
-3. Brand repos must update their `brand.values.yaml` to match before pulling
the new shared version
-
-## Version history
-
-- **1.0** (initial) — `{{CLI}}`, `{{CLI_PATH_ENV}}`, `{{PRODUCT_NAME}}`,
`{{CLOUD_PRODUCT_NAME}}`, `{{CLOUD_OPS_SKILL}}`
-
-## Banned bare strings (CI enforcement)
-
-`scripts/lint-no-brand.sh` rejects any of these literal strings appearing in
`skills/**/*.md`:
-
-- `velocli`, `sdbcli`, `VELOCLI`, `SDBCLI`
-- `VeloDB`, `SelectDB`
-- `velodb.cloud`, `velodb.io`, `selectdb.com`, `selectdb.cn`
-- `~/.velodb`, `~/.selectdb`
-- `VELO_`, `SDB_` (as env-var prefixes — full identifiers like `$VELO_HOST`)
-
-These should all be expressed via placeholders or neutralized to "Apache
Doris" / "Cloud mode".
diff --git a/README.md b/README.md
index a400596..aba702c 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,22 @@
-# doris-skills
+# Agent Skills for Apache Doris
-Brand-neutral Apache Doris agent skills for use with Claude Code. This
repository contains the **kernel knowledge** that is identical across VeloDB
and SelectDB (both built on Apache Doris): table design, sizing, query
investigation, and architecture decisions.
+Apache Doris agent skills for use with Claude Code. This repository contains
the
+**kernel knowledge** for any Apache Doris deployment: table design, sizing,
query
+investigation, and architecture decisions.
-Cloud-platform operations (CLI auth, cluster lifecycle, billing, networking)
are **brand-specific** and live in separate brand repositories that consume
this shared content via a render step.
+The skills pair with **`doriscli`**, the Apache Doris CLI (companion
`doris-cli`
+repository). doriscli is the "hands" — it runs SQL, fetches and parses query
profiles,
+and analyzes tablet distribution, all as structured JSON; these skills are the
"brain" —
+the decision logic that reads that JSON and recommends table designs, sizing,
and fixes.
+doriscli is optional: when it is not installed the skills fall back to plain
+MySQL-protocol SQL plus the FE HTTP API, so they work with or without it (see
+`skills/doris-best-practices/references/cli-investigation.md`).
+
+The exact doriscli commands and JSON fields the skills depend on are listed in
+[CLI-CONTRACT.md](CLI-CONTRACT.md) — keep that file and doriscli in sync.
+
+Cluster-lifecycle, billing, and networking operations are managed-service
specific and
+intentionally out of scope here; use your platform's cluster-management
console for those.
## Skills
@@ -11,32 +25,35 @@ Cloud-platform operations (CLI auth, cluster lifecycle,
billing, networking) are
| `doris-best-practices` | Apache Doris table design, sizing, and runtime
query investigation (37 rules, 7 use-case templates, 4 sizing guides) |
| `doris-architecture-advisor` | Workload-aware architecture design (8
decision rules, 10 worked industry examples) |
-## How brand repositories consume this
+## How to use
-Brand repositories (e.g. `velodb-cloud-skills`, `selectdb-cloud-skills`)
include this repo as a git submodule and run `scripts/render.sh` at install
time:
+Copy the skills into your Claude Code skills directory:
```bash
-# In a brand repo:
-./doris-skills/scripts/render.sh ./brand.values.yaml ~/.claude/skills/
+cp -r skills/* ~/.claude/skills/
```
-`render.sh` substitutes the 5 placeholder tokens (see
[PLACEHOLDERS.md](PLACEHOLDERS.md)) with brand-specific values and writes
rendered skills to the target directory.
+For the structured-diagnosis path (query profiles, tablet health), also install
+doriscli from the companion `doris-cli` repository. Without it, the skills use
the
+SQL + HTTP fallback automatically.
## Authoring rules
-- **Never** write brand-specific bare strings (CLI names, hostnames, console
URLs, env-var prefixes). Use the placeholders in
[PLACEHOLDERS.md](PLACEHOLDERS.md), or neutral wording ("Apache Doris" / "Cloud
mode").
-- Cloud-mode behavior (storage-compute separation, replication_num=1, file
cache) is **Doris kernel architecture** — describe it neutrally without naming
a specific Cloud product, unless the statement is specifically about
brand-managed behavior (e.g., automatic node-count management).
-- Run `scripts/lint-no-brand.sh` before committing.
+- Write `doriscli` and `Apache Doris` directly. **Never** introduce a
downstream
+ distribution's brand strings — CLI names, console URLs, env-var prefixes,
hostnames.
+- Cloud-mode behavior (storage-compute separation, `replication_num=1`, file
cache) is
+ **Doris kernel architecture** — describe it neutrally as "cloud mode" /
"storage-compute".
+ Managed-service behavior (suspend/resume, automatic node-count management,
billing) is
+ out of scope; point at the user's cluster-management console rather than
naming a product.
+- If you rely on a new doriscli command or output field, add it to
+ [CLI-CONTRACT.md](CLI-CONTRACT.md) so the two repositories stay in sync.
## Layout
```
doris-skills/
├── README.md
-├── PLACEHOLDERS.md # Placeholder catalog + version contract
-├── scripts/
-│ ├── render.sh # Substitutes placeholders for a brand
-│ └── lint-no-brand.sh # Fails CI if bare brand strings appear
+├── CLI-CONTRACT.md # doriscli commands + JSON fields the skills
depend on
└── skills/
├── doris-best-practices/
└── doris-architecture-advisor/
diff --git a/scripts/lint-no-brand.sh b/scripts/lint-no-brand.sh
deleted file mode 100755
index c43f417..0000000
--- a/scripts/lint-no-brand.sh
+++ /dev/null
@@ -1,22 +0,0 @@
-#!/usr/bin/env bash
-# lint-no-brand.sh — fail if any brand-specific bare string leaks into
-# the shared skills/ tree. These should all be expressed via {{...}}
placeholders
-# or neutralized to "Apache Doris" / "Cloud mode".
-
-set -euo pipefail
-
-SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)
-REPO_ROOT=$(dirname -- "$SCRIPT_DIR")
-SKILLS_DIR=$REPO_ROOT/skills
-
-# Bare brand strings forbidden in shared content
-FORBIDDEN='velocli|sdbcli|VELOCLI|SDBCLI|VeloDB|SelectDB|velodb\.cloud|velodb\.io|selectdb\.com|selectdb\.cn|~/\.velodb|~/\.selectdb|\$VELO_|\$SDB_'
-
-if grep -rnE --include='*.md' "$FORBIDDEN" "$SKILLS_DIR"; then
- echo ""
- echo "lint-no-brand: FAIL — bare brand strings found above." >&2
- echo "Replace with the appropriate {{PLACEHOLDER}} or use 'Apache Doris' /
'Cloud mode' wording." >&2
- exit 1
-fi
-
-echo "lint-no-brand: ok"
diff --git a/scripts/render.sh b/scripts/render.sh
deleted file mode 100755
index 0ef3937..0000000
--- a/scripts/render.sh
+++ /dev/null
@@ -1,87 +0,0 @@
-#!/usr/bin/env bash
-# render.sh — render shared-repo skills with brand placeholder values.
-#
-# Usage:
-# render.sh <brand_values_yaml> <out_dir>
-#
-# The shared skill source is assumed to live at <repo_root>/skills/ where
-# <repo_root> is the parent directory of this script.
-
-set -euo pipefail
-
-if [[ $# -ne 2 ]]; then
- echo "Usage: $0 <brand_values_yaml> <out_dir>" >&2
- exit 2
-fi
-
-VALUES_FILE=$1
-OUT_DIR=$2
-
-SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)
-REPO_ROOT=$(dirname -- "$SCRIPT_DIR")
-SHARED_SKILLS_DIR=$REPO_ROOT/skills
-
-if [[ ! -f $VALUES_FILE ]]; then
- echo "error: values file not found: $VALUES_FILE" >&2
- exit 2
-fi
-if [[ ! -d $SHARED_SKILLS_DIR ]]; then
- echo "error: shared skills dir not found: $SHARED_SKILLS_DIR" >&2
- exit 2
-fi
-
-# Extract values from yaml (simple grep — only top-level scalar keys supported)
-get_val() {
- local key=$1
- grep -E "^${key}:" "$VALUES_FILE" | head -1 | sed -E
"s/^${key}:[[:space:]]*//; s/^['\"]//; s/['\"]$//"
-}
-
-CLI=$(get_val cli)
-CLI_PATH_ENV=$(get_val cli_path_env)
-PRODUCT_NAME=$(get_val product_name)
-CLOUD_PRODUCT_NAME=$(get_val cloud_product_name)
-CLOUD_OPS_SKILL=$(get_val cloud_ops_skill)
-
-# Validate all 5 placeholders are set
-missing=()
-[[ -z $CLI ]] && missing+=(cli)
-[[ -z $CLI_PATH_ENV ]] && missing+=(cli_path_env)
-[[ -z $PRODUCT_NAME ]] && missing+=(product_name)
-[[ -z $CLOUD_PRODUCT_NAME ]] && missing+=(cloud_product_name)
-[[ -z $CLOUD_OPS_SKILL ]] && missing+=(cloud_ops_skill)
-if (( ${#missing[@]} > 0 )); then
- echo "error: brand.values.yaml missing keys: ${missing[*]}" >&2
- exit 2
-fi
-
-mkdir -p "$OUT_DIR"
-
-# Render every .md under skills/ into the same path under $OUT_DIR
-find "$SHARED_SKILLS_DIR" -type f | while read -r src; do
- rel=${src#"$SHARED_SKILLS_DIR/"}
- dst=$OUT_DIR/$rel
- mkdir -p "$(dirname "$dst")"
- case $src in
- *.md)
- sed \
- -e "s|{{CLI}}|${CLI}|g" \
- -e "s|{{CLI_PATH_ENV}}|${CLI_PATH_ENV}|g" \
- -e "s|{{PRODUCT_NAME}}|${PRODUCT_NAME}|g" \
- -e "s|{{CLOUD_PRODUCT_NAME}}|${CLOUD_PRODUCT_NAME}|g" \
- -e "s|{{CLOUD_OPS_SKILL}}|${CLOUD_OPS_SKILL}|g" \
- "$src" > "$dst"
- ;;
- *)
- cp "$src" "$dst"
- ;;
- esac
-done
-
-# Verify: no {{ residue in output
-if grep -rEln '\{\{[A-Z_]+\}\}' "$OUT_DIR" >/dev/null; then
- echo "error: unrendered placeholder(s) found in output:" >&2
- grep -rEn '\{\{[A-Z_]+\}\}' "$OUT_DIR" | head >&2
- exit 3
-fi
-
-echo "render: ok -> $OUT_DIR"
diff --git a/skills/doris-architecture-advisor/SKILL.md
b/skills/doris-architecture-advisor/SKILL.md
index b60070e..4c058b7 100644
--- a/skills/doris-architecture-advisor/SKILL.md
+++ b/skills/doris-architecture-advisor/SKILL.md
@@ -14,9 +14,9 @@ description: >
Also triggers on prompts like: "design a table for...", "how should I
store...",
"build an architecture for...", "we have X devices sending data every Y
seconds",
"recommend a cluster size for...", "what data model should I use for...",
- "we need to ingest X GB/day", "migrate from MySQL/PostgreSQL to
{{PRODUCT_NAME}}".
+ "we need to ingest X GB/day", "migrate from MySQL/PostgreSQL to Apache
Doris".
Also use for legacy analytics/search/serving stack consolidation prompts even
- when {{PRODUCT_NAME}} is not named explicitly, including replacing or
migrating from
+ when Apache Doris is not named explicitly, including replacing or migrating
from
Impala, Kudu, Elasticsearch/ES, Greenplum, Presto, HBase, Hive, Hadoop,
Redis, or Lambda-style multi-engine data platforms.
---
@@ -37,7 +37,7 @@ Follow these 5 steps in order:
2. **Classify workload** — Read
`references/decision-workload-classification.md`. Match user's scenario to one
or more of the 6 workload types. Composite workloads (e.g., IoT = time-series +
device state + logs + dashboards) decompose into multiple sub-tables.
-3. **Size the cluster** — Read `references/decision-sizing-matrix.md`.
Estimate write throughput, query QPS, latency target, and hot data volume.
Output sizing as **total vCPU and total cache only** — never break down into
per-node specs (node count is managed by {{CLOUD_PRODUCT_NAME}} automatically).
Also read `references/decision-deployment-mode.md` if user hasn't specified
cloud vs on-prem.
+3. **Size the cluster** — Read `references/decision-sizing-matrix.md`.
Estimate write throughput, query QPS, latency target, and hot data volume.
Output sizing as **total vCPU and total cache only** — never break down into
per-node specs (in cloud / storage-compute mode, node count is typically
managed by the platform). Also read `references/decision-deployment-mode.md` if
user hasn't specified cloud vs on-prem.
4. **Design architecture** — Based on workload classification, read the
relevant decision rules:
@@ -66,7 +66,7 @@ Responses should include these sections (adapt formatting to
conversation):
- **Workload Summary** — Classification, write rate, QPS, latency target, hot
data volume
- **Sizing Recommendation** — Warehouse tier, storage estimate, cache strategy
-- **Architecture Overview** — Data flow from sources → ingestion →
{{PRODUCT_NAME}} → applications
+- **Architecture Overview** — Data flow from sources → ingestion → Apache
Doris → applications
- **Table Designs** — CREATE TABLE with inline comments citing decision rules
- **Rules Checked** — For each table, list the rules applied with exact file
paths so users can look up the rule for troubleshooting. Format: `Per
[rule-name](doris-best-practices/references/rule-name.md)`. Example:
```
@@ -88,7 +88,7 @@ For complete input → output examples, read:
- `references/example-cdc-operational-sync.md` — MySQL CDC, UNIQUE MoW,
sequence column
- `references/example-securities-analytics.md` — Securities firm:
ODS→DWD→DWS→ADS layering, customer 360, compliance, lakehouse, workload
isolation
- `references/example-retail-fashion.md` — Retail/fashion: omnichannel
inventory, wide+tall table for user profiling, BITMAP segmentation, multi-brand
isolation, peak season scaling
-- `references/example-logistics-courier.md` — Logistics/courier: AGGREGATE for
parcel status (MIN/MAX/REPLACE), vehicle GPS with GIS + cooldown_ttl, sorting
center KPIs, platform consolidation (Presto+Kudu+ES+HBase→{{PRODUCT_NAME}})
+- `references/example-logistics-courier.md` — Logistics/courier: AGGREGATE for
parcel status (MIN/MAX/REPLACE), vehicle GPS with GIS + cooldown_ttl, sorting
center KPIs, platform consolidation (Presto+Kudu+ES+HBase→Apache Doris)
- `references/example-web3-exchange.md` — Web3/crypto: multi-chain VARIANT
schema, custody monitoring, TVL/token async MVs, AML risk detection, wallet
profiling, session analysis
- `references/example-payment-fintech.md` — Payment/fintech: partial column
update for tx lifecycle, acquiring row-column hybrid (100+ cols), merchant
reconciliation, risk engine, log platform replacing ES, Lambda→unified
architecture
- `references/example-gaming.md` — Gaming: retention/funnel analysis, player
profiling BITMAP, NL2SQL Agentic analytics via MCP, anti-cheat anomaly
detection, lakehouse for offline data
diff --git
a/skills/doris-architecture-advisor/references/decision-ingestion-strategy.md
b/skills/doris-architecture-advisor/references/decision-ingestion-strategy.md
index 71fcdc8..3370b85 100644
---
a/skills/doris-architecture-advisor/references/decision-ingestion-strategy.md
+++
b/skills/doris-architecture-advisor/references/decision-ingestion-strategy.md
@@ -30,12 +30,12 @@ curl -u user:pass -H "label:load_20250101" \
## Group Commit
Best for high-frequency small writes that cannot batch client-side.
```sql
--- Server-side batching: {{PRODUCT_NAME}} accumulates small inserts
+-- Server-side batching: Apache Doris accumulates small inserts
-- and flushes them together
SET group_commit = async_mode;
INSERT INTO table VALUES (...);
```
-- {{PRODUCT_NAME}} buffers small inserts server-side, flushes as one batch
+- Apache Doris buffers small inserts server-side, flushes as one batch
- Reduces part creation pressure from many small writers
- IoT gateway pattern: each gateway sends small batches → Group Commit merges
@@ -59,7 +59,7 @@ Best for bulk import from object storage.
## Write Throughput Reference
-{{PRODUCT_NAME}} achieves tens of GB/s write throughput with:
+Apache Doris achieves tens of GB/s write throughput with:
- Time-series compaction (`compaction_policy = "time_series"`) —
millisecond-level compaction, near-zero memory overhead
- Columnar storage with optimized flush
- Vectorized index construction
diff --git
a/skills/doris-architecture-advisor/references/decision-sizing-matrix.md
b/skills/doris-architecture-advisor/references/decision-sizing-matrix.md
index de739e6..e7b8dfd 100644
--- a/skills/doris-architecture-advisor/references/decision-sizing-matrix.md
+++ b/skills/doris-architecture-advisor/references/decision-sizing-matrix.md
@@ -8,16 +8,16 @@ Size the cluster BEFORE producing DDL. Match the user's
workload characteristics
- "32 vCPU" means 32 vCPU total for the entire cluster
- "1 TB cache" means 1 TB total cache across all nodes
-- {{CLOUD_PRODUCT_NAME}} manages node count automatically: ≤16 vCPU = 1 node,
above 16 = multiples of 16 (e.g., 32 vCPU = 2 nodes of 16 vCPU each)
+- In cloud / storage-compute deployments, node count is typically managed for
you, mapping vCPU to nodes: ≤16 vCPU = 1 node, above 16 = multiples of 16
(e.g., 32 vCPU = 2 nodes of 16 vCPU each)
- **Do not break down into per-node specs in the output.** Just state the
total vCPU and cache.
## Important Notes
-1. **Purpose:** Quick estimation for sales/marketing to scope opportunities.
~75% accuracy — detailed sizing requires confirmation with a {{PRODUCT_NAME}}
Solutions Architect.
+1. **Purpose:** Quick first-pass estimation to scope a deployment. ~75%
accuracy — detailed sizing requires validation against the actual workload
(data volume, query concurrency, latency targets).
2. **Cache:** Recommend caching ≥50% of hot data to maintain ≥90% cache hit
rate.
3. **Replicas:** Cloud mode uses single replica for both object storage and
cache.
-Source: {{PRODUCT_NAME}} internal sizing guide.
+Source: Apache Doris internal sizing guide.
---
@@ -67,7 +67,7 @@ Log ingestion, full-text search, time-bounded queries.
## Real-time Data Warehouse (ETL/ELT)
-Incremental data processing and transformation within {{PRODUCT_NAME}}.
+Incremental data processing and transformation within Apache Doris.
| Total vCPU | Total Cache | Daily ETL jobs | 80% ETL time | 80% ETL scan |
|------------|-------------|---------------|-------------|-------------|
@@ -78,7 +78,7 @@ Incremental data processing and transformation within
{{PRODUCT_NAME}}.
## Point Query Service
-Key-value point lookups via API. **Note:** High QPS point query requires FE
resource upgrades — coordinate with {{PRODUCT_NAME}} for FE sizing.
+Key-value point lookups via API. **Note:** High QPS point query requires FE
resource upgrades — coordinate with Apache Doris for FE sizing.
| Total vCPU | Total Cache | Write MB/s | Write rows/s | Hot data | QPS |
Latency | 80% query scan |
|------------|-------------|-----------|-------------|----------|-----|---------|----------------|
@@ -129,4 +129,4 @@ Add 20-30% overhead for metadata, compaction temp space,
and safety margin.
| Medium (10-50 BE) | 1 Leader + 2 Follower | 32 GB | 16 cores |
| Large (50+ BE) | 1 Leader + 4 Follower | 64 GB | 32 cores |
-Point query workloads with >50K QPS require FE upgrades — contact
{{PRODUCT_NAME}} support.
+Point query workloads with >50K QPS require FE upgrades — contact Apache Doris
support.
diff --git
a/skills/doris-architecture-advisor/references/decision-workload-classification.md
b/skills/doris-architecture-advisor/references/decision-workload-classification.md
index 1e01e89..4fccaed 100644
---
a/skills/doris-architecture-advisor/references/decision-workload-classification.md
+++
b/skills/doris-architecture-advisor/references/decision-workload-classification.md
@@ -33,7 +33,7 @@ High-volume log ingestion with full-text search and
time-bounded queries.
- **Typical model:** DUPLICATE + inverted index + ZSTD compression
### 5. Real-time Data Warehouse / ETL
-Incremental data processing and transformation within {{PRODUCT_NAME}}.
+Incremental data processing and transformation within Apache Doris.
- **Signals:** ETL, ELT, data pipeline, dbt, materialized view, incremental
processing, ODS/DWD/DWS/ADS layers
- **Data shape:** Layered warehouse model (ODS → DWD → DWS → ADS)
- **Query pattern:** Scheduled transformations, INSERT INTO SELECT, async MV
refresh
@@ -76,7 +76,7 @@ When different teams share the same cluster:
### Lakehouse Federation (Hot/Cold Tiering)
When historical data lives in a data lake (Hive/Iceberg/Hudi):
-- Hot data (recent 1 year): stored in {{PRODUCT_NAME}} for sub-second queries
+- Hot data (recent 1 year): stored in Apache Doris for sub-second queries
- Cold data (3+ years): stays in data lake, queried via Multi-Catalog
- Accelerate repeated lake queries with async MV
- `file_cache_ttl_seconds` controls per-table cache lifetime
diff --git
a/skills/doris-architecture-advisor/references/example-adtech-marketing.md
b/skills/doris-architecture-advisor/references/example-adtech-marketing.md
index 0378b12..740b5c4 100644
--- a/skills/doris-architecture-advisor/references/example-adtech-marketing.md
+++ b/skills/doris-architecture-advisor/references/example-adtech-marketing.md
@@ -40,8 +40,8 @@ The core challenge: the bidding engine needs millisecond
access to user profiles
- **DWS (summary):** Sync MV for real-time CTR/CVR computation. Async MV for
campaign/channel/creative-level rollups.
- **ADS (application):** Pre-aggregated KPIs by advertiser × campaign × day.
Sub-second dashboard queries for 10,000+ concurrent advertisers.
-### Platform consolidation: Redis + MySQL + HBase + Hive → {{PRODUCT_NAME}}
-| Old system | Role | {{PRODUCT_NAME}} replacement |
+### Platform consolidation: Redis + MySQL + HBase + Hive → Apache Doris
+| Old system | Role | Apache Doris replacement |
|-----------|------|-------------------|
| Redis | RTB user profile serving | UNIQUE MoW + store_row_column (30K QPS,
<10ms) |
| MySQL | Campaign management | Flink CDC → UNIQUE MoW |
@@ -51,7 +51,7 @@ The core challenge: the bidding engine needs millisecond
access to user profiles
Result: TCO reduction 30-50%, unified data layer, one SQL interface.
### Cross-border / multi-region
-- {{CLOUD_PRODUCT_NAME}} supports AWS, GCP, Azure deployment in any region
+- Apache Doris cloud mode supports AWS, GCP, Azure deployment in any region
- SAAS and BYOC models for different compliance requirements (GDPR, CCPA)
- Multi-currency conversion handled in DWS layer via dimension table JOINs
- Each region can have its own compute group with workload isolation
@@ -66,5 +66,5 @@ Result: TCO reduction 30-50%, unified data layer, one SQL
interface.
| VARIANT for cross-platform JSON | `official` — different ad platforms have
different request schemas |
| Vector index for creative similarity | `official` — HNSW for ANN search on
creative embeddings |
| Workload Groups for advertiser isolation | `official` — 10K concurrent
advertisers, prevent cross-impact |
-| Dual-path (serving + analytics) | `derived` — same {{PRODUCT_NAME}} instance
serves both RTB and dashboard workloads |
+| Dual-path (serving + analytics) | `derived` — same Apache Doris instance
serves both RTB and dashboard workloads |
| Multi-Catalog for offline attribution | `official` — Hive/S3 data queried
without movement |
diff --git a/skills/doris-architecture-advisor/references/example-gaming.md
b/skills/doris-architecture-advisor/references/example-gaming.md
index 5929776..0a81eb3 100644
--- a/skills/doris-architecture-advisor/references/example-gaming.md
+++ b/skills/doris-architecture-advisor/references/example-gaming.md
@@ -35,10 +35,10 @@ Doris has built-in `retention()` and `window_funnel()`
functions for gaming anal
- Offline tags from Hive/S3 via Catalog or Broker Load
### Agentic analytics platform (NL2SQL + MCP)
-- {{PRODUCT_NAME}} as the data layer for AI Agent applications
+- Apache Doris as the data layer for AI Agent applications
- MCP (Model Context Protocol) interface for AI agents to query data
- Business domain agents: User Behavior Agent, Operations Agent, A/B Testing
Agent, Retention Agent
-- Natural language → SQL → {{PRODUCT_NAME}} → visualization, no coding required
+- Natural language → SQL → Apache Doris → visualization, no coding required
### Anti-cheat: Anomaly detection on behavior events
- DUPLICATE model for raw behavior events
@@ -47,8 +47,8 @@ Doris has built-in `retention()` and `window_funnel()`
functions for gaming anal
- Real-time alerting via scheduled queries or async MV
### Lakehouse integration for offline data
-- Real-time: MySQL → Flink CDC → {{PRODUCT_NAME}} (operational data)
-- Real-time: Kafka → Flink → {{PRODUCT_NAME}} (behavior events, real-time tags)
+- Real-time: MySQL → Flink CDC → Apache Doris (operational data)
+- Real-time: Kafka → Flink → Apache Doris (behavior events, real-time tags)
- Offline: Hive/S3 → Catalog (lakehouse federation, no data movement)
- Warehouse layers: ODS → DWD (async MV / ETL SQL) → DWS (async MV) → ADS
@@ -59,7 +59,7 @@ Doris has built-in `retention()` and `window_funnel()`
functions for gaming anal
| DUPLICATE for behavior events | `official` — append-only, fastest scan for
retention/funnel queries |
| retention() and window_funnel() | `official` — built-in Doris functions for
gaming analytics |
| BITMAP for player segmentation | `official` — sub-second audience selection
across 50M players |
-| MCP interface for Agentic analytics | `official` — {{PRODUCT_NAME}} supports
MCP for AI agent integration |
+| MCP interface for Agentic analytics | `official` — Apache Doris supports MCP
for AI agent integration |
| Sync MV for real-time KPI (1-3s) | `official` — zero-lag aggregation for
dashboard metrics |
| Async MV for near-real-time (1-10min) | `official` — incremental refresh for
complex multi-table analytics |
| Multi-Catalog for offline data | `official` — Hive/Iceberg/Paimon federation
without data movement |
diff --git
a/skills/doris-architecture-advisor/references/example-iot-sensor-platform.md
b/skills/doris-architecture-advisor/references/example-iot-sensor-platform.md
index 9ebedb9..3926383 100644
---
a/skills/doris-architecture-advisor/references/example-iot-sensor-platform.md
+++
b/skills/doris-architecture-advisor/references/example-iot-sensor-platform.md
@@ -41,12 +41,12 @@
```
Sensors → IoT Gateway → Kafka
- ├─→ Stream Load → {{PRODUCT_NAME}} (sensor_readings,
device_shadow)
+ ├─→ Stream Load → Apache Doris (sensor_readings,
device_shadow)
└─→ Cold archive → Object Storage (Minio/S3)
-MySQL (ERP/CRM) → Flink CDC → {{PRODUCT_NAME}} (business data)
+MySQL (ERP/CRM) → Flink CDC → Apache Doris (business data)
-{{PRODUCT_NAME}} → Grafana (dashboards)
+Apache Doris → Grafana (dashboards)
→ API Server (device status point queries)
→ Flink (real-time alerts)
```
diff --git
a/skills/doris-architecture-advisor/references/example-logistics-courier.md
b/skills/doris-architecture-advisor/references/example-logistics-courier.md
index 7eefa76..ef804df 100644
--- a/skills/doris-architecture-advisor/references/example-logistics-courier.md
+++ b/skills/doris-architecture-advisor/references/example-logistics-courier.md
@@ -54,9 +54,9 @@ For fleet tracking with regulatory 3-year retention:
## Platform Consolidation Mapping
-| Current system | Role | {{PRODUCT_NAME}} replacement |
+| Current system | Role | Apache Doris replacement |
|---------------|------|-------------------|
-| Presto (10,000 cores) | Ad-hoc analytics | {{PRODUCT_NAME}} MPP engine — 3x
faster, 48% fewer resources (SF Express case) |
+| Presto (10,000 cores) | Ad-hoc analytics | Apache Doris MPP engine — 3x
faster, 48% fewer resources (SF Express case) |
| Kudu | Real-time updates | UNIQUE MoW or AGGREGATE with REPLACE_IF_NOT_NULL |
| Elasticsearch | Waybill search, address search | Inverted index with unicode
parser |
| HBase | Parcel status point lookup | UNIQUE MoW + store_row_column (30K QPS
on 16-core) |
@@ -72,4 +72,4 @@ For fleet tracking with regulatory 3-year retention:
| GIS functions for geofencing | `official` — ST_CONTAINS, ST_DISTANCE_SPHERE |
| BITMAP_UNION for dock utilization time slicing | `official` — exact
time-slot dedup |
| Inverted index replacing ES | `official` — MATCH_ANY/MATCH_PHRASE for
waybill/address search |
-| Presto → {{PRODUCT_NAME}} consolidation | `field` — SF Express: 3x perf
improvement, 48% resource savings, 100% Presto migration |
+| Presto → Apache Doris consolidation | `field` — SF Express: 3x perf
improvement, 48% resource savings, 100% Presto migration |
diff --git
a/skills/doris-architecture-advisor/references/example-payment-fintech.md
b/skills/doris-architecture-advisor/references/example-payment-fintech.md
index f26b446..3eca174 100644
--- a/skills/doris-architecture-advisor/references/example-payment-fintech.md
+++ b/skills/doris-architecture-advisor/references/example-payment-fintech.md
@@ -6,7 +6,7 @@
- **Data sources:** Transaction DB (acquiring records, clearing, refunds),
user behavior (app events, web payment trails), risk data (device fingerprints,
IP, blacklists), account data (balances, card bindings), external (credit
scoring, crawled data)
- **Data volume:** 100M+ transactions/day, billions of rows per table (100+
columns for acquiring), 10-100TB/day logs
- **Query patterns:** Real-time transaction dashboard, merchant reconciliation
(P99 <2s at millions of daily queries), acquiring point lookups (10K+ QPS,
millisecond, composite key), risk detection (multi-dimensional JOINs), security
log search
-- **Special requirements:** Partial column updates (transaction status
changes), row-column hybrid storage for acquiring (point lookup + analytics on
same table), replacing Lambda architecture
(Hive+HBase+ES+Oracle→{{PRODUCT_NAME}})
+- **Special requirements:** Partial column updates (transaction status
changes), row-column hybrid storage for acquiring (point lookup + analytics on
same table), replacing Lambda architecture (Hive+HBase+ES+Oracle→Apache Doris)
## Workload Classification
@@ -52,9 +52,9 @@ Tables with 100+ columns serving both point lookups (bank
queries by composite k
- time_series compaction for sustained GB/s write with sub-second flush
### Platform consolidation: Lambda → unified architecture
-Replacing 6 systems (Hive + Spark + HBase + ES + Oracle + TiDB) with one
{{PRODUCT_NAME}} instance:
+Replacing 6 systems (Hive + Spark + HBase + ES + Oracle + TiDB) with one
Apache Doris instance:
-| Old system | Role | {{PRODUCT_NAME}} replacement |
+| Old system | Role | Apache Doris replacement |
|-----------|------|-------------------|
| Hive + Spark | Batch ETL | Async MV + INSERT INTO SELECT |
| HBase | Key-value lookups | UNIQUE MoW + store_row_column |
diff --git
a/skills/doris-architecture-advisor/references/example-retail-fashion.md
b/skills/doris-architecture-advisor/references/example-retail-fashion.md
index e0bcd3e..a93ad41 100644
--- a/skills/doris-architecture-advisor/references/example-retail-fashion.md
+++ b/skills/doris-architecture-advisor/references/example-retail-fashion.md
@@ -32,7 +32,7 @@
## Architecture
```
-Internal Systems Ingestion {{PRODUCT_NAME}}
(Real-time DW) Applications
+Internal Systems Ingestion Apache Doris
(Real-time DW) Applications
┌─────────────┐
┌─────────────────────────┐
│ CRM │── Flink CDC ────────────────────────────│ ODS (UNIQUE MoW)
│
│ ERP │ │ ods_orders,
ods_members │
@@ -162,4 +162,4 @@ Use Workload Groups to isolate brand teams:
| Workload Groups for multi-brand isolation | `official` — CPU/memory
isolation |
| Separate compute group for peak season | `derived` — elastic scaling during
Double 11 / 618 |
| Async MV for DWD→DWS→ADS ETL | `official` — replaces Spark/Hive batch
processing |
-| 64 vCPU / 3 TB cache | `field` — ~75% accuracy; confirm with
{{PRODUCT_NAME}} SA |
+| 64 vCPU / 3 TB cache | `field` — ~75% accuracy; confirm with Apache Doris SA
|
diff --git
a/skills/doris-architecture-advisor/references/example-securities-analytics.md
b/skills/doris-architecture-advisor/references/example-securities-analytics.md
index b5a93d4..36fc8ca 100644
---
a/skills/doris-architecture-advisor/references/example-securities-analytics.md
+++
b/skills/doris-architecture-advisor/references/example-securities-analytics.md
@@ -31,7 +31,7 @@
## Architecture
```
-Data Sources Data Collection Real-time DW
({{PRODUCT_NAME}}) Applications
+Data Sources Data Collection Real-time DW (Apache
Doris) Applications
┌──────────┐ ┌─────────────────────┐
│ Trading │──── Flink CDC ──────────────────────────────│ ADS (Aggregate)
│──→ KPI Dashboards
│ (Oracle) │ │ ads_revenue_daily
│──→ Management Cockpit
@@ -259,4 +259,4 @@ SET PROPERTY FOR 'risk_team' 'default_workload_group' =
'wg_risk';
| Inverted index on audit logs | `official` — compliance pattern search on
error_code + message |
| ZSTD on events and logs | `derived` — high redundancy in behavioral and
audit data |
| 3-year partition retention for compliance | `field` — regulatory
requirement; adjust per jurisdiction |
-| 64 vCPU / 3 TB cache | `field` — ~75% accuracy; confirm with
{{PRODUCT_NAME}} SA |
+| 64 vCPU / 3 TB cache | `field` — ~75% accuracy; confirm with Apache Doris SA
|
diff --git
a/skills/doris-architecture-advisor/references/example-web3-exchange.md
b/skills/doris-architecture-advisor/references/example-web3-exchange.md
index a3d60de..0a1e179 100644
--- a/skills/doris-architecture-advisor/references/example-web3-exchange.md
+++ b/skills/doris-architecture-advisor/references/example-web3-exchange.md
@@ -55,7 +55,7 @@ Different blockchains have different transaction fields. Use
VARIANT for chain-s
## Architecture
```
-On-chain Nodes / APIs Ingestion {{PRODUCT_NAME}}
Applications
+On-chain Nodes / APIs Ingestion Apache Doris
Applications
┌───────────────┐ ┌──────────────────────┐
│ Blockchain │ │ Raw Detail │
│ Event Streams │── Kafka → Flink ────────────────│ tx_events (DUP) │──→
Trade Analytics
diff --git a/skills/doris-best-practices/AGENTS.md
b/skills/doris-best-practices/AGENTS.md
deleted file mode 100644
index b0555bd..0000000
--- a/skills/doris-best-practices/AGENTS.md
+++ /dev/null
@@ -1,1167 +0,0 @@
-# Apache Doris — Complete Best Practice Reference
-
-> All 37 rules, 7 use case templates, and sizing guides.
-
----
-
-## Use Case Templates
-
-## Use Case: CDC / Operational Data Sync
-For replicating operational databases (MySQL, PostgreSQL) where rows are
frequently updated or deleted.
-### Template
-```sql
-CREATE TABLE users_sync (
- user_id BIGINT NOT NULL,
- tenant_id INT NOT NULL,
- update_time DATETIME NOT NULL,
- name VARCHAR(100),
- email VARCHAR(200),
- status TINYINT
-) ENGINE=OLAP
-UNIQUE KEY(user_id)
-DISTRIBUTED BY HASH(user_id) BUCKETS 5 -- CDC table: ~10 GB compressed
-PROPERTIES (
- "enable_unique_key_merge_on_write" = "true",
- "function_column.sequence_col" = "update_time",
- "light_schema_change" = "true"
-);
-```
-### Why This Design
-| Decision | Choice | Why |
-|----------|--------|-----|
-| **Model** | UNIQUE (MoW) | Enables UPDATE/DELETE. MoW gives fast reads
without merge-sort. |
-| **Sequence col** | `update_time` | Guarantees out-of-order CDC events don't
overwrite newer data. |
-| **Bucket** | HASH on `user_id` (= primary key) | Point lookups on PK are
pruned to one tablet. |
-| **Partition** | None | CDC tables are often small-to-medium; partition only
if > 100GB. |
-### Customization Points
-- **Large CDC tables (> 100GB):** Add `PARTITION BY RANGE(update_time) ()`
with dynamic partition
-- **Composite primary key:** Use `UNIQUE KEY(tenant_id, user_id)` for
multi-tenant data
-- **Partial column updates:** Set `"enable_unique_key_partial_update" = "true"`
-
----
-
-## Use Case: Dashboard / Pre-Aggregated Metrics
-For pre-computing dashboard numbers: daily sales, ad click totals, user
activity summaries.
-### Template
-```sql
-CREATE TABLE daily_sales_metrics (
- dt DATE NOT NULL, store_id INT NOT NULL,
- total_revenue DECIMAL(15,2) SUM DEFAULT "0",
- max_transaction DECIMAL(15,2) MAX DEFAULT "0",
- order_count BIGINT SUM DEFAULT "0",
- unique_buyers BITMAP BITMAP_UNION
-) ENGINE=OLAP AGGREGATE KEY(dt, store_id)
-PARTITION BY RANGE(dt) ()
-DISTRIBUTED BY HASH(store_id) BUCKETS 5 -- pre-aggregated metrics: ~10 GB
compressed
-PROPERTIES
("dynamic_partition.enable"="true","dynamic_partition.time_unit"="MONTH",
-
"dynamic_partition.start"="-12","dynamic_partition.end"="1","dynamic_partition.prefix"="p");
-```
-### Why This Design
-| Decision | Choice | Why |
-|----------|--------|-----|
-| **Model** | AGGREGATE | Values auto-aggregate (SUM, MAX, BITMAP_UNION) on
ingestion. |
-| **BITMAP** | `BITMAP_UNION` for unique_buyers | Exact count-distinct without
storing raw user IDs. |
-### Anti-Pattern
-```sql
--- BAD: Using DUPLICATE for dashboard data you'll always aggregate
-CREATE TABLE daily_sales_metrics (dt DATE, store_id INT, revenue DECIMAL(15,2))
-DUPLICATE KEY(dt, store_id);
--- Every query must re-aggregate billions of rows at query time.
-```
-
----
-
-## Use Case: Small Dimension / Lookup Table
-For small reference tables (< 1 GB): products, stores, countries, config
values.
-### Template
-```sql
-CREATE TABLE stores (
- store_id INT NOT NULL, region VARCHAR(20) NOT NULL,
- city VARCHAR(50), manager_name VARCHAR(100)
-) ENGINE=OLAP DUPLICATE KEY(store_id)
-DISTRIBUTED BY RANDOM BUCKETS 3
-PROPERTIES ("replication_num" = "3"); -- Use "1" for cloud mode
-```
-### Why This Design
-| Decision | Choice | Why |
-|----------|--------|-----|
-| **Model** | DUPLICATE | Dimension data is reference data — rarely updated. |
-| **Partition** | None | Table is tiny (< 1 GB). Partitioning adds overhead,
not benefit. |
-| **Bucket** | RANDOM, 3 buckets | Perfectly even distribution for small data.
|
-### Customization Points
-- **Updated dimensions:** Switch to `UNIQUE KEY(store_id)` with MoW if
dimension data changes
-- **Colocation JOINs:** Switch to `HASH(store_id)` and match the fact table's
bucket count
-
----
-
-## Use Case: Time-Series Logs & Events
-For immutable, append-only data: application logs, click events, IoT sensor
readings, audit trails.
-### Template
-```sql
-CREATE TABLE app_events (
- event_time DATETIME NOT NULL,
- app_id INT NOT NULL,
- event_type VARCHAR(50) NOT NULL,
- user_id BIGINT,
- payload STRING
-) ENGINE=OLAP
-DUPLICATE KEY(event_time, app_id, event_type)
-PARTITION BY RANGE(event_time) ()
-DISTRIBUTED BY HASH(app_id) BUCKETS 10 -- high-volume logs: ~50 GB/day
compressed
-PROPERTIES (
- "dynamic_partition.enable" = "true",
- "dynamic_partition.time_unit" = "DAY",
- "dynamic_partition.start" = "-7",
- "dynamic_partition.end" = "3",
- "dynamic_partition.prefix" = "p",
- "compression" = "zstd"
-);
-```
-### Why This Design
-| Decision | Choice | Why |
-|----------|--------|-----|
-| **Model** | DUPLICATE | Logs are immutable — no updates/deletes needed.
Fastest scan speed. |
-| **Partition** | Dynamic RANGE by DAY | Auto-creates daily partitions,
auto-drops old ones (TTL via `start`). |
-| **Bucket** | HASH on `app_id` | High cardinality, commonly filtered in
WHERE. |
-| **Sort Key** | `(event_time, app_id, event_type)` | Time-first enables range
scans; app_id enables prefix pruning. |
-| **Compression** | ZSTD | Log data has high redundancy — ZSTD compresses 2-3x
better than LZ4. |
-### Customization Points
-- **Retention:** Change `dynamic_partition.start` (e.g., `"-30"` for 30 days)
-- **Partition granularity:** `"HOUR"` for very high volume or `"MONTH"` for
low volume
-- **Bucket key:** If you mostly filter by `user_id`, use `HASH(user_id)`
instead
-- **Add text search:** Add `INDEX idx_payload(payload) USING INVERTED` for log
message search
-
----
-
-## Use Case: Observability Platform
-For building a full observability stack: logs + metrics + traces.
-### Table 1: Logs (DUPLICATE)
-```sql
-CREATE TABLE otel_logs (
- log_time DATETIME NOT NULL, service_name VARCHAR(100) NOT NULL,
- severity VARCHAR(10) NOT NULL, trace_id VARCHAR(32), span_id VARCHAR(16),
- body TEXT, resource_attributes STRING,
- INDEX idx_body(body) USING INVERTED PROPERTIES("parser" = "unicode")
-) ENGINE=OLAP DUPLICATE KEY(log_time, service_name, severity)
-PARTITION BY RANGE(log_time) ()
-DISTRIBUTED BY HASH(service_name) BUCKETS 10 -- high-volume logs: ~100 GB/day
x 0.05 ZSTD / 2 GB
-PROPERTIES
("dynamic_partition.enable"="true","dynamic_partition.time_unit"="DAY",
- "dynamic_partition.start"="-7","dynamic_partition.end"="3",
- "dynamic_partition.prefix"="p","compression"="zstd",
- "bloom_filter_columns"="trace_id");
-```
-### Table 2: Traces (DUPLICATE)
-```sql
-CREATE TABLE otel_traces (
- start_time DATETIME NOT NULL, service_name VARCHAR(100) NOT NULL,
- trace_id VARCHAR(32) NOT NULL, span_id VARCHAR(16) NOT NULL,
- parent_span_id VARCHAR(16), operation_name VARCHAR(200),
- duration_ms BIGINT, status_code TINYINT
-) ENGINE=OLAP DUPLICATE KEY(start_time, service_name, trace_id)
-PARTITION BY RANGE(start_time) ()
-DISTRIBUTED BY HASH(service_name) BUCKETS 3 -- traces: lower volume than
logs, small per partition
-PROPERTIES
("dynamic_partition.enable"="true","dynamic_partition.time_unit"="DAY",
- "dynamic_partition.start"="-7","dynamic_partition.end"="3",
- "dynamic_partition.prefix"="p","compression"="zstd",
- "bloom_filter_columns"="trace_id");
-```
-### Table 3: Metrics (AGGREGATE)
-```sql
-CREATE TABLE otel_metrics (
- metric_time DATETIME NOT NULL, service_name VARCHAR(100) NOT NULL,
- metric_name VARCHAR(200) NOT NULL,
- value DOUBLE SUM DEFAULT "0", count BIGINT SUM DEFAULT "0"
-) ENGINE=OLAP AGGREGATE KEY(metric_time, service_name, metric_name)
-PARTITION BY RANGE(metric_time) ()
-DISTRIBUTED BY HASH(service_name) BUCKETS 3 -- pre-aggregated metrics: small
per partition
-PROPERTIES
("dynamic_partition.enable"="true","dynamic_partition.time_unit"="DAY",
-
"dynamic_partition.start"="-30","dynamic_partition.end"="3","dynamic_partition.prefix"="p");
-```
-### Design Principles
-- **Shared bucket key:** All tables use `HASH(service_name)` for potential
colocation JOINs
-- **Short retention for logs/traces:** 7-day TTL; **Longer for metrics:**
30-day TTL
-- **Text search on logs:** Inverted index on `body` with unicode parser
-- **Trace correlation:** BloomFilter on `trace_id` for fast trace lookup
-
----
-
-## Use Case: User-Facing Point Query Analytics
-For serving real-time analytics through APIs with low-latency,
high-concurrency requirements.
-### Template
-```sql
-CREATE TABLE user_profiles (
- user_id BIGINT NOT NULL, tenant_id INT NOT NULL,
- name VARCHAR(100), email VARCHAR(200), last_login DATETIME,
- total_orders INT, lifetime_value DECIMAL(12,2)
-) ENGINE=OLAP UNIQUE KEY(user_id)
-DISTRIBUTED BY HASH(user_id) BUCKETS 5 -- user profiles: ~10 GB compressed
-PROPERTIES (
- "enable_unique_key_merge_on_write" = "true",
- "store_row_column" = "true",
- "light_schema_change" = "true",
- "bloom_filter_columns" = "tenant_id"
-);
-```
-### Why This Design
-| Decision | Choice | Why |
-|----------|--------|-----|
-| **Model** | UNIQUE MoW | Fast reads (no merge-sort at query time) |
-| **store_row_column** | `true` | Enables row-store mode for point queries —
reads full row from one I/O |
-| **BloomFilter** | On `tenant_id` | Skips tablets that don't contain the
tenant |
-### Optimized Point Query Pattern
-```sql
-SELECT * FROM user_profiles WHERE user_id = 12345;
--- With Prepared Statement for high concurrency:
-PREPARE stmt FROM 'SELECT * FROM user_profiles WHERE user_id = ?';
-SET @uid = 12345; EXECUTE stmt USING @uid;
-```
-
----
-
-## Use Case: Star Schema / JOIN-Heavy Analytics
-Uses colocation to ensure JOINs execute locally without network shuffle.
-### Fact Table
-```sql
-CREATE TABLE fact_orders (
- order_date DATE NOT NULL, order_id BIGINT NOT NULL,
- user_id INT NOT NULL, store_id INT NOT NULL, product_id INT NOT NULL,
- amount DECIMAL(12,2), quantity INT
-) ENGINE=OLAP DUPLICATE KEY(order_date, order_id)
-PARTITION BY RANGE(order_date) ()
-DISTRIBUTED BY HASH(store_id) BUCKETS 16
-PROPERTIES
("dynamic_partition.enable"="true","dynamic_partition.time_unit"="MONTH",
- "dynamic_partition.start"="-24","dynamic_partition.end"="1",
- "dynamic_partition.prefix"="p",
- "colocate_with"="group_orders");
-```
-### Dimension Table (colocated)
-```sql
-CREATE TABLE dim_stores (
- store_id INT NOT NULL, region VARCHAR(20), city VARCHAR(50), manager_name
VARCHAR(100)
-) ENGINE=OLAP DUPLICATE KEY(store_id)
-DISTRIBUTED BY HASH(store_id) BUCKETS 16
-PROPERTIES ("colocate_with" = "group_orders");
-```
-### Colocation Rules — ALL must match:
-1. Same `colocate_with` group name
-2. Same bucket key column(s) and same column types
-3. Same bucket count
-4. Same replication_num
-
----
-
-## Data Model Rules
-
-## AGGREGATE Model Cannot UPDATE or DELETE
-
-**Impact: CRITICAL — AGG tables do not support UPDATE or DELETE statements.**
-
-The AGGREGATE model auto-merges rows with the same key using aggregation
functions (SUM, MAX, REPLACE, etc). This means:
-- No `UPDATE` statement support
-- No `DELETE` statement support
-- No CDC pipeline compatibility (cannot reflect deletes)
-
-**Exception: REPLACE_IF_NOT_NULL** — The one case where AGG can do partial
updates. Columns with `REPLACE_IF_NOT_NULL` will only update when the incoming
value is non-NULL. This allows partial column updates in AGG model, but still
no DELETE.
-
-**Incorrect:**
-
-```sql
--- BAD: AGG for data that needs updates
-CREATE TABLE orders (
- order_id BIGINT AGGREGATE KEY,
- status VARCHAR(20) REPLACE,
- amount DECIMAL(12,2) REPLACE
-) AGGREGATE KEY(order_id);
--- Cannot DELETE cancelled orders!
-```
-
-**Correct:**
-
-```sql
--- GOOD: UNIQUE MoW for updatable data
-CREATE TABLE orders (
- order_id BIGINT NOT NULL,
- status VARCHAR(20),
- amount DECIMAL(12,2)
-) UNIQUE KEY(order_id)
-DISTRIBUTED BY HASH(order_id) BUCKETS 5 -- ~10 GB compressed
-PROPERTIES ("enable_unique_key_merge_on_write" = "true");
-```
-
-Reference: [Aggregate
Model](https://doris.apache.org/docs/table-design/data-model/aggregate)
-
----
-
-## Choose Data Model for Workload
-
-**Impact: CRITICAL — Data model cannot be changed after table creation.**
-
-| Model | Use When | Key Trait |
-|-------|----------|-----------|
-| DUPLICATE | Append-only data (logs, events, clicks) | Fastest scan, keeps
all rows |
-| UNIQUE (MoW) | Rows are updated/deleted (CDC, user data) | Dedup on primary
key, supports DELETE |
-| AGGREGATE | Pre-aggregated metrics only (counters, sums) | Auto-aggregates
on ingest |
-
-**Decision rule:** If you need UPDATE or DELETE → UNIQUE MoW. If append-only →
DUPLICATE. Only use AGGREGATE when you will *never* query raw rows.
-
-**Incorrect:**
-
-```sql
--- BAD: Using AGGREGATE for a table that needs updates
-CREATE TABLE users (
- user_id BIGINT AGGREGATE KEY,
- name VARCHAR(100) REPLACE,
- email VARCHAR(200) REPLACE
-) AGGREGATE KEY(user_id);
--- Cannot DELETE rows. Cannot run UPDATE statements.
-```
-
-**Correct:**
-
-```sql
--- GOOD: UNIQUE MoW for updatable data
-CREATE TABLE users (
- user_id BIGINT NOT NULL,
- name VARCHAR(100),
- email VARCHAR(200)
-) UNIQUE KEY(user_id)
-DISTRIBUTED BY HASH(user_id) BUCKETS 5 -- ~10 GB compressed
-PROPERTIES ("enable_unique_key_merge_on_write" = "true");
-```
-
-Reference: [Data Model
Overview](https://doris.apache.org/docs/table-design/data-model/overview)
-
----
-
-## Always Use Merge-on-Write (MoW) for UNIQUE Tables
-
-**Impact: CRITICAL — MoR requires runtime merge-sort, making reads 2-10×
slower.**
-
-Since Doris 2.1, MoW is the default. For older versions, always set explicitly:
-
-```sql
-PROPERTIES ("enable_unique_key_merge_on_write" = "true")
-```
-
-**MoW vs MoR:**
-
-| Feature | MoW | MoR |
-|---------|-----|-----|
-| Read speed | Fast (pre-merged) | Slow (merge at query) |
-| Write speed | Slightly slower | Faster writes |
-| DELETE support | Yes | Yes |
-| Partial update | Yes | Limited |
-
-**Always use MoW** unless you have a write-heavy workload with minimal reads.
-
-Reference: [Unique Key
Model](https://doris.apache.org/docs/table-design/data-model/unique)
-
----
-
-## Sequence Column for Out-of-Order CDC
-
-**Impact: HIGH — CDC events may arrive out of order; older records can
overwrite newer ones.**
-
-When using UNIQUE KEY with CDC (Flink CDC, Debezium, Canal), set a sequence
column to guarantee ordering:
-
-```sql
-CREATE TABLE users (
- user_id BIGINT NOT NULL,
- update_time DATETIME NOT NULL,
- name VARCHAR(100),
- email VARCHAR(200)
-) UNIQUE KEY(user_id)
-DISTRIBUTED BY HASH(user_id) BUCKETS 5 -- CDC table: ~10 GB compressed
-PROPERTIES (
- "enable_unique_key_merge_on_write" = "true",
- "function_column.sequence_col" = "update_time"
-);
-```
-
-**How it works:** When two rows have the same primary key, Doris keeps the one
with the *higher* sequence column value, regardless of arrival order.
-
-**When to use:** Any CDC pipeline where events may arrive out of order
(network delays, reprocessing, multiple sources).
-
-Reference: [Sequence
Column](https://doris.apache.org/docs/table-design/data-model/unique#sequence-column)
-
----
-
-## Partition Rules
-
-## AUTO PARTITION for Sporadic Data
-
-**Impact: HIGH — Avoids creating empty partitions for dates with no data.**
-
-Use AUTO PARTITION when data arrival is unpredictable (not every day/month has
data):
-
-```sql
-CREATE TABLE sparse_events (
- event_time DATETIME NOT NULL,
- user_id BIGINT,
- event_type VARCHAR(50)
-) DUPLICATE KEY(event_time, user_id)
-AUTO PARTITION BY RANGE(date_trunc(event_time, 'month'))
-()
-DISTRIBUTED BY HASH(user_id) BUCKETS 5; -- sporadic events: ~10 GB compressed
-```
-
-**When to use AUTO vs DYNAMIC:**
-- **Dynamic:** Continuous data (logs, metrics) — pre-creates future partitions
-- **Auto:** Sporadic data (user uploads, batch jobs) — creates only when data
arrives
-
-Reference: [Auto
Partition](https://doris.apache.org/docs/table-design/data-partitioning/auto-partitioning)
-
----
-
-## Dynamic Partition for Automated Data Lifecycle
-
-**Impact: HIGH — Automates partition creation and TTL-based data cleanup.**
-
-Key properties:
-- `time_unit`: DAY, WEEK, MONTH
-- `start`: Negative number = how many units to keep (TTL). E.g., `"-7"` keeps
7 days.
-- `end`: Positive number = how many future partitions to pre-create
-- `buckets`: AUTO or fixed number per partition
-
-```sql
-PROPERTIES (
- "dynamic_partition.enable" = "true",
- "dynamic_partition.time_unit" = "DAY",
- "dynamic_partition.start" = "-7",
- "dynamic_partition.end" = "3",
- "dynamic_partition.prefix" = "p"
-);
-```
-
-**Warning: Do not use dynamic partition for tables with < 20 million rows.**
For small tables, it creates wasteful empty partitions that go unnoticed and
add unnecessary metadata overhead. Use a single partition or no partition
instead.
-
-Reference: [Dynamic
Partition](https://doris.apache.org/docs/table-design/data-partitioning/dynamic-partitioning)
-
----
-
-## RANGE Partition on Time Column for Time-Series
-
-**Impact: CRITICAL — Partition key cannot be changed after table creation.**
-
-For any table with a time column (logs, events, metrics), use RANGE partition:
-
-```sql
-PARTITION BY RANGE(event_time) ()
-```
-
-Combined with dynamic partitioning, this enables:
-- **Partition pruning:** `WHERE event_time > '2025-01-01'` only scans relevant
partitions
-- **TTL:** Old partitions are auto-dropped
-- **Parallel loading:** Each partition can be loaded independently
-
-**Incorrect:**
-
-```sql
--- BAD: No partition on a time-series table
-CREATE TABLE events (
- event_time DATETIME, user_id BIGINT, data STRING
-) DUPLICATE KEY(event_time)
-DISTRIBUTED BY HASH(user_id) BUCKETS 8;
--- Every query scans ALL data, no TTL possible
-```
-
-**Correct:**
-
-```sql
-CREATE TABLE events (
- event_time DATETIME NOT NULL, user_id BIGINT, data STRING
-) DUPLICATE KEY(event_time, user_id)
-PARTITION BY RANGE(event_time) ()
-DISTRIBUTED BY HASH(user_id) BUCKETS 10 -- time-series events: ~50 GB/day
compressed
-PROPERTIES (
- "dynamic_partition.enable" = "true",
- "dynamic_partition.time_unit" = "DAY",
- "dynamic_partition.start" = "-30",
- "dynamic_partition.end" = "3",
- "dynamic_partition.prefix" = "p"
-);
-```
-
-Reference: [Range
Partition](https://doris.apache.org/docs/table-design/data-partitioning/range-partitioning)
-
----
-
-## Skip Partitioning for Small Tables (< 1 GB)
-
-**Impact: HIGH — Unnecessary partitioning on small tables wastes metadata and
hurts performance.**
-
-If a table is under 1 GB total, do not partition it. Just use bucketing:
-
-```sql
--- GOOD: Small dimension table, no partition needed
-CREATE TABLE dim_countries (
- country_code VARCHAR(3) NOT NULL,
- country_name VARCHAR(100),
- region VARCHAR(50)
-) DUPLICATE KEY(country_code)
-DISTRIBUTED BY HASH(country_code) BUCKETS 3;
-```
-
-**Rule of thumb:**
-- < 1 GB → No partition
-- 1-100 GB → Consider partition if time-series
-- > 100 GB → Always partition
-
-Reference: [Data
Partitioning](https://doris.apache.org/docs/table-design/data-partitioning/data-distribution)
-
----
-
-## Bucket Rules
-
-## Cloud Mode Requires HASH Bucketing for MoW
-
-**Impact: HIGH — UNIQUE MoW tables in cloud mode must use HASH bucketing.**
-
-In Cloud mode (storage-compute separation), RANDOM bucketing is not supported
for UNIQUE KEY tables with MoW enabled.
-
-```sql
--- GOOD: Cloud MoW with HASH
-CREATE TABLE users (...)
-UNIQUE KEY(user_id)
-DISTRIBUTED BY HASH(user_id) BUCKETS 5 -- cloud MoW: ~10 GB compressed
-PROPERTIES (
- "enable_unique_key_merge_on_write" = "true",
- "replication_num" = "1" -- cloud mode
-);
-```
-
-
----
-
-## Use Composite Bucket Key to Fix Data Skew
-
-**Impact: HIGH (Single skewed bucket key concentrates data; composite key
distributes it evenly)**
-
-When a single column has uneven distribution (some values appear much more
than others), use a composite bucket key.
-
-**Incorrect (single skewed key):**
-
-```sql
--- BAD: site_id 54321 has 80% of traffic → bucket_4 is 80% of data
-CREATE TABLE site_access (
- site_id INT, city_code INT, pv BIGINT
-) DUPLICATE KEY(site_id, city_code)
-DISTRIBUTED BY HASH(site_id) BUCKETS 16;
-```
-
-**Correct (composite key):**
-
-```sql
--- GOOD: combining site_id + city_code distributes the hot site across cities
-CREATE TABLE site_access (
- site_id INT, city_code INT, pv BIGINT
-) DUPLICATE KEY(site_id, city_code)
-DISTRIBUTED BY HASH(site_id, city_code) BUCKETS 16;
--- Data for site_id=54321 is now spread across multiple buckets by city.
-```
-
-**When to use composite keys:**
-- One column dominates traffic (e.g., one large customer)
-- Single-column hash creates visible hot tablets
-- Detect with: `SHOW TABLETS FROM table_name` and check size distribution
-
-Reference: [Data
Distribution](https://doris.apache.org/docs/table-design/data-partitioning/data-distribution)
-
----
-
-## HASH Bucket for Pruning; RANDOM Only for DUP Full-Scans
-
-**Impact: CRITICAL — Bucket key cannot be changed after table creation.**
-
-- **HASH bucket:** Filters on the bucket column prune to a single tablet →
fast point queries
-- **RANDOM bucket:** Guarantees perfectly even distribution but no query
pruning
-
-| Feature | HASH | RANDOM |
-|---------|------|--------|
-| Query pruning | Yes (WHERE on bucket col) | No |
-| Data distribution | Depends on column cardinality | Always even |
-| Supported models | ALL | DUPLICATE only |
-
-**30% Skew Rule:** If your chosen bucket column has >30% data skew (one value
dominates), you must either use RANDOM bucketing or switch to a composite key.
Detect with `SHOW TABLETS FROM table_name` and check size distribution.
-
-**Incorrect:**
-
-```sql
--- BAD: RANDOM on a UNIQUE table (not supported)
-CREATE TABLE users (...) UNIQUE KEY(user_id)
-DISTRIBUTED BY RANDOM BUCKETS 8;
-```
-
-**Correct:**
-
-```sql
--- GOOD: HASH on primary key for UNIQUE
-CREATE TABLE users (...) UNIQUE KEY(user_id)
-DISTRIBUTED BY HASH(user_id) BUCKETS 8; -- ~15 GB compressed
-
--- GOOD: RANDOM for DUP when no filter column is obvious
-CREATE TABLE raw_logs (...) DUPLICATE KEY(log_time)
-DISTRIBUTED BY RANDOM BUCKETS 3; -- small table or full-scan workload
-```
-
-Reference: [Data
Distribution](https://doris.apache.org/docs/table-design/data-partitioning/data-distribution)
-
----
-
-## Choose High-Cardinality, Low-NULL Bucket Key
-
-**Impact: HIGH — Low-cardinality keys cause uneven data distribution.**
-
-Good bucket key properties:
-1. **High cardinality** — many distinct values (user_id, order_id)
-2. **Low NULL rate** — NULLs all hash to the same bucket
-3. **Used in WHERE/JOIN** — enables partition pruning
-
-**Bad choices:** status (3 values), gender (2 values), country (200 values for
global data)
-**Good choices:** user_id, order_id, device_id, session_id
-
-Reference: [Data
Distribution](https://doris.apache.org/docs/table-design/data-partitioning/data-distribution)
-
----
-
-## Target 1-10 GB Per Tablet
-
-**Impact: HIGH — Tablet size directly affects query parallelism and metadata
overhead.**
-
-Rules:
-- **Target:** 1-10 GB per tablet (compressed)
-- **Max buckets per partition:** ≤ 64
-- **Preferred:** Calculate explicit bucket count: `daily_data_GB /
target_tablet_GB`
-
-```sql
--- GOOD: Explicit count from sizing math
--- 14 GB/day partition, target ~2 GB/tablet → 8 buckets
-DISTRIBUTED BY HASH(user_id) BUCKETS 8
-
--- 3 GB/day partition, target ~1 GB/tablet → 4 buckets
-DISTRIBUTED BY HASH(order_id) BUCKETS 4
-
--- 136 GB/day partition, target ~4 GB/tablet → 32 buckets
-DISTRIBUTED BY HASH(conn_id) BUCKETS 32
-```
-
-Always write a numeric bucket count. Automatic bucket sizing obscures intent
and may pick suboptimal counts; if volume is unknown, use a conservative
explicit fallback such as 3, 8, 16, or 32 buckets based on expected table size.
-
-**Warning signs:**
-- Tablets < 100 MB → Too many buckets, merge or reduce
-- Tablets > 20 GB → Too few buckets, increase count
-
-Reference: [Data
Distribution](https://doris.apache.org/docs/table-design/data-partitioning/data-distribution)
-
----
-
-## Sort Key Rules
-
-## Avoid FLOAT/DOUBLE in Sort Key
-
-**Impact: HIGH — FLOAT/DOUBLE columns disable prefix index and ZoneMap
pruning.**
-
-- FLOAT and DOUBLE **cannot participate** in the prefix index
-- They also **disable ZoneMap** statistics for that column
-- Use DECIMAL instead for numeric precision
-
-```sql
--- BAD: FLOAT in key breaks prefix index
-DUPLICATE KEY(price, product_id) -- price is FLOAT
-
--- GOOD: Use DECIMAL
-CREATE TABLE products (
- product_id INT NOT NULL,
- price DECIMAL(10,2) NOT NULL,
- name VARCHAR(200)
-) DUPLICATE KEY(product_id, price);
-```
-
-Reference: [Prefix
Index](https://doris.apache.org/docs/table-design/index/prefix-index)
-
----
-
-## Cluster Key to Decouple Sort from Primary Key
-
-**Impact: HIGH — UNIQUE tables sort data by primary key, but queries may
filter on other columns.**
-
-For UNIQUE MoW tables, the primary key determines dedup but the **Cluster
Key** determines physical sort order:
-
-```sql
-CREATE TABLE users (
- user_id BIGINT NOT NULL,
- region VARCHAR(20),
- name VARCHAR(100)
-) UNIQUE KEY(user_id)
-DISTRIBUTED BY HASH(user_id) BUCKETS 5 -- ~10 GB compressed
-PROPERTIES (
- "enable_unique_key_merge_on_write" = "true",
- "cluster_key" = "region, user_id"
-);
--- Dedup by user_id, but data sorted by region first (better for
region-filtered queries)
-```
-
-**When to use:** When your most common WHERE clause filters on columns that
are NOT your primary key.
-
-Reference: [Cluster
Key](https://doris.apache.org/docs/table-design/index/prefix-index#cluster-key)
-
----
-
-## Fixed-Length Types (INT/DATE) Before VARCHAR in Key
-
-**Impact: HIGH — A VARCHAR column terminates the prefix index.**
-
-Prefix index rules:
-- Max 3 columns OR 36 bytes (whichever comes first)
-- **VARCHAR terminates the index** — only the first 20 bytes of the VARCHAR
are included, and no more columns after it
-
-```sql
--- GOOD: Fixed-length types first
-DUPLICATE KEY(user_id, event_date, event_type)
--- INT(4B) DATE(4B) VARCHAR(→ terminates)
-
--- BAD: VARCHAR first wastes the prefix index
-DUPLICATE KEY(event_type, user_id, event_date)
--- VARCHAR(→ terminates after 20B, user_id and event_date NOT in
index)
-```
-
-Reference: [Prefix
Index](https://doris.apache.org/docs/table-design/index/prefix-index)
-
----
-
-## Prefix Index Limits — 3 Columns, 36 Bytes
-
-**Impact: HIGH — Columns beyond the limit get no prefix index benefit.**
-
-| Rule | Detail |
-|------|--------|
-| Max columns | 3 |
-| Max bytes | 36 |
-| VARCHAR | Terminates the index (only first 20 bytes included) |
-| FLOAT/DOUBLE | **Cannot be in prefix index** — breaks ZoneMap too |
-
-**Example:**
-
-```sql
--- user_id(BIGINT=8B) + event_date(DATE=4B) + app_id(INT=4B) = 16B, 3 cols ✓
-DUPLICATE KEY(user_id, event_date, app_id)
-
--- If you need a 4th filter column, add a separate index (BloomFilter,
Inverted)
-```
-
-Reference: [Prefix
Index](https://doris.apache.org/docs/table-design/index/prefix-index)
-
----
-
-## High-Selectivity Columns First in Sort Key
-
-**Impact: CRITICAL — Sort key order determines query performance through
prefix index.**
-
-Place columns in this order:
-1. Most frequently filtered columns first
-2. Higher selectivity (more distinct values) before lower
-3. Equality filters before range filters
-
-```sql
--- GOOD: user_id (high selectivity, common filter) first
-DUPLICATE KEY(user_id, event_time, event_type)
-
--- BAD: event_type (low selectivity) first
-DUPLICATE KEY(event_type, event_time, user_id)
-```
-
-The first 3 columns (or first 36 bytes) form the **prefix index**, which is
the primary lookup structure in Doris.
-
-Reference: [Sort
Key](https://doris.apache.org/docs/table-design/index/prefix-index)
-
----
-
-## Data Type Rules
-
-## BITMAP_UNION for Exact Count-Distinct and Funnel Analysis
-**Impact: HIGH — COUNT(DISTINCT col) on high-cardinality columns is extremely
expensive. BITMAP gives exact results with O(1) merge.**
-```sql
-CREATE TABLE daily_uv (
- dt DATE NOT NULL,
- page VARCHAR(200) NOT NULL,
- uv BITMAP BITMAP_UNION
-) AGGREGATE KEY(dt, page)
-DISTRIBUTED BY HASH(page) BUCKETS 5; -- ~10 GB compressed
--- Insert with to_bitmap():
-INSERT INTO daily_uv SELECT '2025-01-01', '/home', to_bitmap(user_id) FROM
events;
--- Query exact UV:
-SELECT dt, bitmap_count(uv) AS unique_visitors FROM daily_uv GROUP BY dt;
-```
-**Funnel / Retention with bitmap_intersect():**
-```sql
-SELECT bitmap_count(bitmap_intersect(uv)) AS retained_users
-FROM daily_uv WHERE dt IN ('2025-01-01', '2025-01-02');
-```
-**Orthogonal analysis:** Use INTERSECT_COUNT for multi-dimensional user
segmentation.
-Reference:
[BITMAP](https://doris.apache.org/docs/sql-manual/data-types/aggregate/BITMAP)
-
----
-
-## Doris-Specific Type Gotchas
-**DATETIME precision:** Default DATETIME is seconds. Use `DATETIME(3)` for
milliseconds, `DATETIME(6)` for microseconds.
-**STRING vs VARCHAR:** STRING cannot be used as key, partition, or bucket
column. Use VARCHAR for keyed columns.
-**VARCHAR(65533) has identical performance to VARCHAR(255)** — Doris uses
variable-length storage, so there's no penalty for a larger max. When unsure,
use VARCHAR(65533).
-**CHAR vs VARCHAR:** CHAR pads with spaces to fixed length. Only use CHAR for
truly fixed-width codes (country_code CHAR(3)). Otherwise, prefer VARCHAR.
-```sql
--- GOOD: Appropriate types
-CREATE TABLE t (
- ts DATETIME(3), -- millisecond precision
- country CHAR(3), -- fixed 3-char code
- name VARCHAR(65533), -- variable, no perf penalty
- big_text STRING -- only for non-key, non-partitioned columns
-);
-```
-
----
-
-## Use Native Types, Not STRING for Everything
-**Impact: HIGH — STRING columns cannot be used as partition, bucket, or sort
key columns.**
-Always use the most specific type: INT for numbers, DATE/DATETIME for
timestamps, DECIMAL for money.
-```sql
--- BAD
-CREATE TABLE t (id STRING, ts STRING, amount STRING);
--- GOOD
-CREATE TABLE t (id BIGINT, ts DATETIME(3), amount DECIMAL(12,2));
-```
-Reference: [Data
Types](https://doris.apache.org/docs/sql-manual/data-types/overview)
-
----
-
-## VARIANT Type for Semi-Structured JSON Data
-**Impact: HIGH — VARIANT provides columnar storage for JSON with automatic
type inference.**
-Use VARIANT instead of JSON or STRING for semi-structured data:
-```sql
-CREATE TABLE events (
- event_time DATETIME NOT NULL,
- event_id BIGINT NOT NULL,
- data VARIANT
-) DUPLICATE KEY(event_time, event_id)
-DISTRIBUTED BY HASH(event_id) BUCKETS 5; -- ~10 GB compressed
--- Query nested fields directly:
-SELECT data['user']['name'], data['action'] FROM events;
-```
-**Schema Template:** Pre-define expected fields for better columnar storage:
-```sql
-ALTER TABLE events SET ("variant_schema_template" = '{"user.name": "STRING",
"action": "STRING", "amount": "DOUBLE"}');
-```
-**Inverted Index on VARIANT fields:**
-```sql
-ALTER TABLE events ADD INDEX idx_action(CAST(data['action'] AS VARCHAR)) USING
INVERTED;
-```
-**MATCH search on VARIANT text fields:**
-```sql
-SELECT * FROM events WHERE CAST(data['message'] AS VARCHAR) MATCH 'error
timeout';
-```
-Reference: [VARIANT
Type](https://doris.apache.org/docs/sql-manual/data-types/semi-structured/VARIANT)
-
----
-
-## JSON/ARRAY/Complex Types Disable ZoneMap
-**Impact: HIGH — ZoneMap statistics are not generated for JSON, ARRAY, MAP,
STRUCT, or STRING columns.**
-ZoneMap stores min/max values per data page, enabling skip scanning. Complex
types disable this optimization.
-**Workaround:** Extract frequently filtered fields into dedicated columns with
native types.
-```sql
--- BAD: Filtering on JSON field — no ZoneMap, full scan
-SELECT * FROM events WHERE payload->'$.status' = 'error';
--- GOOD: Extract to a native column
-CREATE TABLE events (
- ..., status VARCHAR(20), payload JSON,
- INDEX idx_status(status) USING INVERTED
-);
-SELECT * FROM events WHERE status = 'error';
-```
-
----
-
-## Table Properties
-
-## Properties Cloud Mode Forces
-In Cloud mode (storage-compute separation):
-- `replication_num` is forced to `1` (data stored in object storage)
-- HASH bucketing required for UNIQUE MoW
-- File cache controls performance (see `schema-cache-file-cache`)
-```sql
-PROPERTIES ("replication_num" = "1");
-```
-
----
-
-## LZ4 Default vs ZSTD for Logs/Cold Data
-| Algorithm | Speed | Ratio | Use When |
-|-----------|-------|-------|----------|
-| LZ4 | Fastest | Lower | Default, hot data, low-latency |
-| ZSTD | Slower | 2-3× better | Logs, cold data, archival |
-```sql
--- For log/event tables with high redundancy:
-PROPERTIES ("compression" = "zstd");
--- For hot analytical tables (default):
-PROPERTIES ("compression" = "lz4");
-```
-
----
-
-## Index Rules
-
-## Bitmap Index for Medium-Cardinality Dimensions (100-100K distinct values)
-**Impact: MEDIUM — Efficient for columns with moderate cardinality used in
filters.**
-```sql
-CREATE INDEX idx_city ON table_name(city) USING BITMAP;
-```
-**Sweet spot:** 100 to 100,000 distinct values (status codes, cities,
categories).
-**Restrictions:**
-- Only one bitmap index can be created at a time (sequential schema change)
-- `DROP INDEX` is also a schema change and takes significant time
-- Only on value columns, not key columns
-- Segment V2 format required
-
----
-
-## BloomFilter for High-Cardinality Equality Filters
-**Impact: HIGH — Skips data pages that definitely don't contain the filtered
value.**
-Use for columns with ≥ 5000 distinct values, filtered with `=` or `IN`.
-```sql
--- Add BloomFilter index
-PROPERTIES ("bloom_filter_columns" = "trace_id, session_id");
-```
-**Constraints:**
-- NOT supported on TINYINT, FLOAT, or DOUBLE columns
-- Only accelerates `=` and `IN` filters (not LIKE, not range)
-- Minimum recommended cardinality: 5000+ distinct values
-- False positive rate ~1% (configurable via bloom_filter_fpp)
-- Do not use inline `INDEX ... USING BLOOM FILTER` in generated DDL; use the
table property above.
-
----
-
-## Inverted Index for Text Search and Range on Non-Key Columns
-**Impact: HIGH — Enables full-text search and efficient range filters without
modifying the sort key.**
-```sql
--- Text search with parser
-INDEX idx_body(body) USING INVERTED PROPERTIES("parser" = "unicode")
--- Equality/range on non-key column
-INDEX idx_status(status) USING INVERTED
-```
-**Parser options:** `none` (exact), `english`, `unicode` (multilingual),
`chinese` (CJK).
-**Supported filter types:** `=`, `IN`, `>`, `<`, `>=`, `<=`, `MATCH_ALL`,
`MATCH_ANY`, `MATCH_PHRASE`.
-Reference: [Inverted
Index](https://doris.apache.org/docs/table-design/index/inverted-index)
-
----
-
-## NGram BloomFilter for LIKE '%pattern%' Queries
-**Impact: HIGH — LIKE '%pattern%' causes full table scan without NGram index.**
-```sql
-INDEX idx_url(url) USING NGRAM_BF PROPERTIES("gram_size" = "3", "bf_size" =
"1024")
-```
-The NGram BloomFilter breaks text into 3-character grams and uses a bloom
filter to skip non-matching pages.
-**When to use:** `LIKE '%keyword%'` queries on VARCHAR/STRING columns.
-**When NOT to use:** Exact match (`=`) → use regular BloomFilter. Full-text
search → use Inverted Index.
-Reference: [NGram
BloomFilter](https://doris.apache.org/docs/table-design/index/ngram-bloomfilter-index)
-
----
-
-## Full-Text Search with MATCH Functions and BM25 Scoring
-**Impact: HIGH — Enables search-engine-like text queries with relevance
ranking.**
-**7 MATCH operators:** MATCH_ALL, MATCH_ANY, MATCH_PHRASE,
MATCH_PHRASE_PREFIX, MATCH_PHRASE_EDGE, MATCH_REGEXP, MATCH_ELEMENT_EQ.
-```sql
--- Setup: Inverted index with parser
-INDEX idx_content(content) USING INVERTED PROPERTIES(
- "parser" = "unicode", "support_phrase" = "true"
-)
--- Queries (WHERE clause only):
-WHERE content MATCH_ALL 'database analytics' -- all terms
-WHERE content MATCH_ANY 'database analytics' -- any term
-WHERE content MATCH_PHRASE 'real time analytics' -- exact phrase
-```
-**SEARCH() unified DSL:** Combine operators in one function:
-```sql
-WHERE SEARCH(content, '"real time" +analytics -legacy', 'parser=unicode')
-```
-**BM25 scoring:** Rank results by relevance:
-```sql
-SELECT doc_id, score() AS relevance FROM docs
-WHERE content MATCH_ANY 'doris analytics' ORDER BY relevance DESC;
-```
-**Custom analyzers, hybrid text+vector search, and VARIANT text search** are
all supported.
-Reference: [Full-Text
Search](https://doris.apache.org/docs/table-design/index/inverted-index)
-
----
-
-## HNSW/IVF Vector Index for ANN Search
-**Impact: HIGH — Enables approximate nearest neighbor search on embeddings.**
-Requires Doris 4.0+ with vector support.
-```sql
-CREATE TABLE embeddings (
- doc_id BIGINT NOT NULL,
- content VARCHAR(65533),
- embedding ARRAY<FLOAT> NOT NULL
-) DUPLICATE KEY(doc_id)
-DISTRIBUTED BY HASH(doc_id) BUCKETS 5; -- ~10 GB compressed
--- Add HNSW index:
-CREATE INDEX idx_vec ON embeddings(embedding) USING HNSW
-PROPERTIES("dim" = "768", "metric" = "cosine", "m" = "16", "ef_construction" =
"200");
--- Query:
-SELECT doc_id, l2_distance(embedding, [0.1, 0.2, ...]) AS dist
-FROM embeddings ORDER BY dist LIMIT 10;
-```
-**Index types:** HNSW (fast, memory-heavy), IVF_FLAT (balanced), IVF_PQ
(compressed).
-
----
-
-## Materialized Views
-
-## Async MV for Multi-Table JOIN Acceleration
-**Impact: HIGH — Pre-computes JOINs so queries read a flat table instead of
joining at runtime.**
-```sql
-CREATE MATERIALIZED VIEW mv_order_details
-REFRESH AUTO ON SCHEDULE EVERY 10 MINUTE
-AS SELECT o.order_id, o.amount, p.product_name, c.customer_name
-FROM orders o
-JOIN products p ON o.product_id = p.product_id
-JOIN customers c ON o.customer_id = c.customer_id;
-```
-**Refresh modes:** AUTO or COMPLETE with ON SCHEDULE (periodic), ON COMMIT (on
base table change, limit: ≤5 updates/hr), or manual refresh.
-Reference: [Async Materialized
View](https://doris.apache.org/docs/query-acceleration/materialized-view/async-materialized-view)
-
----
-
-## Async MV Operational Limits
-| Limit | Value |
-|-------|-------|
-| Max rows per MV | ~50 million |
-| Max JOINs | 2 |
-| Max partitions | 30 |
-| Max concurrent refreshes | 3 |
-| Cluster resource cap | 40% |
-| ON COMMIT limit | ≤ 5 updates/hour |
-**Capacity estimation:** ~20-30 active async MVs on a 3-node cluster.
-**Layered design pattern:** Build MVs on top of other MVs (Layer 1: base
aggregations, Layer 2: cross-table joins).
-**partition_sync_limit:** Focus refresh on recent data only:
-```sql
-CREATE MATERIALIZED VIEW mv_recent
-PROPERTIES ("partition_sync_limit" = "7")
-REFRESH AUTO ON SCHEDULE EVERY 1 HOUR
-AS SELECT ... FROM orders ...;
-```
-
----
-
-## Sync MV (Rollup) for Single-Table Aggregation
-**Impact: HIGH — Pre-aggregates data; optimizer rewrites queries
automatically.**
-```sql
-CREATE MATERIALIZED VIEW mv_daily_sales AS
-SELECT dt, store_id, SUM(amount) AS total, COUNT(*) AS cnt
-FROM orders GROUP BY dt, store_id;
-```
-**Restriction: NOT supported on UNIQUE KEY tables.** Use async MVs instead.
-Sync MVs are maintained synchronously with the base table — zero lag.
-Reference: [Sync Materialized
View](https://doris.apache.org/docs/query-acceleration/materialized-view/sync-materialized-view)
-
----
-
-## Caching Rules
-
-## File Cache Strategies (Cloud / Storage-Compute Separation)
-**Key metric: 60% of data cached locally = 95% cache hit rate.** Maintain 90%+
hit rate.
-Cache modes:
-- **LRU (default):** Evicts least recently used. Good for uniform access.
-- **TTL:** Time-based eviction. Good for time-series with clear hot/cold
boundary.
-**Table-level cache control:**
-```sql
--- Keep dimension tables cached forever
-ALTER TABLE dim_stores SET ("file_cache_ttl_seconds" = "0"); -- 0 = never
evict
--- Hot window for fact tables
-ALTER TABLE fact_orders SET ("file_cache_ttl_seconds" = "604800"); -- 7 days
-```
-**IOPS guidance:** SSD cache: ~500 IOPS/disk. HDD: ~200 IOPS/disk.
-
----
-
-## Query Cache and Partition Cache
-**Query cache:** Identical SQL → instant response (no computation).
-```sql
-SET enable_query_cache = true;
-```
-**Partition cache:** Only recomputes partitions with new data. Ideal for
time-series dashboards where most partitions are historical and unchanged.
-```sql
-SET enable_partition_cache = true;
-```
-**When to use:**
-- Query cache: Repeated identical queries (dashboard auto-refresh)
-- Partition cache: Time-series with mostly historical data
-
----
-
-## Sizing Guides
-
-## BE Sizing — Cloud / Storage-Compute Separation
-In cloud mode, data is in object storage (S3/GCS). BEs only cache hot data
locally.
-**CPU:** Same as integrated — determines query parallelism.
-**Memory:** 32-128 GB. Used for query execution, not data storage.
-**Local disk:** SSD for file cache. Size based on hot data ratio (60% cached =
95% hit rate).
-**Elasticity:** Can scale BE nodes independently from data volume. Add BEs for
more compute, not more storage.
-
----
-
-## BE Sizing — Integrated Storage Mode
-In integrated mode, each BE stores data locally on disk.
-**CPU:** 16-64 cores per BE. More cores = more concurrent query/scan threads.
-**Memory:** 64-256 GB per BE. Rule of thumb: 4 GB RAM per 1 TB stored data.
-**Disk:** SSD recommended. 1-10 TB per BE. Use RAID or multiple disks for
throughput.
-**Node count:** Start with 3 nodes for HA. Scale horizontally for more
throughput.
-
----
-
-## FE Node Sizing
-| Cluster Size | FE Nodes | Memory | CPU |
-|-------------|----------|--------|-----|
-| Small (< 10 BE) | 1 Leader + 2 Follower | 16 GB | 8 cores |
-| Medium (10-50 BE) | 1 Leader + 2 Follower | 32 GB | 16 cores |
-| Large (50+ BE) | 1 Leader + 4 Follower | 64 GB | 32 cores |
-FE nodes store metadata (table schemas, partitions, tablets) in memory. Scale
FE memory with table count and partition count rather than data volume.
-
----
-
-## Storage Calculation Formula
-```
-Raw data × compression ratio × replication_num = Required storage
-```
-**Compression ratios (typical):**
-| Data Type | LZ4 Ratio | ZSTD Ratio |
-|-----------|-----------|------------|
-| Structured (numeric) | 3-5× | 5-8× |
-| Logs/text | 5-10× | 10-20× |
-| JSON/semi-structured | 3-8× | 5-12× |
-**Example:** 1 TB/day raw logs × 30 days retention × ZSTD (10× compression) ×
1 replica = 3 TB storage
-**Include overhead:** Add 20-30% for metadata, compaction temp space, and
safety margin.
-
----
-
-## Getting Started
-
-> For {{CLOUD_PRODUCT_NAME}} setup, see the `{{CLOUD_OPS_SKILL}}` skill.
-
-## Getting Started — Self-Hosted / BYOC / On-Prem
-### Prerequisites
-- FE nodes: Java 8+ runtime
-- BE nodes: Linux with sufficient disk, memory
-- Network: FE and BE nodes must be able to communicate
-### Deployment Steps
-1. Deploy FE nodes (1 Leader + 2 Followers for HA)
-2. Deploy BE nodes (3+ for production)
-3. Register BEs with FE: `ALTER SYSTEM ADD BACKEND "<be_host>:9050";`
-4. Create database and tables
-### Connect
-```bash
-mysql -h <fe_host> -P 9030 -u root
-```
-### Self-Hosted Properties
-```sql
-PROPERTIES ("replication_num" = "3"); -- 3 replicas for HA
-```
-Reference: [Apache Doris
Installation](https://doris.apache.org/docs/install/cluster-deployment/standard-deployment)
-
----
diff --git a/skills/doris-best-practices/SKILL.md
b/skills/doris-best-practices/SKILL.md
index 26409f5..50a3a47 100644
--- a/skills/doris-best-practices/SKILL.md
+++ b/skills/doris-best-practices/SKILL.md
@@ -13,23 +13,23 @@ description: >
stacks such as Impala, Kudu, Elasticsearch/ES, Greenplum, Presto, HBase,
Hive, Hadoop, Redis, or Lambda-style multi-engine data platforms, even when
Apache Doris is not named explicitly.
- Also use when user provides a {{PRODUCT_NAME}} connection string or asks to
get started.
+ Also use when user provides an Apache Doris connection string or asks to get
started.
Also triggers on slow query investigation, query profiling, runtime
performance
diagnosis, tablet skew analysis, and table health checks — any scenario where
runtime evidence (profile output, tablet distribution) informs optimization.
- For Cloud operations (auth, cluster lifecycle, billing, networking), defer to
- the {{CLOUD_OPS_SKILL}} skill.
+ Cluster lifecycle, billing, and networking are managed-service operations,
out of
+ scope here — use your platform's cluster-management console for those.
license: Apache-2.0
metadata:
- author: Apache Doris Community
- version: "3.1.0"
+ author: tomz-alt
+ version: "0.1.0"
---
# Apache Doris Best Practices
> Problem-first table design intelligence for Apache Doris.
> 37 rules, 7 use case templates, 4 sizing guides.
-> All details in `references/` directory and compiled `AGENTS.md`.
+> All details in `references/` directory.
---
@@ -209,27 +209,29 @@ PROPERTIES (
## 3 ▸ Connection & CLI
-### Detect CLI
+Apache Doris speaks the MySQL protocol, so the **always-available** path is
any MySQL-compatible client (`mysql`) plus SQL and the FE HTTP REST API. Some
distributions also ship an **optional management CLI** (referred to here as
`doriscli`) that adds ergonomic profiling and diagnostics commands — use it
when your distribution provides one, otherwise use the native path.
-Before running any queries, detect the CLI binary:
+### Detect the optional CLI
-1. Check `{{CLI_PATH_ENV}}` env var — if set, use that binary path
-2. `command -v {{CLI}}` — use from PATH
+Before running any queries, detect whether the CLI binary is available:
+
+1. Check `DORIS_CLI_PATH` env var — if set, use that binary path
+2. `command -v doriscli` — use from PATH
3. If none available: fall back to `mysql` client (see `references/start-*.md`)
-### When {{CLI}} is available, prefer it for all operations:
+### When doriscli is available, prefer it for all operations:
-| Task | {{CLI}} Command |
+| Task | doriscli Command |
|------|-----------------|
-| Run SQL | `{{CLI}} sql "SELECT ..."` |
-| DDL inspection | `{{CLI}} sql "SHOW CREATE TABLE db.t"` |
-| Table/tablet health | `{{CLI}} tablet db.t` (overview) or `{{CLI}} tablet
db.t --detail` |
-| Profile a slow query | `{{CLI}} sql "SELECT ..." --profile` → captures
query_id |
-| Get query profile | `{{CLI}} profile get <qid>` or `--full` for complete
diagnosis |
-| Compare fast vs slow | `{{CLI}} profile diff <slow_qid> <fast_qid>` |
-| Performance trend | `{{CLI}} profile history <sql_pattern> --days 7` |
-| Test connection | `{{CLI}} auth status` |
-| Switch environment | `{{CLI}} use <name>` |
+| Run SQL | `doriscli sql "SELECT ..."` |
+| DDL inspection | `doriscli sql "SHOW CREATE TABLE db.t"` |
+| Table/tablet health | `doriscli tablet db.t` (overview) or `doriscli tablet
db.t --detail` |
+| Profile a slow query | `doriscli sql "SELECT ..." --profile` → captures
query_id |
+| Get query profile | `doriscli profile get <qid>` or `--full` for complete
diagnosis |
+| Compare fast vs slow | `doriscli profile diff <slow_qid> <fast_qid>` |
+| Performance trend | `doriscli profile history <sql_pattern> --days 7` |
+| Test connection | `doriscli auth status` |
+| Switch environment | `doriscli use <name>` |
### Runtime Query Investigation
@@ -238,15 +240,15 @@ For slow queries or runtime performance issues, read
`references/cli-investigati
- **Evidence first is mandatory**: collect or attempt profile, tablet, DDL,
stats, EXPLAIN, history, active-query, or connection evidence before forming
hypotheses. If evidence cannot be collected locally, state that and provide the
exact commands to run
- **Prefer existing profiles**: use `profile get <query_id>`, `profile list`,
or `profile history` before re-executing SQL
- **Proactive discovery**: for vague slow-query reports, start with `auth
status`, `profile list --active`, and recent `profile list` before asking the
user for more context
-- **Safety gate**: before running user SQL with `--profile`, check whether it
is safe (no DDL, no mutation, no unbounded scan). For unknown, peak-hour, or
expensive SQL, run `{{CLI}} sql "EXPLAIN <query>" --format json` first and ask
confirmation or request an existing query_id
+- **Safety gate**: before running user SQL with `--profile`, check whether it
is safe (no DDL, no mutation, no unbounded scan). For unknown, peak-hour, or
expensive SQL, run `doriscli sql "EXPLAIN <query>" --format json` first and ask
confirmation or request an existing query_id
- **Hypotheses, not verdicts**: diagnostic mappings are heuristics. Present
evidence, likely cause, what to check next, and when the conclusion may be wrong
-- If {{CLI}} is unavailable, fall back to SQL commands listed in the reference
+- If doriscli is unavailable, fall back to SQL commands listed in the reference
- Always use `--format json` for structured agent-readable output
### Quick-start guides
- `references/start-self-hosted.md` — Self-hosted / BYOC / on-prem
-- For {{CLOUD_PRODUCT_NAME}} setup, see the `{{CLOUD_OPS_SKILL}}` skill
+- Cloud mode (storage-compute) connection differs only in the HTTP port (8080
vs 8030) — same guide applies
---
diff --git a/skills/doris-best-practices/references/cli-investigation.md
b/skills/doris-best-practices/references/cli-investigation.md
index 7c9d317..77295cc 100644
--- a/skills/doris-best-practices/references/cli-investigation.md
+++ b/skills/doris-best-practices/references/cli-investigation.md
@@ -5,16 +5,18 @@ tags: [cli, investigation, diagnostics, profiling, slow-query]
## CLI-Based Query Investigation
-Evidence-first runtime diagnosis using {{CLI}}. Collect profile, tablet, DDL,
stats, EXPLAIN, or connection evidence before forming hypotheses. Do not jump
to schema fixes.
+Evidence-first runtime diagnosis using doriscli. Collect profile, tablet, DDL,
stats, EXPLAIN, or connection evidence before forming hypotheses. Do not jump
to schema fixes.
+
+> `doriscli` is the **Apache Doris CLI** — install it for ergonomic profiling
and diagnostics. If it is not installed, use the **Native SQL + HTTP path** at
the end of this document — it is always available and the diagnostic reasoning
is identical.
---
### Binary Policy
-- Primary command: `{{CLI}}`
+- Primary command: `doriscli`
- Detection order:
- 1. `{{CLI_PATH_ENV}}` environment variable → use that binary path
- 2. `command -v {{CLI}}` → use from PATH
+ 1. `DORIS_CLI_PATH` environment variable → use that binary path
+ 2. `command -v doriscli` → use from PATH
3. If none available: use fallback SQL (see end of this document)
### Safety Policy
@@ -27,8 +29,8 @@ Evidence-first runtime diagnosis using {{CLI}}. Collect
profile, tablet, DDL, st
### Evidence-First Hard Gate
- Do not present root causes, DDL rewrites, materialized views, index changes,
bucket changes, or tuning recommendations until at least one evidence source
has been collected or attempted
-- Valid evidence sources: `profile get`, `profile diff`, `profile history`,
`profile list --active`, `tablet`, `{{CLI}} sql "EXPLAIN ..."`, fallback SQL
output, `auth status`, or an explicit user-provided profile/tablet/EXPLAIN
snippet
-- If a command cannot be executed locally because `{{CLI}}` or credentials are
unavailable, say that evidence could not be collected here, then provide the
exact commands for the user to run. Do not stop at "install {{CLI}}"
+- Valid evidence sources: `profile get`, `profile diff`, `profile history`,
`profile list --active`, `tablet`, `doriscli sql "EXPLAIN ..."`, fallback SQL
output, `auth status`, or an explicit user-provided profile/tablet/EXPLAIN
snippet
+- If a command cannot be executed locally because `doriscli` or credentials
are unavailable, say that evidence could not be collected here, then provide
the exact commands for the user to run. Do not stop at "install doriscli"
- If no evidence is available yet, the correct response is a short
investigation plan plus the first read-only command to run, not a diagnosis
- After collecting evidence, report findings as hypotheses with confidence and
caveats, not as guaranteed root causes
@@ -36,7 +38,7 @@ Evidence-first runtime diagnosis using {{CLI}}. Collect
profile, tablet, DDL, st
### Connection-First Rule
-If the user reports that `{{CLI}} sql`, `{{CLI}} profile get`, or any CLI
command **fails, times out, or returns an error**, treat it as a
connection-layer problem first, not a query performance problem. The first
command must be `{{CLI}} auth status --format json`. Do not suggest query
timeout changes, session variables, DDL, or performance tuning until
connectivity (mysql_status, http_status) is confirmed working. If the cluster
is Cloud-managed, also check `{{CLI}} cloud cluster get - [...]
+If the user reports that `doriscli sql`, `doriscli profile get`, or any CLI
command **fails, times out, or returns an error**, treat it as a
connection-layer problem first, not a query performance problem. The first
command must be `doriscli auth status --format json`. Do not suggest query
timeout changes, session variables, DDL, or performance tuning until
connectivity (mysql_status, http_status) is confirmed working. If this is a
storage-compute (cloud mode) deployment, also confirm co [...]
---
@@ -57,50 +59,50 @@ Use the most specific available path:
#### When user provides a query_id
-1. `{{CLI}} profile get <query_id> --format json`
+1. `doriscli profile get <query_id> --format json`
2. If profile fetch fails, still treat the failed fetch as evidence. Check
whether the failure looks like eviction, wrong HTTP port, or connectivity; see
Failure Handling below
-3. For each table in `scanned_tables`, run `{{CLI}} tablet <db.table> --detail
--format json`
+3. For each table in `scanned_tables`, run `doriscli tablet <db.table>
--detail --format json`
4. If `tablet` shows stale stats (zero rows on a known-populated table), note
this as a caveat — metadata can lag on new or tiny tables
5. Gather evidence, form hypotheses, recommend next checks
#### When user provides SQL but no query_id
-1. Check recent runs first: `{{CLI}} profile list --format json` — scan for
matching SQL text
+1. Check recent runs first: `doriscli profile list --format json` — scan for
matching SQL text
2. If a recent run exists, use that profile instead of re-executing
-3. If no recent run exists, run `{{CLI}} sql "EXPLAIN <query>" --format json`
+3. If no recent run exists, run `doriscli sql "EXPLAIN <query>" --format json`
4. If SQL looks expensive, unfamiliar, during peak hours, or could scan/join
large tables, show the EXPLAIN evidence and ask before profiled execution
-5. Only after the safety gate passes: `{{CLI}} sql "<query>" --profile
--format json`
+5. Only after the safety gate passes: `doriscli sql "<query>" --profile
--format json`
6. Extract `query_id` from response → continue with profile workflow above
#### When user gives only a vague symptom
-1. `{{CLI}} auth status --format json` — verify MySQL and HTTP connectivity
before query diagnosis
-2. `{{CLI}} profile list --active --format json` — look for currently
slow/running queries
-3. `{{CLI}} profile list --limit 20 --format json` — look for recent slow
profiles
-4. If the user gives a workload keyword but not exact SQL, optionally use
`{{CLI}} profile history "<keyword>" --days 7 --format json`
+1. `doriscli auth status --format json` — verify MySQL and HTTP connectivity
before query diagnosis
+2. `doriscli profile list --active --format json` — look for currently
slow/running queries
+3. `doriscli profile list --limit 20 --format json` — look for recent slow
profiles
+4. If the user gives a workload keyword but not exact SQL, optionally use
`doriscli profile history "<keyword>" --days 7 --format json`
5. If nothing relevant appears, ask for query_id, SQL text, dashboard/report
name, time window, and environment. Do not invent a query or DDL
#### When investigating a running query
-1. `{{CLI}} profile list --active --format json` — find active queries
+1. `doriscli profile list --active --format json` — find active queries
2. Wait for completion or ask user if they want to cancel/let it finish
3. Then fetch the profile with `profile get`
#### Regression comparison
-- `{{CLI}} profile diff <slow_qid> <fast_qid> --format json` —
operator-by-operator delta
+- `doriscli profile diff <slow_qid> <fast_qid> --format json` —
operator-by-operator delta
- Focus on operators where time or rows changed significantly
#### Trend analysis
-- `{{CLI}} profile history "<sql_pattern>" --days 7 --format json` — p50/p99
over time
+- `doriscli profile history "<sql_pattern>" --days 7 --format json` — p50/p99
over time
- Look for gradual growth (data volume) vs sudden jump (schema change,
partition change, cluster change)
#### When profile is unavailable or was not enabled
-1. Try `{{CLI}} profile list --limit 20 --format json` for recent profiles
-2. Try `{{CLI}} profile history "<sql_pattern>" --days 7 --format json` if a
query pattern is known
-3. Use `{{CLI}} sql "EXPLAIN <query>" --format json` and `{{CLI}} tablet
<db.table> --detail --format json` as fallback evidence
+1. Try `doriscli profile list --limit 20 --format json` for recent profiles
+2. Try `doriscli profile history "<sql_pattern>" --days 7 --format json` if a
query pattern is known
+3. Use `doriscli sql "EXPLAIN <query>" --format json` and `doriscli tablet
<db.table> --detail --format json` as fallback evidence
4. Ask to rerun with `--profile` only after the safety gate passes
---
@@ -149,7 +151,7 @@ Each mapping below is a **hypothesis**, not a guaranteed
root cause. Present as:
- **Evidence**: scan operator `selectivity` >> 100, or `total_scan_rows`
vastly exceeds output rows
- **Likely**: sort key does not match the query's primary filter columns —
Doris scans more data than necessary
-- **Check next**: `{{CLI}} tablet <table>` → compare `sort_key` columns with
the query's WHERE clause. Also check if an inverted index or BloomFilter could
help
+- **Check next**: `doriscli tablet <table>` → compare `sort_key` columns with
the query's WHERE clause. Also check if an inverted index or BloomFilter could
help
- **Possible fix**: move the high-selectivity filter column to sort key
position 1 → `schema-keys-selectivity-first`. Or add a secondary index
- **Not always this**: high selectivity can also result from stale column
stats causing bad partition pruning, or from querying across many partitions
where the filter is only selective within each partition
@@ -165,7 +167,7 @@ Each mapping below is a **hypothesis**, not a guaranteed
root cause. Present as:
- **Evidence**: `shuffle_bytes` is large on a JOIN operator, or `join_type` =
shuffle for a table that could use broadcast or colocation
- **Likely**: tables are not colocated and the smaller side exceeds the
broadcast threshold
-- **Check next**: check both tables' bucket keys and counts with `{{CLI}}
tablet`. Check dimension table size (< 1GB usually broadcasts automatically)
+- **Check next**: check both tables' bucket keys and counts with `doriscli
tablet`. Check dimension table size (< 1GB usually broadcasts automatically)
- **Possible fix**: for small dimensions (< 1GB), ensure broadcast join +
runtime filter are working. For large repeated joins, align bucket keys and
counts for colocation → `usecase-star-schema-join`
- **Not always this**: shuffle is sometimes correct for large-large joins. The
cost may also be dominated by the scan, not the shuffle itself
@@ -208,9 +210,9 @@ Each mapping below is a **hypothesis**, not a guaranteed
root cause. Present as:
#### Profile fetch fails
- Profile may be evicted (Doris keeps profiles for a limited time, often 5-15
minutes)
-- HTTP port may be misconfigured: Cloud mode uses 8080, self-hosted Doris uses
8030. `{{CLI}} auth status` shows `http_status` and `http_probe`
+- HTTP port may be misconfigured: Cloud mode uses 8080, self-hosted Doris uses
8030. `doriscli auth status` shows `http_status` and `http_probe`
- FE may be unreachable: check the `served_by` and `fetch_attempts` fields in
the error response
-- **Recovery**: if profile is evicted, try `profile history` for trend data
and `profile list` for nearby recent runs. Ask user to re-run with `--profile`
only after the safety gate passes. If HTTP port is wrong, guide user to fix
with `{{CLI}} auth add` (re-add environment with correct `--http-port`)
+- **Recovery**: if profile is evicted, try `profile history` for trend data
and `profile list` for nearby recent runs. Ask user to re-run with `--profile`
only after the safety gate passes. If HTTP port is wrong, guide user to fix
with `doriscli auth add` (re-add environment with correct `--http-port`)
#### Tablet metadata lag
@@ -225,15 +227,15 @@ Each mapping below is a **hypothesis**, not a guaranteed
root cause. Present as:
#### Connection fails during investigation
-- If `{{CLI}} auth status` shows `mysql_status: unreachable` or `http_status:
unreachable`, diagnose the connection before continuing with query investigation
-- For Cloud environments: cluster may be Suspended — check with `{{CLI}} cloud
cluster get`
+- If `doriscli auth status` shows `mysql_status: unreachable` or `http_status:
unreachable`, diagnose the connection before continuing with query investigation
+- For storage-compute (cloud mode) deployments: compute may be suspended or
scaled to zero — `doriscli auth status` shows `backends`; an empty or
all-not-`alive` list means compute is unavailable. Resume it from your
cluster-management console
- For BYOC: may need SOCKS5 proxy
---
-### Fallback SQL (When {{CLI}} is unavailable)
+### Native SQL + HTTP path (always available)
-Use these via `mysql` client or any MySQL-compatible tool:
+This path needs no special tooling — Apache Doris speaks the MySQL protocol.
Use these via the `mysql` client or any MySQL-compatible tool:
```sql
-- Table structure
diff --git a/skills/doris-best-practices/references/start-self-hosted.md
b/skills/doris-best-practices/references/start-self-hosted.md
index 8b922f0..82116a5 100644
--- a/skills/doris-best-practices/references/start-self-hosted.md
+++ b/skills/doris-best-practices/references/start-self-hosted.md
@@ -8,7 +8,7 @@ tags: [start, self-hosted, byoc, on-prem, setup]
**MySQL host + port**: the FE node address and MySQL protocol port. Default
port is `9030`. Check `fe.conf` → `mysql_service_port` or ask your DBA.
-**HTTP port**: FE HTTP port for profile fetching and REST API. Self-hosted
Apache Doris uses `8030` by default; {{CLOUD_PRODUCT_NAME}} uses `8080`. Check
`fe.conf` → `http_port`. If unsure, `{{CLI}} auth add` probes 8080/8030/8040
and shows suggestions.
+**HTTP port**: FE HTTP port for profile fetching and REST API. Self-hosted
Apache Doris uses `8030` by default; Apache Doris cloud mode uses `8080`. Check
`fe.conf` → `http_port`. If unsure, `doriscli auth add` probes 8080/8030/8040
and shows suggestions.
**User + password**: default root user is `root` with empty password on fresh
installs. Production clusters should have a password set via `SET PASSWORD`.
@@ -23,14 +23,14 @@ tags: [start, self-hosted, byoc, on-prem, setup]
3. Register BEs with FE: `ALTER SYSTEM ADD BACKEND "<be_host>:9050";`
4. Create database and tables
-### Connect with VeloCLI (Preferred)
+### Connect with the doriscli management CLI (optional)
```bash
-{{CLI}} auth add local --host <fe_host> --port 9030 --http-port 8030 --user
root --password "$DORIS_PASSWORD"
-{{CLI}} use local
-{{CLI}} auth status --format json
+doriscli auth add local --host <fe_host> --port 9030 --http-port 8030 --user
root --password "$DORIS_PASSWORD"
+doriscli use local
+doriscli auth status --format json
```
-### Connect via MySQL Client
+### Connect via MySQL client (always available)
```bash
mysql -h <fe_host> -P 9030 -u root -p"$DORIS_PASSWORD"
```
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]