This is an automated email from the ASF dual-hosted git repository. zclllyybb pushed a commit to branch profile in repository https://gitbox.apache.org/repos/asf/doris-skills.git
commit 138235960eb88626dea7f92a1898244a0e8fff06 Author: zhaochangle <[email protected]> AuthorDate: Mon Jun 1 17:12:07 2026 +0800 Add doris-profile-reader --- skills/doris-profile-reader/SKILL.md | 51 +++++ skills/doris-profile-reader/agents/openai.yaml | 21 ++ .../references/counter-semantics.md | 244 +++++++++++++++++++++ .../references/join-order-diagnosis.md | 111 ++++++++++ .../references/operator-guide.md | 140 ++++++++++++ .../references/reading-workflow.md | 135 ++++++++++++ .../references/runtime-filters.md | 92 ++++++++ .../scripts/extract_source_profile_inventory.py | 158 +++++++++++++ 8 files changed, 952 insertions(+) diff --git a/skills/doris-profile-reader/SKILL.md b/skills/doris-profile-reader/SKILL.md new file mode 100644 index 0000000..6b7057e --- /dev/null +++ b/skills/doris-profile-reader/SKILL.md @@ -0,0 +1,51 @@ +--- +name: doris-profile-reader +description: Interpret Apache Doris query runtime profiles, especially profile bottleneck triage, misleading wait counters, per-operator metric priority, scan, join-order/runtime-filter analysis, and evidence-bounded performance explanations. Use when given a Doris profile, query id, profile URL/text, or a request to explain Doris query performance. +--- + +# Doris Profile Reader + +## Purpose + +Use this skill to identify the real bottleneck in an Apache Doris query runtime profile. The core rule is to separate active work from dependency, queue, and backpressure waits before naming an operator as expensive. When the plan contains joins, also separate the immediate runtime bottleneck from the plan-shape cause, especially bad join order and runtime-filter direction. + +## Required Reading Order + +1. Read `references/reading-workflow.md` for the analysis workflow and output contract. +2. Read `references/counter-semantics.md` for counter meaning and priority, especially wait counters. +3. Read `references/operator-guide.md` for the relevant operator family. +4. Read `references/join-order-diagnosis.md` when the profile or plan has multiple joins, a large hash/nested-loop build, a large scan that might have been pruned by a join, paired fast/slow plans, hints/reordered SQL, or a request about join shape/reorder. +5. Read `references/runtime-filters.md` when a profile or plan includes `RuntimeFilterInfo`, `RF... <-`, `RF... ->`, `JRFs`, `WaitForRuntimeFilter`, or `AcquireRuntimeFilter`. +6. Use `references/source-profile-inventory.md` as the source-backed operator/counter inventory. If a counter or operator is not in the narrative docs, do not ignore it; look it up in this inventory and classify it by the rules in `counter-semantics.md`. + +## Non-Negotiable Interpretation Rules + +- Do not call `WaitForDependencyTime`, `WaitForDependency[...]Time`, `WaitForData0`, `WaitForDataN`, `WaitForRpcBufferQueue`, `WaitForBroadcastBuffer`, `PendingFinishDependency`, or pipeline blocked/wait counters direct operator compute cost. They are dependency, data-arrival, queue, memory, or backpressure signals. +- Do not rank operators by the largest wait-like counter alone. First rank by `ExecTime`/active timers, direct custom timers, rows/bytes, memory/spill, and skew. +- Treat merged-profile `sum` timers as accumulated across parallel instances. A timer can exceed query elapsed time and still be normal when many scanners, drivers, or fragments run in parallel. +- For scans, prioritize `RowsRead`, `ScanRows`, `ScanBytes`, `ScannerCpuTime`, `ScannerGetBlockTime`, `ScannerWorkerWaitTime`, I/O/decompression timers, predicate/lazy-read timers, and row-filter counters. `ScannerWorkerWaitTime` is important, but it indicates scanner scheduling/thread-pool wait rather than scan CPU. +- For runtime filters, distinguish source/build side from target/probe scan side. In plan text, `RFxxx <- expr` is produced from the build side; `RFxxx -> expr` is applied at a target scan. In profiles, `RuntimeFilterInfo` and scan-side wait/filter counters decide whether the RF helped or just waited. +- For joins, always identify build side and probe/target side before judging the order. A scan can be the immediate active bottleneck while the root cause is still bad join order if a selective join/RF source is scheduled too late to prune that scan. +- Do not require a paired fast profile before naming likely bad join order. A single slow profile can be enough when it proves large wasted build/scan work before an empty probe/result, an RF source side that had to scan massively before it could emit an empty/tiny filter, or a huge intermediate later eliminated by a highly selective or contradictory join predicate. +- Do not treat "the current RF made other scans wait and then skip" as proof that the join order is good. If producing that empty/tiny RF required the only expensive scan, the RF source/target choice is itself the join-order question. +- When a large build/source side is paid before an empty/tiny probe, preserved side, semi-join key set, or contradictory join can eliminate the result, call the build/probe order likely bad unless the profile proves that ordering is semantically forced and no earlier pruning/short-circuit is possible. +- If a strong single-profile join-order pattern matches, do not hedge as "suspicious", "close to likely bad", or "not proven". Use `likely bad` when a better legal order still needs validation, and reserve `not proven` for the exact alternate shape, not for the join-order diagnosis itself. +- When a join query has plan-shape evidence, the answer must explicitly judge join order/build-probe/RF direction as good, suspicious, or bad. Do not replace that judgment with a vague "predicate issue" or "plan shape issue". +- A long `InitTime`, `OpenTime`, `CloseTime`, or profile total can matter, but only after confirming it is not accumulated across many instances and not dominated by a known wait/dependency branch. + +## Standard Answer Shape + +When explaining a profile, answer in this order: + +1. `Conclusion`: one or two sentences naming the likely bottleneck and, when joins are involved, whether the plan shape/join order is likely the cause. +2. `Evidence`: profile counters with operator names, values, and whether each is active work, data volume, wait/backpressure, memory/spill, or runtime-filter evidence. +3. `Reasoning`: how the evidence maps to the execution plan, which side is build/probe or RF source/target, why misleading counters are discounted, and whether the join order is reasonable. +4. `Next checks`: the smallest additional profile/log/code checks needed if the conclusion is still uncertain. + +Always preserve uncertainty. Use "proven", "likely", and "not proven" explicitly when the profile lacks enough detail. + +## Scripts + +- `scripts/extract_source_profile_inventory.py`: scan Doris source for factory-created operators and counter/info registrations. + +Scripts are evidence-generation helpers. They do not replace the reading workflow. diff --git a/skills/doris-profile-reader/agents/openai.yaml b/skills/doris-profile-reader/agents/openai.yaml new file mode 100644 index 0000000..34aa432 --- /dev/null +++ b/skills/doris-profile-reader/agents/openai.yaml @@ -0,0 +1,21 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +interface: + display_name: "Doris Profile Reader" + short_description: "Interpret Doris profiles, joins, and waits" + default_prompt: "Use $doris-profile-reader to analyze this Apache Doris query profile, discount misleading wait counters, identify the real bottleneck, and judge join order/build-probe/runtime-filter shape with counter evidence." diff --git a/skills/doris-profile-reader/references/counter-semantics.md b/skills/doris-profile-reader/references/counter-semantics.md new file mode 100644 index 0000000..7a254a1 --- /dev/null +++ b/skills/doris-profile-reader/references/counter-semantics.md @@ -0,0 +1,244 @@ +# Counter Semantics + +This file defines how to interpret every counter class. For individual counter names and source locations, use `source-profile-inventory.md`. + +## Universal Priority + +High-signal counters: + +- `ExecTime`: time inside the operator. In merged profiles, check `max` and per-instance detail before comparing to query elapsed. Before calling it active CPU, qualify it with child wait evidence such as runtime-filter wait, data-arrival wait, dependency wait, scanner worker wait, or spill task queue wait. Do not mechanically subtract wait counters from `ExecTime`; scan-level `WaitForRuntimeFilter`, per-filter `RFx WaitTime`, scanner child timers, and merged operator timers may overlap [...] +- `InputRows`, `RowsProduced`, `RowsRead`, `ScanRows`, `ScanBytes`, bytes sent/received: data volume that explains active time. +- Operator-specific direct timers: scan CPU/I/O/decompression, hash table build/probe, aggregation build/merge, sort, analytic evaluation, serialization/compression, spill read/write. +- `MemoryUsage`, `MemoryUsagePeak`, `MemoryUsageHashTable`, `MemoryUsageBuildBlocks`, `MemoryUsageSortBlocks`: resource footprint. +- Spill counters: direct evidence that memory pressure changed the execution path. + +Low-directness counters: + +- `WaitForDependencyTime` and `WaitForDependency[...]Time`: dependency wait. Important for pipeline causality, not direct compute. +- `WaitForData0` and `WaitForDataN`: exchange receiver queue data wait. Usually upstream did not produce data yet. +- `WaitForRpcBufferQueue`, `WaitForBroadcastBuffer`, `WaitForLocalExchangeBufferN`: downstream or channel backpressure. +- `PendingFinishDependency`: sink finish dependency. Do not call it sink work. +- `WaitWorkerTime`, `BlockedByDependency`, `MemoryReserveTimes`, `MemoryReserveFailedTimes`: scheduler/dependency/memory state. +- `FirstBatchArrivalWaitTime`, `DataArrivalWaitTime`: data-arrival symptom; use with upstream operator evidence. +- FE `Plan Time`, `Schedule Time`, `Wait and Fetch Result Time`, and client/result fetch time: query lifecycle or client-facing timing. They can explain end-to-end latency, but do not rank them as BE operator active work. + +## Common Operator Counters + +- `InitTime`: local state/profile initialization. Usually not the bottleneck unless large in per-instance detail. +- `OpenTime`: opening child/local state/resources. Can matter for external scans/sinks or cold metadata, but verify with custom counters. +- `ExecTime`: primary active operator timer. +- `CloseTime`: close/finalization. For sinks it may include writer close, publish, commit, or file close; inspect custom sink timers. +- `ProjectionTime`: expression/projection after core operator work. Important if high with many output columns or expensive expressions. +- `RowsProduced`: source/operator output rows. +- `InputRows`: sink input rows. +- `BlocksProduced`, `OutputBlockBytes`, `MaxOutputBlockBytes`, `MinOutputBlockBytes`: vectorized block output volume and block-size skew. +- `MemoryUsage`, `MemoryUsagePeak`: current/peak memory tracked by the operator. +- `IsShuffled`, `IsColocate`, `FollowedByShuffledOperator`: plan/execution metadata, not cost by itself. +- `PendingFinishDependency`: wait for sink finalization dependency, not active work. + +## Pipeline Task Counters + +- `TaskCpuTime`: accumulated CPU attributed to a pipeline task. It can exceed wall time when summed. +- `ExecuteTime`: task execution loop time. +- `PrepareTime`, `OpenTime`, `GetBlockTime`, `SinkTime`, `CloseTime`: task lifecycle partitions. +- `WaitWorkerTime`: time waiting for worker scheduling. +- `NumScheduleTimes`, `NumYieldTimes`, `CoreChangeTimes`: scheduling behavior and task migration/yield signals. +- `MemoryReserveTimes`, `MemoryReserveFailedTimes`: memory reservation pressure; pair with blocked time and memory peaks. +- `BlockedByDependency`: count/state of dependency blocking, not CPU. + +## Time Counter Name Patterns + +- `*Time`, `*Timer`, `*Costs`: elapsed or accumulated time for the named action. Determine whether it is direct work or wait by the action name. +- `Wait*`, `*Wait*`, `Blocked*`, `Pending*`: wait/dependency/backpressure unless source code proves otherwise. +- `*CpuTime`: CPU-side work. Can be accumulated across threads. +- `*IOTimer`, `*ReadFileTime`, `*WriteFileTime`, `FileWriteTime`: I/O path work or I/O wait. Use bytes and cache counters to interpret. +- `*Serialize*`, `*Deserialize*`, `CompressTime`, `DecompressorTimer`: CPU transformation cost. +- `*EvalTime`, `ExprTime`, `ProjectionTime`, `VectorPredEvalTime`: expression or predicate CPU. + +## Count/Volume Counter Name Patterns + +- `*Rows`, `*RowsRead`, `*InputRows`, `*RowsProduced`: row volume. +- `*Bytes`, `*MemBytes`, `ScanBytes`, `CompressedBytesRead`, `UncompressedBytesRead`: byte volume or memory/storage size. +- `*Count`, `*Num`, `*Times`: event count. +- `*Filtered`, `*FilteredRows`, `Rows*Filtered`: rows skipped or filtered. High values can be good if they avoid downstream work. +- `*CacheHit*`, `*CacheMiss*`: cache effectiveness. Interpret with I/O timers. +- `*Peak`: peak observed value. + +## Scan Counters + +Common scan: + +- `RowsRead`: rows returned by the scanner after scan-side pruning/filtering. +- `ScanRows`: rows scanned/read from storage or source. Larger than `RowsRead` means scan-side filtering/pruning removed rows. +- `ScanBytes`: bytes scanned. In OLAP scan source comments this represents uncompressed bytes from local plus remote reads. +- `NumScanners`: scanner instances created. +- `ScannerCpuTime`: direct scanner CPU. Primary scan bottleneck timer. +- `ScannerGetBlockTime`: time getting blocks from scanners. +- `ScannerFilterTime`: scan-side filter time. +- `ScannerWorkerWaitTime`: time scanner work waited in the scan worker pool. High value indicates scanner scheduling/thread-pool contention or too much scan concurrency, not scan CPU. +- `WaitForRuntimeFilter`, `AcquireRuntimeFilter`, and `RuntimeFilterInfo/RFx WaitTime`: scan-side wait before or during scan startup/filter application. These can make scan `ExecTime` look large even when storage CPU/I/O is small. +- `RFx InputRows`, `RFx FilterRows`, and `RFx AlwaysTrueFilterRows`: per-filter counters, often accumulated across parallel scan instances and filters. They may exceed a table's final scan rows when summed across filters/instances. Compare them per filter and against `ScanRows`, `RowsRead`, and target-side output; do not add them across filters as if they were disjoint. +- `MaxScanConcurrency`, `MinScanConcurrency`, `RunningScanner`: concurrency and skew. +- `ConditionCacheHit`, `ConditionCacheFilteredRows`: condition cache effectiveness. +- `PerScannerRunningTime`, `PerScannerRowsRead`, `PerScannerWaitTime`, `PerScannerProjectionTime`, `PerScannerPrepareTime`, `PerScannerOpenTime`: per-scanner skew and phase evidence. + +OLAP scan storage path: + +- `ReaderInitTime`, `ScannerInitTime`, `ProcessConjunctTime`: scan setup and predicate preparation. +- `BlockLoadTime`, `BlocksLoad`, `BlockFetchTime`, `BlockInitTime`, `BlockInitSeekTime`, `BlockInitSeekCount`: segment/block loading and seek overhead. +- `CompressedBytesRead`, `UncompressedBytesRead`, `IOTimer`, `DecompressorTimer`: storage I/O and decompression cost. +- `PredicateColumnReadTime`, `NonPredicateColumnReadTime`, `OutputColumnTime`, `LazyReadTime`, `LazyReadSeekTime`: column read and late materialization cost. +- `VectorPredEvalTime`, `ShortPredEvalTime`, `ExprFilterEvalTime`: predicate expression CPU. +- `RowsStatsFiltered`, `RowsKeyRangeFiltered`, `RowsConditionsFiltered`, `RowsZoneMapRuntimePredicateFiltered`, `RowsBloomFilterFiltered`, `RowsDelFiltered`, `SegmentDictFiltered`: pruning and delete/filter effectiveness. Runtime filters, especially min/max or storage-pushed filters, can show their benefit here even when a per-filter `FilterRows` counter is zero or misleading. +- `GenerateRowRangeByKeysTime`, `GenerateRowRangeByColumnConditionsTime`, `GenerateRowRangeByBloomFilterIndexTime`, `GenerateRowRangeByZoneMapIndexTime`, `GenerateRowRangeByDictTime`: index/range construction cost. +- Inverted-index counters (`InvertedIndex*`, `RowsInvertedIndexFiltered`): index reader/search/filtering cost and effectiveness. +- ANN counters (`AnnIndex*`, `AnnIvf*`): vector index search, conversion, cache, fallback, and filtering. +- Variant counters (`Variant*`, sparse-column counters): variant/sparse subcolumn materialization and filtering. +- `AdaptiveBatchPredictMaxRows`, `AdaptiveBatchPredictMinRows`: adaptive batch sizing metadata, not direct cost. + +File/external scan: + +- `FileScannerGetBlockTime`, `FileScannerPreFilterTimer`, `ApplyAllRuntimeFilters`: file scanner work and RF application. +- `FileNumber`, `EmptyFileNum`, `BatchSplitMode`: split/file shape. +- Connector-specific read/open/list counters should be treated as remote I/O or metadata cost; verify with bytes and file count. + +## Join Counters + +Hash join build sink: + +- `BuildTime`, `BuildHashTableTime`, `BuildTableInsertTime`, `BuildExprCallTime`, `MergeBuildBlockTime`: direct build-side work. +- `InputRows`, `HashTableSize`, `MemoryUsageHashTable`, `MemoryUsageBuildBlocks`, `MemoryUsageBuildKeyArena`: build-side size and memory. +- `PublishTime`, `RuntimeFilterInfo`: runtime filter publication. +- `BroadcastJoin`, `JoinType`, `ShareHashTableEnabled`, `BuildShareHashTable`, `SkipProcess`: plan/state metadata. +- `AsofIndexExprTime`, `AsofIndexSortTime`, `AsofIndexGroupTime`: ASOF join index preparation cost. + +Hash join probe: + +- `ProbeRows`, `ProbeIntermediateRows`: probe-side volume and intermediate expansion. +- `ProbeExprCallTime`, `ProbeWhenSearchHashTableTime`, `ProbeWhenProbeSideOutputTime`, `ProbeWhenBuildSideOutputTime`, `NonEqualJoinConjunctEvaluationTime`, `JoinFilterTimer`, `BuildOutputBlock`, `FinishProbePhaseTime`: direct probe/output/filter work. +- `MemoryUsageProbeKeyArena`: probe key memory. +- `WaitForDependency[HASH_JOIN_BUILD_DEPENDENCY]Time` on probe is waiting for build readiness. It is not probe CPU. + +Partitioned hash join: + +- `Spilled`, `SpillBuildTime`, `SpillProbeTime`, `SpillRePartitionTime`, `SpillBuildRows`, `SpillProbeRows`, `SpillRecovery*`, `SpillMaxPartitionLevel`, `SpillTotalPartitions`: partitioned join spill path and recovery. +- `BuildRows`, `ProbeBlocksBytesInMem`, `SpillInMemRow`: partitioned memory/volume shape. + +Nested loop/cross join: + +- `LoopGenerateJoin`, `JoinConjunctsEvaluationTime`, `FilteredByJoinConjunctsTime`, `OutputTempBlocksTime`, `UpdateVisitedFlagsTime`: direct nested-loop work. +- Large probe/intermediate rows are usually more important than wait counters. + +## Aggregation and Set Counters + +Aggregation sink/source: + +- `BuildTime`, `HashTableComputeTime`, `HashTableEmplaceTime`, `HashTableInputCount`, `HashTableSize`: primary group-by/hash aggregation cost. +- `MergeTime`, `DeserializeAndMergeTime`, `GetResultsTime`, `HashTableIterateTime`, `InsertKeysToColumnTime`, `InsertValuesToColumnTime`: merge/output phase. +- `ExprTime`, `DoLimitComputeTime`: expression and limit work. +- `MemoryUsageHashTable`, `MemoryUsageArena`, `MemoryUsageContainer`, `MemoryUsageSerializeKeyArena`: aggregation memory. +- `Spilled`, `SpillSerializeHashTableTime`, `SpillTotalTime`, `SpillTotalPartitions`, `SpillMaxPartitionLevel`: spill path. + +Streaming/distinct aggregation: + +- `StreamingAggTime`: streaming aggregation work. +- `BuildTime`, `HashTableComputeTime`, `HashTableEmplaceTime`, `HashTableInputCount`, `HashTableSize`: distinct/group state. + +Set operators: + +- `BuildTime` on `SET_SINK_OPERATOR`: build set/hash table. +- `ProbeTime`, `ExtractProbeDataTime` on `SET_PROBE_SINK_OPERATOR`: probe/extract set results. +- `GetDataTime`, `FilterTime`, `GetDataFromHashTableRows` on intersect/except: set output/filter work. + +## Sort, Window, and Materialization Counters + +Sort: + +- `SortTime`, `AppendBatchTime`, `AppendBlockTime`, `MergeGetNext`, `MergeGetNextBlock`, `MemoryUsageSortBlocks`: direct sort and output work. +- Spill sort uses `SpillSort*`, `SpillMergeSortTime`, and common spill counters. +- `WaitForDependency[SORT_SOURCE_DEPENDENCY]Time` generally means waiting for the sink to finish sorting. + +Partition sort: + +- `HashTableBuildTime`, `SortedDataTime`, `SortedPartitionInputRows`, `SortedPartitionOutputRows`, `GetSortedTime`: partitioned/window sort setup and output. + +Analytic/window: + +- `ComputePartitionByTime`, `ComputeOrderByTime`, `ComputeAggDataTime`, `ComputeRangeBetweenTime`, `EvaluationTime`, `PartitionSearchTime`, `OrderSearchTime`, `RemoveRowsTime`: direct window function work. +- `Blocks`, `BlocksPeak`, `RemoveRows`, `RemoveCount`, `streaming mode`: window state and mode. + +Materialization/select/repeat/table function: + +- `ProjectionTime`, expression/evaluation timers, `RowsProduced`, and output bytes are the main proof. +- `MATERIALIZATION_OPERATOR` can be a top/lazy materialization node that fetches deferred columns by row id through `multiget_data_v2` RPCs. `MaxRpcTime` is the max elapsed RPC round-trip for a batch, not local projection CPU. Treat it as late row-id fetch latency unless child `RowIDFetcher` profiles, fetched rows/bytes, `MergeResponseTime`, and output row count prove a materialization-heavy workload. +- Do not rank a top/lazy materialization node above upstream scan/join/aggregation/memory evidence when it only materializes a small final row set. Trace `row_ids` or row-id table info back to the scan/join branch that produced those row ids, then report materialization separately as final-column fetch latency. +- `Repeat` expands grouping sets; compare input/output rows. +- `TableFunction` can multiply rows; compare produced rows and expression time. + +## Exchange and Network Counters + +Exchange source: + +- `GetDataFromRecvrTime`, `DeserializeRowBatchTimer`, `DecompressTime`, `FilterTime`, `CreateMergerTime`, `MergeGetNext`, `MergeGetNextBlock`: direct receive/merge/decode work. +- `LocalBytesReceived`, `RemoteBytesReceived`, `DecompressBytes`, `BlocksProduced`: data volume. +- `WaitForData0`, `WaitForDataN`, `DataArrivalWaitTime`, `FirstBatchArrivalWaitTime`, `SendersBlockedTotalTimer(*)`, `MaxWaitForWorkerTime`, `MaxWaitToProcessTime`: wait/backpressure/scheduling signals. + +Exchange sink: + +- `SerializeBatchTime`, `CompressTime`, `DistributeRowsIntoChannelsTime`, `SplitBlockHashComputeTime`, `LocalSendTime`, `MergeBlockTime`, `SendNewPartitionTime`, `AddPartitionRequestTime`: direct send/shuffle work. +- `BytesSent`, `LocalBytesSent`, `LocalSentRows`, `UncompressedRowBatchSize`, `BlocksProduced`, `OverallThroughput`: volume/throughput. +- `RpcMaxTime`, `RpcAvgTime`, `RpcMinTime`, `RpcSumTime`, `RpcCount` on `DATA_STREAM_SINK_OPERATOR`: remote send RPC latency. A high max/avg with meaningful bytes and small serialize/compress time points to network, receiver, or RPC path latency, not local sink CPU. +- `WaitForRpcBufferQueue`, `WaitForBroadcastBuffer`, `WaitForLocalExchangeBufferN`: downstream backpressure or queue capacity wait. + +Local exchange and multicast: + +- `ComputeHashValueTime`, copy/channel push/pop timers, local bytes/rows, and buffer waits explain in-BE reshuffle or fanout. +- Long local exchange wait usually reflects receiver/sender imbalance, not compute in the local exchange itself. + +## Sink and Writer Counters + +Result/file/mysql/flight sinks: + +- `AppendBatchTime`, `BytesSent`, `FileWriteTime`, `FileWriterCloseTime`, `ConvertBlockToArrowBatchTime`: direct output cost. +- Result sinks can be slow because clients fetch slowly; look for send/write wait and bytes. + +OLAP/table/load sinks: + +- `AppendNodeChannelTime`, `CloseWaitTime`, `CloseWriterTime`, `CloseLoadTime`, `CommitTxnTime`, `BuildRowsetTime`, `AddPartitionRequestTime`: write/load/finalization phases. +- High close/commit is sink-side only if custom writer timers agree. + +External table sinks: + +- Hive/Iceberg/MaxCompute/JDBC/TVF counters represent connector write, file write, delete-file write, and close/commit costs. Use file counts, bytes, writer close/write timers, and connector-specific errors/logs. +- Iceberg delete/merge/sort spill counters show delete file generation, merge output, and spill path. + +Blackhole/cache/memory scratch/dictionary sinks: + +- Blackhole `BytesProcessed` is volume consumed, not output I/O. +- Cache counters show cache fill/read size and tablet id metadata. +- Memory scratch conversion timers show in-memory Arrow/result conversion. +- Dictionary sink counters should be read as dictionary build/publish/write path. + +## Runtime Filter Counters + +See `runtime-filters.md`. In short: + +- `RuntimeFilterInfo` on join/build/sink shows produced filters and target metadata. +- `WaitForRuntimeFilter` and `AcquireRuntimeFilter` on scans/sources are waits for filters. +- `RFx InputRows`, `RFx FilterRows`, `RFx AlwaysTrueFilterRows`, `RFx WaitTime` tell whether each RF was useful. +- Waiting for a runtime filter is worthwhile only if it filters enough scan rows or prevents expensive downstream work. +- Runtime-filter wait counters can appear on scans, multicast/source operators, or dynamic/top-N filter paths. Treat them as wait/synchronization evidence first, then use the plan and detail profile to find the producer and target. + +## Spill Counters + +Common spill counters are direct evidence of a different execution path: + +- `SpillTotalTime`, `SpillWriteTime`, `SpillRecoverTime`: overall spill cost. +- `SpillWriteTaskWaitInQueueTime`, `SpillReadTaskWaitInQueueTime`: spill task queue wait. +- `SpillWriteFileTime`, `SpillReadFileTime`: spill file I/O. +- `SpillWriteSerializeBlockTime`, `SpillReadDeserializeBlockTime`: CPU transform cost. +- `SpillWriteBlockCount`, `SpillWriteBlockBytes`, `SpillWriteRows`, `SpillWriteFileBytes`, `SpillWriteFileTotalCount`: write volume. +- `SpillReadBlockCount`, `SpillReadBlockBytes`, `SpillReadRows`, `SpillReadFileBytes`, `SpillReadFileCount`: read volume. +- `SpillMaxRowsOfPartition`, `SpillMinRowsOfPartition`, `SpillMaxPartitionLevel`, `SpillTotalPartitions`: partition skew and spill fanout. + +If spill exists, it is usually more important than generic wait counters. diff --git a/skills/doris-profile-reader/references/join-order-diagnosis.md b/skills/doris-profile-reader/references/join-order-diagnosis.md new file mode 100644 index 0000000..ca82767 --- /dev/null +++ b/skills/doris-profile-reader/references/join-order-diagnosis.md @@ -0,0 +1,111 @@ +# Join Order Diagnosis + +Use this file after the normal bottleneck pass. Keep the skill's main discipline: first prove the immediate runtime cost, then decide whether that cost is a symptom of a bad join order, bad build/probe choice, or late/missing runtime filter. + +## 1. Identify Build, Probe, Source, and Target + +Hash join: + +- `HASH_JOIN_SINK_OPERATOR` is the build side. Its decisive counters are `InputRows`/`BuildRows`, `BuildTime`, `BuildHashTableTime`, `BuildTableInsertTime`, `MemoryUsageHashTable`, `MemoryUsageBuildBlocks`, and `RuntimeFilterInfo`. +- `HASH_JOIN_OPERATOR` is the probe side. Its decisive counters are `ProbeRows`, `ProbeIntermediateRows`, probe expression/search/output timers, and final `RowsProduced`. +- In non-pipeline or older profiles, a single `VHASH_JOIN_NODE` may contain both sides. Use `BuildRows`, `ProbeRows`, `BuildTime`, `ProbeTime`, hash-table memory, and plan child order/RF arrows to identify the sides. +- In Doris plans, the hash table is normally built from the join's build/right child, but verify with counters or RF arrows when the profile is merged or the plan has been transformed. + +Nested-loop/cross join: + +- `CROSS_JOIN_SINK_OPERATOR` is the build/materialized side. +- `CROSS_JOIN_OPERATOR` is the probe/loop side. Its decisive counters are `ProbeRows`, `LoopGenerateJoin`, `JoinConjunctsEvaluationTime`, `FilteredByJoinConjunctsTime`, and `OutputTempBlocksTime`. +- A nested-loop join over a large intermediate result is often a plan-shape problem even when the direct operator cost is correctly attributed to nested-loop active timers. + +Runtime filters: + +- `RFxxx <- expr` is the source/build side that produces the filter. +- `RFxxx -> expr` is the target/probe scan side that may be pruned. +- A useful order usually produces filters from a small/selective build side early enough to prune a large probe-side scan. A suspicious order scans/builds a large fact side before the small/selective branch can produce its RF. +- An empty RF is not automatically a good outcome. If the query spends most of its time scanning the RF source just to produce an empty/tiny filter, then the RF source side may be the wrong side even though downstream targets skipped work. +- Empty downstream target scans prove only that the RF arrived before those targets did work. They do not prove the source choice was good. If the source scan/build dominates elapsed time, treat the plan as "RF direction likely bad: the expensive side produced the pruning signal instead of receiving one" unless there is concrete counterevidence. + +Memo and plan alternatives: + +- If memo or verbose explain contains multiple physical expressions for the same join group, inspect both child orders. A reversed physical expression is valid counterfactual evidence even when no fast profile is present. +- Compare the chosen side against table keys, distribution keys, colocate/bucket properties, and available RF targets. If the chosen RF source uses non-key predicates and scans massively, while the other side's join key can target the source table's key/prefix/distribution key, classify the chosen order as suspicious or bad. +- Do not accept the optimizer's estimated lower cost as proof of correctness when actual profile rows contradict the estimate. A low estimated build cardinality that turns into a huge zero-output scan is evidence of bad join-order costing. + +## 2. Decide Whether the Join Order Is Reasonable + +Treat join-order diagnosis as a plan-shape layer, separate from immediate runtime cost. Report both layers. + +Strong evidence of bad join order: + +- A paired fast/slow profile, hint, or SQL rewrite with the same intended result changes only join shape/RF direction and turns a large scan/hash build into tiny or zero scan rows. +- A large fact table is scanned or used as hash-build input before a selective dimension/subquery branch is reduced to a few rows. +- The slow plan builds a hash table on millions/billions of rows while an alternate legal shape builds on a tiny side and uses RF to prune the large side. +- A scan is the active bottleneck only because the RF that could prune it is produced later, on the wrong side, or after the scan has already done most work. +- The plan creates a huge intermediate result and a later join/filter reduces it to zero or a tiny row count, especially with contradictory predicates or predicates that could have been pushed below the join. +- Estimates in explain/memo say a branch is large, but profile actual rows show it is tiny, causing the optimizer to choose the wrong build/probe side. + +Single slow profile evidence can also be strong enough. Do not require a fast profile when these patterns appear: + +- Empty-probe or empty-result build waste: a hash join builds/shuffles millions or billions of rows, but `ProbeRows` or final output is zero/tiny. The direct cost is build/scan, but the plan-shape cause is that the large build side was scheduled before proving the probe/output side was empty or highly selective. If join type semantics make a swap uncertain, say "likely bad join order/build-side scheduling" rather than "not join order". +- RF-source inversion: the current RF source/build side is itself the expensive unpruned scan and produces zero/tiny rows only after scanning a huge table. Its RF target scans then wait and skip work. This is not merely a scan/index problem; it means the plan chose the side that needed pruning as the side responsible for producing the pruning signal. Check whether the expensive side's join key is a table key/prefix, distribution key, or colocated key and whether no RF is applied to that side. +- Expensive-empty source: a scan/build side does massive active work and outputs zero/tiny rows, then its RF makes other branches skip. Classify this as likely bad join order/RF direction even if that branch also has a missing index, non-key predicate, bad literal, or stale estimate. The other anomaly explains why the branch is expensive or empty; it does not make the expensive branch a good RF source. +- Late emptying predicate: a later join, semi join, analytic/filter, or non-equi conjunct reduces a large intermediate to zero/tiny rows. If that predicate references only a subset of the joined tables or is contradictory, classify the earlier large join/intermediate as bad join order or missed predicate placement. +- Ineffective RF to a large side: the plan contains an RF targeting a large scan, but profile shows it is ignored, not pushed down, arrives after scan work, has zero filter rows, or the large scan still produces millions of rows. If a small/selective branch exists, this is evidence of bad RF direction/timing even without a fast profile. +- Estimate/actual inversion: the optimizer estimates the build/source side tiny, but profile proves it performed massive scan work to produce zero/tiny rows, while downstream tables only waited for its RF. Treat this as a likely join-order/RF-direction error. Do not stop at "the build side scan is slow" unless the profile proves no join alternative could prune it. +- Key-target inversion: the expensive source table is scanned by non-key/non-prefix predicates, while its join key is a key prefix, distribution key, or otherwise better pruning target. If current order makes that table produce the RF instead of receive one, call the join order/RF direction bad unless profile proves the alternative cannot prune it. +- Huge build for empty/tiny probe: a hash join build side scans/shuffles/builds millions or billions of rows while the probe side is zero/tiny and final output is zero/tiny. Report direct cost as build/scan, and report plan-shape root cause as likely bad join order/build-side scheduling even for an outer join unless the profile proves no earlier empty-preserved-side check or legal reorder can avoid the build. +- Contradictory non-equi join late: an inner nested-loop/cross join has conjuncts that cannot both be true, or a highly selective non-equi relation, but the plan first materializes a large intermediate. This is not just expression simplification; it is bad join order/predicate placement because the impossible/selective relation was applied after the row explosion. + +When any strong single-profile pattern above matches, do not write "join order not proven" as the main diagnosis. Use one of these verdicts: + +- `bad`: a paired plan, memo alternative, legal rewrite, key/RF target evidence, or inner-join contradiction shows the expensive work should have been avoided. +- `likely bad`: the profile alone proves huge wasted scan/build/intermediate work before an empty/tiny side or late predicate, but legal alternatives still need confirmation. +- `suspicious`: only for weak patterns where build/probe/RF sides or row-flow evidence are incomplete. Do not use `suspicious` for expensive-empty source, huge build for empty/tiny probe, tiny semi key after fact build, or contradictory non-equi join late. + +Keep uncertainty scoped correctly: + +- It is acceptable to say the exact fix or alternate legal order is not proven. +- It is not acceptable to demote the diagnosis to "not join order" or "only scan/literal/index problem" when the profile already proves the large side was paid before an empty/tiny side could prune it. +- In the conclusion, put the join verdict before coexisting anomalies when the user's question is profile performance: "runtime bottleneck is X; join order/build scheduling is likely bad; bad literal/stats/index may explain why the selective side became empty." + +Weaker evidence: + +- A single profile shows large scan/build work and late/weak RF, but there is no alternate plan, no SQL/hint comparison, and no stats/memo evidence. +- Outer/semi/anti joins restrict legal reorder. A costly order may be required by semantics unless an equivalent rewrite or optimizer rule is known. + +Do not stop at "scan bottleneck" when the scan exists because of join order. Say: "runtime bottleneck is the scan/hash build; likely plan-shape root cause is join order/RF direction because ..." + +## 3. Join Type Constraints + +- Inner joins are usually reorderable; choose the shape that builds on the smaller/selective side and probes/prunes the larger side. +- For left outer join, the left side is preserved. For right outer join, the right side is preserved. Swapping can change semantics unless the optimizer rewrites the join equivalently. +- Outer join preservation does not make every build/probe choice good. If a preserved-side scan/probe is empty after its own predicates while the non-preserved build side was already fully scanned/built, report the direct build waste and classify the order as likely bad scheduling or missed empty-side short-circuit/predicate placement. Keep the semantic caveat, but do not downgrade it to a pure scan problem. +- For outer joins, distinguish logical join order from physical work order. Even when logical reordering is constrained, a profile can still prove bad physical build-side scheduling if Doris builds the non-preserved side before discovering the preserved side has zero rows. +- For left outer joins in hash execution, `ProbeRows=0` with huge right/build `BuildRows` is a likely bad physical work order: the build side was paid before the preserved side was known empty. State that even if the SQL contains another bug, and make the legal-rewrite caveat secondary. +- For semi/anti joins, the preserved/output side and filter side matter. A good order applies the selective semi/anti result before scanning or building large downstream joins when legal. +- For semi joins, a tiny key set from the filter side that is applied only after a large fact scan/build is strong evidence of bad join order or RF timing. +- For non-equi joins, nested-loop may be required, but the order can still be bad if it materializes a huge intermediate before applying a selective or impossible predicate. A contradictory join conjunct on an inner join is a join-order/predicate-placement failure when earlier joins create large rows that the contradictory join later discards. State "join order/predicate placement is bad"; do not stop at "constant folding/predicate simplification missing". + +## 4. Distinguish Join Order From Other Anomalies + +Other anomalies can coexist with bad join order: + +- Bad literal rewrite, stale stats, missing index, weak partition pruning, or an expensive scalar expression may explain why one branch is selective or empty. +- They do not explain why the engine scanned/built another huge side before using that selectivity, or why an empty/tiny branch was unable to prune the expensive side. +- State both: "runtime bottleneck is large scan/build; another anomaly explains selectivity; join order/RF direction is still wrong because the plan pays the large side before the selective/empty branch can prune it." + +Avoid this failure mode: calling the expensive RF-source scan a pure access-path problem when the rest of the plan proves that all other large scans waited for its empty/tiny RF. That is exactly the case where the RF source/target choice should be challenged. + +Avoid this failure mode: calling a large build before an empty probe "only an outer join semantic constraint". If the preserved side is empty after its own predicates, then the build side contributed nothing to the result; the plan order/scheduling failed to exploit the empty side. + +Avoid this failure mode: calling an impossible non-equi join only a predicate simplification issue. If a large intermediate is produced before the impossible join, the bad symptom is join order/predicate placement: the impossible or highly selective relation was applied too late. + +## 5. Output Checklist + +When a query has joins, include these facts: + +- `Runtime bottleneck`: the active scan/join/build/probe cost proven by counters. +- `Join-order diagnosis`: whether the order/build side/RF direction is good, suspicious, or bad. +- `Build/probe evidence`: build side rows/time/memory, probe side rows/time, and join type. +- `RF evidence`: source side, target side, wait, filtering/pruning, and whether it arrived early enough. +- `Counterfactual evidence`: paired profile, hint, rewrite, memo, or stats that shows a better legal order. If absent, say the join-order conclusion is likely but not proven. diff --git a/skills/doris-profile-reader/references/operator-guide.md b/skills/doris-profile-reader/references/operator-guide.md new file mode 100644 index 0000000..66fe07f --- /dev/null +++ b/skills/doris-profile-reader/references/operator-guide.md @@ -0,0 +1,140 @@ +# Operator Guide + +This guide covers the major Doris runtime profile operator families. Use `source-profile-inventory.md` for exact source registration lines and counter/source anchors. + +## Scan Operators + +`OLAP_SCAN_OPERATOR` reads Doris internal storage. First check `ScanRows`, `RowsRead`, `ScanBytes`, `ScannerCpuTime`, `ScannerGetBlockTime`, `ScannerWorkerWaitTime`, `IOTimer`, `DecompressorTimer`, predicate timers, lazy read timers, runtime-filter wait/filter counters, and row-filter counters. A long scan is important when active scanner timers and large bytes/rows agree. If `ExecTime` is mostly explained by `RuntimeFilterInfo/RFx WaitTime`, `AcquireRuntimeFilter`, or `WaitForRuntimeFil [...] + +`FILE_SCAN_OPERATOR` reads external files through file scanner and connector code. Prioritize file count/split shape, file read/open/list timers, `FileScannerGetBlockTime`, `ApplyAllRuntimeFilters`, scan rows/bytes, and connector I/O. Metadata/listing time can dominate small queries; data read/decode dominates large ones. + +`JDBC_SCAN_OPERATOR` reads remote JDBC sources. Prioritize remote query/read/fetch timers, rows, bytes, and network/serialization. A slow JDBC scan may be remote database latency rather than Doris CPU. + +`GROUP_COMMIT_SCAN_OPERATOR` scans group-commit buffered data. Prioritize rows, blocks, group-commit state, and wait/dependency only as queue state. + +`SCHEMA_SCAN_OPERATOR` reads schema/system metadata. Usually small. If slow, check metadata service/catalog lookup timers and rows produced. + +`META_SCAN_OPERATOR` reads internal metadata. Treat it like a metadata scan; long active time points to metadata lookup or serialization, not table data scan. + +`DATA_GEN_SCAN_OPERATOR` generates rows. Prioritize generated row count and expression/generation active timers. + +`REC_CTE_SCAN_OPERATOR` reads recursive CTE state. Use recursive CTE source/sink counters and rows per iteration to find expansion. + +`CACHE_SOURCE_OPERATOR` reads cached intermediate data. Prioritize cache tablet/id metadata, cache hit/read counters, rows/bytes, and active read time. + +`EMPTY_SET_OPERATOR` or `EMPTY_SET_SOURCE_OPERATOR` produces no data. It should not be the bottleneck unless profile setup/open time is anomalously high. + +## Exchange Operators + +`EXCHANGE_OPERATOR` receives data from sender queues. Prioritize `GetDataFromRecvrTime`, deserialize/decompress/filter/merge timers, local/remote bytes received, and produced rows. `WaitForData0`/`WaitForDataN`, `DataArrivalWaitTime`, and `FirstBatchArrivalWaitTime` mean the receiver waited for data. They usually point to upstream work or network/backpressure, not exchange CPU. + +`DATA_STREAM_SINK_OPERATOR` sends data across fragments. Prioritize `SerializeBatchTime`, `CompressTime`, `DistributeRowsIntoChannelsTime`, `SplitBlockHashComputeTime`, `LocalSendTime`, `BytesSent`, `LocalBytesSent`, `LocalSentRows`, `RpcMaxTime`/`RpcAvgTime`, and throughput. High RPC time with non-trivial bytes is exchange/network evidence even when local serialization is small. `WaitForRpcBufferQueue`, `WaitForBroadcastBuffer`, and `WaitForLocalExchangeBufferN` are buffer/backpressure waits. + +`LOCAL_EXCHANGE_OPERATOR` and `LOCAL_EXCHANGE_SINK_OPERATOR` reshuffle data inside one BE. Prioritize hash computation, copy/push/pop timers, local bytes/rows, and skew. Local exchange waits indicate sender/receiver imbalance or downstream pressure. + +`LOCAL_MERGE_SORT_SOURCE_OPERATOR` receives locally sorted streams. Prioritize merge/get-next/sorted output active timers and rows. A large dependency wait usually means it waited for sort sink completion. + +`MULTI_CAST_DATA_STREAM_SINK_OPERATOR` and `MULTI_CAST_DATA_STREAM_SOURCE_OPERATOR` fan out data to multiple consumers. Prioritize copied bytes/rows, channel buffer state, source wait for runtime filters when present, and downstream imbalance. Wait counters identify which consumer/channel delayed progress. + +## Aggregation Operators + +`AGGREGATION_SINK_OPERATOR`, `AGGREGATION_OPERATOR`, `BUCKETED_AGGREGATION_SINK_OPERATOR`, `BUCKETED_AGGREGATION_OPERATOR`, `PARTITIONED_AGGREGATION_SINK_OPERATOR`, and `PARTITIONED_AGGREGATION_OPERATOR` perform hash/merge aggregation. Prioritize `InputRows`, `RowsProduced`, `BuildTime`, `HashTableComputeTime`, `HashTableEmplaceTime`, `HashTableInputCount`, `HashTableSize`, `MergeTime`, `DeserializeAndMergeTime`, `GetResultsTime`, `HashTableIterateTime`, memory counters, and spill counte [...] + +`STREAMING_AGGREGATION_OPERATOR` aggregates when input order/streaming mode allows less state. Prioritize `StreamingAggTime`, build/hash table counters, input/output rows, and whether memory/hash table is unexpectedly high. + +`DISTINCT_STREAMING_AGGREGATION_OPERATOR` handles distinct streaming aggregation. Prioritize distinct hash table compute/emplace, input rows, hash table size, and rows produced. + +## Join Operators + +`HASH_JOIN_SINK_OPERATOR` builds hash join state and runtime filters. Prioritize `InputRows`, `BuildTime`, `BuildHashTableTime`, `BuildTableInsertTime`, `BuildExprCallTime`, `MergeBuildBlockTime`, `MemoryUsageHashTable`, build memory arenas, `PublishTime`, and `RuntimeFilterInfo`. Build dependency waits are sequencing state, not build CPU. + +`HASH_JOIN_OPERATOR` probes hash join state. Prioritize `ProbeRows`, `ProbeIntermediateRows`, `ProbeExprCallTime`, `ProbeWhenSearchHashTableTime`, non-equi conjunct time, output block build time, rows produced, and probe memory. Long wait for build dependency means probe waited for the build side. + +`PARTITIONED_HASH_JOIN_SINK_OPERATOR` and `PARTITIONED_HASH_JOIN_PROBE_OPERATOR` are partitioned/spill-capable join paths. Prioritize build/probe rows, in-memory bytes, spill repartition/build/probe/recovery timers, spill rows/blocks, partition level/count, and memory. + +`CROSS_JOIN_SINK_OPERATOR` and `CROSS_JOIN_OPERATOR` are nested-loop style join paths. Prioritize build rows, probe rows, intermediate rows, `LoopGenerateJoin`, join conjunct evaluation/filter time, and output temp block time. Large output cardinality is often the root cause. If the cross join is only stitching together one-row scalar-subquery results, large dependency waits on the cross join are branch fan-in latency; inspect the child branch scans/joins/aggregations before blaming cros [...] + +`NestedLoopJoinBuildSinkOperatorX` and `NestedLoopJoinProbeOperatorX` appear as cross join operators in profiles. Interpret them with the cross-join counters. + +## Sort and Analytic Operators + +`SORT_SINK_OPERATOR` receives rows and sorts them. Prioritize input rows, sort/append timers, memory usage for sort blocks, and spill counters. + +`SORT_OPERATOR` outputs sorted rows. Prioritize merge/get-next timers, rows produced, output bytes, and source dependency waits only as sink-completion waits. + +`SPILL_SORT_SINK_OPERATOR` and `SPILL_SORT_SOURCE_OPERATOR` are sort paths with spill. Prioritize common spill counters, merge sort time, read/write file times, spilled bytes/rows, and memory. + +`PARTITION_SORT_SINK_OPERATOR` and `PARTITION_SORT_OPERATOR` sort by partition for analytic/window work. Prioritize hash table build, sorted data time, sorted partition input/output rows, and get-sorted time. + +`ANALYTIC_EVAL_SINK_OPERATOR` and `ANALYTIC_EVAL_OPERATOR` evaluate window functions. Prioritize input rows, partition/order/range compute timers, evaluation time, search timers, remove rows/time, blocks peak, and rows produced. Wait for dependency usually means waiting for buffered window state. + +## Set, Union, Repeat, and Projection Operators + +`UNION_SINK_OPERATOR` and `UNION_OPERATOR` combine child outputs. Prioritize input/output rows, copy/projection time, and child imbalance. Union is rarely the bottleneck unless it moves many rows or projects expensive expressions. + +`INTERSECT_OPERATOR`, `EXCEPT_OPERATOR`, `SET_SINK_OPERATOR`, and `SET_PROBE_SINK_OPERATOR` build/probe set state. Prioritize build/probe/get-data/filter timers, hash table rows, extracted rows, memory, and rows produced. + +`REPEAT_OPERATOR` expands rows for grouping sets/rollups. Compare input rows to rows produced and output bytes before blaming later exchange or aggregation. A repeat/rollup can create hundreds of millions of rows; the repeat operator's own active time may be moderate, while the real cost appears immediately after it as local hash shuffle, data stream exchange, and aggregation over the expanded rows. In that pattern, downstream wait counters are symptoms and the proof is repeat output vol [...] + +`MATERIALIZATION_OPERATOR` materializes slots/projections. Prioritize projection/expression time, output rows/bytes, child `RowIDFetcher` profile info, `MergeResponseTime`, and fetched row-id volume. A high materialization time can also be top/lazy materialization: the operator sends row-id based `multiget_data_v2` RPCs to fetch deferred columns after upstream pruning/top-N. In that case `MaxRpcTime` is remote row-id fetch latency for a batch, not local materialization CPU. Do not promot [...] + +`SELECT_OPERATOR` applies predicates/projections. Prioritize predicate/expression/projection active timers and rows filtered/produced. + +`TABLE_FUNCTION_OPERATOR` expands rows through table functions. Compare input rows to output rows and expression/table-function evaluation time. + +`ASSERT_NUM_ROWS_OPERATOR` verifies scalar-subquery row count. Usually not a performance bottleneck; check rows produced and assertion metadata if it appears. + +## Recursive CTE Operators + +`REC_CTE_SOURCE_OPERATOR`, `REC_CTE_SINK_OPERATOR`, `REC_CTE_ANCHOR_SINK_OPERATOR`, and `REC_CTE_SCAN_OPERATOR` implement recursive CTE data flow. Prioritize iteration counts if present, rows in/out per stage, materialized state size, and waits between recursive producer/consumer dependencies. Long waits usually indicate iteration sequencing. + +## Result and Output Sinks + +`RESULT_SINK_OPERATOR` and `ResultSink` return rows to the client. Prioritize bytes/rows sent, append/serialize/send timers, and client/network effects. A slow client can make the result sink look slow. + +`RESULT_FILE_SINK_OPERATOR` writes query results to files. Prioritize `FileWriteTime`, `FileWriterCloseTime`, bytes, file count, and serialization/compression. + +`MEMORY_SCRATCH_SINK_OPERATOR` writes to in-memory scratch/results. Prioritize conversion timers such as Arrow batch conversion, rows, and bytes. + +`BLACKHOLE_SINK_OPERATOR` consumes output without writing. Prioritize `BytesProcessed`/rows only as volume; it should not be I/O-bound. + +## Table and Connector Sinks + +`OLAP_TABLE_SINK_OPERATOR`, `OlapTableSinkOperatorX`, and `OlapTableSinkV2OperatorX` write into Doris storage. Prioritize append/channel timers, rowset build, close/load/commit timers, tablet/channel skew, rows/bytes, and memory. Close/commit only proves sink cost when writer timers agree. + +`GROUP_COMMIT_OLAP_TABLE_SINK_OPERATOR` and `GROUP_COMMIT_BLOCK_SINK_OPERATOR` write through group commit. Prioritize append blocks, group commit flush/queue/commit timers, rows/bytes, and close/wait states. + +`HIVE_TABLE_SINK_OPERATOR`, `ICEBERG_TABLE_SINK_OPERATOR`, `SPILL_ICEBERG_TABLE_SINK_OPERATOR`, `ICEBERG_DELETE_SINK_OPERATOR`, `ICEBERG_MERGE_SINK_OPERATOR`, `MAXCOMPUTE_TABLE_SINK_OPERATOR`, `JDBC_TABLE_SINK_OPERATOR`, and `TVF_TABLE_SINK_OPERATOR` write to external systems. Prioritize connector write time, file write/close time, delete-file/write-file counts, commit/close timers, bytes/rows, and spill counters for spill-capable paths. External system latency may dominate. + +`DICTIONARY_SINK_OPERATOR` builds/publishes dictionary data. Prioritize build/publish/write timers, rows, bytes, and memory. + +`CACHE_SINK_OPERATOR` writes cacheable intermediate data. Prioritize bytes written, cache write timers, tablet/cache metadata, and memory. + +## Source Coverage Index + +The narrative sections above cover these factory-created plan nodes and data sinks from the local source inventory: + +Plan nodes: + +`OLAP_SCAN_NODE`, `GROUP_COMMIT_SCAN_NODE`, `JDBC_SCAN_NODE`, `FILE_SCAN_NODE`, `EXCHANGE_NODE`, `AGGREGATION_NODE`, `BUCKETED_AGGREGATION_NODE`, `HASH_JOIN_NODE`, `CROSS_JOIN_NODE`, `UNION_NODE`, `SORT_NODE`, `PARTITION_SORT_NODE`, `ANALYTIC_EVAL_NODE`, `MATERIALIZATION_NODE`, `INTERSECT_NODE`, `EXCEPT_NODE`, `REPEAT_NODE`, `TABLE_FUNCTION_NODE`, `ASSERT_NUM_ROWS_NODE`, `EMPTY_SET_NODE`, `DATA_GEN_SCAN_NODE`, `SCHEMA_SCAN_NODE`, `META_SCAN_NODE`, `SELECT_NODE`, `REC_CTE_NODE`, `REC_CTE_ [...] + +Data sinks: + +`DATA_STREAM_SINK`, `RESULT_SINK`, `DICTIONARY_SINK`, `GROUP_COMMIT_OLAP_TABLE_SINK`, `GROUP_COMMIT_BLOCK_SINK`, `HIVE_TABLE_SINK`, `ICEBERG_TABLE_SINK`, `ICEBERG_DELETE_SINK`, `ICEBERG_MERGE_SINK`, `MAXCOMPUTE_TABLE_SINK`, `JDBC_TABLE_SINK`, `MEMORY_SCRATCH_SINK`, `RESULT_FILE_SINK`, `MULTI_CAST_DATA_STREAM_SINK`, `BLACKHOLE_SINK`, `TVF_TABLE_SINK`, `OLAP_TABLE_SINK`. + +Operator classes: + +`AggSinkOperatorX`, `AggSourceOperatorX`, `AnalyticSinkOperatorX`, `AnalyticSourceOperatorX`, `AssertNumRowsOperatorX`, `BucketedAggSinkOperatorX`, `BucketedAggSourceOperatorX`, `CacheSourceOperatorX`, `DataGenSourceOperatorX`, `DictSinkOperatorX`, `DistinctStreamingAggOperatorX`, `EmptySetSourceOperatorX`, `ExchangeSinkOperatorX`, `ExchangeSourceOperatorX`, `FileScanOperatorX`, `GroupCommitBlockSinkOperatorX`, `GroupCommitOperatorX`, `HashJoinBuildSinkOperatorX`, `HashJoinProbeOperatorX [...] + +Named operators outside the factory list but present in profile/source registration: + +`CACHE_SINK_OPERATOR`, `CACHE_SOURCE_OPERATOR`, `DISTINCT_STREAMING_AGGREGATION_OPERATOR`, `LOCAL_EXCHANGE_OPERATOR`, `LOCAL_EXCHANGE_SINK_OPERATOR`, `LOCAL_MERGE_SORT_SOURCE_OPERATOR`, `MULTI_CAST_DATA_STREAM_SOURCE_OPERATOR`, `PARTITIONED_AGGREGATION_OPERATOR`, `PARTITIONED_AGGREGATION_SINK_OPERATOR`, `PARTITIONED_HASH_JOIN_PROBE_OPERATOR`, `PARTITIONED_HASH_JOIN_SINK_OPERATOR`, `REC_CTE_ANCHOR_SINK_OPERATOR`, `SET_PROBE_SINK_OPERATOR`, `SET_SINK_OPERATOR`, `SPILL_ICEBERG_TABLE_SINK_OP [...] + +## Fallback for Unknown Operator Names + +If a profile contains an operator not listed above: + +1. Search `source-profile-inventory.md` for the exact operator name or class. +2. Use its source path to identify the family: scan, exchange, join, aggregation, sort, sink, set, metadata, or utility. +3. Classify each counter by `counter-semantics.md`. +4. Do not infer bottleneck from wait counters without active work or volume evidence. diff --git a/skills/doris-profile-reader/references/reading-workflow.md b/skills/doris-profile-reader/references/reading-workflow.md new file mode 100644 index 0000000..6593401 --- /dev/null +++ b/skills/doris-profile-reader/references/reading-workflow.md @@ -0,0 +1,135 @@ +# Doris Profile Reading Workflow + +## 1. Confirm Profile Shape + +Start from the query id, elapsed time, SQL, FE `Summary`, `Execution Profile`, `MergedProfile`, and `DetailProfile`. + +- Use `MergedProfile` first to find candidate fragments/operators quickly. +- Use `DetailProfile` to check skew, per-instance outliers, and whether a large merged time is just parallel accumulation. +- Use the plan text to map operator names back to tables, join sides, grouping keys, exchanges, limits, and runtime-filter direction. +- If the plan has joins, keep two layers in the analysis: immediate runtime bottleneck and plan-shape cause. A scan or hash build may be the active cost while a bad join order/RF direction is the reason that cost exists. +- Keep query-lifecycle time separate from BE operator work. FE `Plan Time`, `Schedule Time`, `Wait and Fetch Result Time`, and client/result fetch time can explain wall-clock latency, especially for small or empty queries, but they are not scan/join/aggregation CPU. If they dominate, report them as a separate lifecycle finding before ranking BE operators. +- If `Is Cached: Yes`, `Total Instances Num: 0`, or the BE detail is absent, the profile is not enough for operator bottleneck analysis. Re-run with profile enabled and SQL cache disabled. + +## 2. Build a Counter Classification Before Ranking + +Classify each large counter before interpreting it: + +- Active work: `ExecTime` after qualifying child wait evidence, plus direct operator custom timers such as scan CPU, hash table build/probe, sort, aggregation, serialization, compression, I/O, decompression, spill read/write. +- Data volume: `InputRows`, `RowsProduced`, `RowsRead`, `ScanRows`, `ScanBytes`, bytes sent/received, blocks produced. +- Wait/backpressure: `WaitForDependencyTime`, `WaitForDependency[...]Time`, `WaitForDataN`, `DataArrivalWaitTime`, `FirstBatchArrivalWaitTime`, `WaitForRpcBufferQueue`, `WaitForBroadcastBuffer`, local-exchange buffer waits, `PendingFinishDependency`, `WaitWorkerTime`, memory reserve waits. +- Resource pressure: memory peak, spill counters, scanner queue wait, remote/local bytes, cache miss/read timers. +- Optimizer/context info: join type, colocate/shuffle flags, partitioner, grouping keys, runtime-filter info, predicates, projections. +- Query lifecycle/context: FE plan time, schedule time, wait/fetch result time, result sink/client fetch time. Useful for wall-clock explanation, not proof of BE operator cost. + +Only active work, data volume, and resource pressure can directly support a bottleneck claim. Wait/backpressure counters explain pipeline shape and symptoms, not direct operator cost by themselves. + +## 3. Rank Candidate Bottlenecks + +Use this priority order: + +1. Query elapsed and the largest active operator families. +2. Large rows/bytes that explain the active time. +3. Skew in `max` versus `avg` or a single instance much slower than peers. +4. Spill, memory peak, and queue/thread waits. +5. Runtime-filter effectiveness and wait. +6. Network/exchange serialization, compression, and remote bytes. +7. Init/open/close only if they remain large after excluding parallel accumulation and waits. + +Do not rank by the largest wait-like counter. A profile where `EXCHANGE_OPERATOR WaitForData0` is the largest counter commonly means the exchange waited for upstream data. + +When several scans or branches each contribute meaningful rows, bytes, memory, or active time, do not force a single-operator bottleneck. First summarize the branch family or operator family, then name the strongest member only as an example or leading contributor. + +After ranking the direct cost, perform a join-order pass for multi-join plans. Identify build/probe sides, RF source/target sides, and whether a more selective branch was scheduled too late to prune a large scan or build. Use `join-order-diagnosis.md` for the decision rules. + +## 4. Interpret Merged Timers Correctly + +Merged profile values often include `sum`, `avg`, `max`, and `min`. Treat them differently: + +- `sum` can exceed wall-clock elapsed time because it accumulates across parallel fragments, drivers, scan ranges, and scanner threads. +- `max` is usually better for wall-clock suspicion. +- `avg` plus `max` reveals skew. +- `min` is useful only for skew context. + +If an operator appears many times, a long summed `ExecTime` may represent normal parallel CPU. Compare per-instance `max` to query elapsed and compare rows/bytes per instance. + +## 5. Read Pipeline Waits as Dependencies + +Pipeline tasks register wait/dependency counters for start prerequisites, upstream data readiness, downstream write availability, memory reservation, runtime filters, and local/remote queues. These counters answer "what was this driver waiting for", not "where CPU was spent". + +Examples: + +- `WaitForDependency[OLAP_SCAN_OPERATOR_DEPENDENCY]Time`: scan source was blocked by its dependency state. +- `WaitForDependency[HASH_JOIN_BUILD_DEPENDENCY]Time`: probe/build sequencing or shared build state dependency. +- `WaitForData0`: exchange source waited for sender queue data. +- `WaitForRpcBufferQueue`: exchange sink waited for RPC buffer capacity. +- `WaitForLocalExchangeBufferN`: sender waited for a local exchange channel. +- `WaitWorkerTime`: pipeline task waited for worker scheduling. +- `ScannerWorkerWaitTime`: scanners waited in the scan worker pool. + +Large waits are actionable only after you identify the upstream/downstream resource that caused them. + +## 6. Diagnose by Operator Family + +Use the operator guide and choose the family-specific proof: + +- Scan bottleneck: high scan active time plus large `ScanRows`/`ScanBytes`, high `ScannerCpuTime`, I/O/decompression/lazy-read/predicate timers, or scanner queue wait. +- Scan/RF latency trap: if scan `ExecTime` is close to `RuntimeFilterInfo` `RFx WaitTime`, `AcquireRuntimeFilter`, or `WaitForRuntimeFilter`, treat the scan as waiting for filters, not spending CPU in storage. Detail profiles can expose `WaitForRuntimeFilter` even when merged `RFx WaitTime` is zero or the filter timed out. `TOPN OPT`, `TopNFilterSourceNodeIds`, or `SharedPredicate` on a scan should be read as the same filter-wait family before blaming storage CPU. Then decide whether the [...] +- Join bottleneck: high build/probe active time, large build/probe rows, hash table memory, non-equi conjunct time, runtime filter publish/build, or spill in partitioned join. +- Aggregation bottleneck: high build/merge/hash-table times, large `InputRows`, large hash table size/memory, or spill. +- Sort/window bottleneck: high sort/evaluation active time, large input rows, memory pressure, or spill. +- Exchange bottleneck: high serialization/compression/send/receive/merge time, high `RpcMaxTime`/`RpcAvgTime` on `DATA_STREAM_SINK_OPERATOR`, and bytes. Data-arrival waits alone are not enough. +- Sink/write bottleneck: high append/write/commit/close/load timers and bytes/rows. Output `PendingFinishDependency` alone is not enough. + +For repeated CTEs, repeated common subplans, expanded `UNION ALL` branches, or scalar-subquery fan-in plans, do not summarize from one representative branch. Build a small branch inventory: table/source, scan rows/bytes, rows produced, major join/aggregation rows, memory/spill evidence, and whether the apparent time is active work or RF/dependency wait. Then aggregate the pattern mentally. Repeated branches can make the true bottleneck "many medium scans plus RF waits" rather than one si [...] + +Small dimension scans often show high apparent `ExecTime` or detail scanner timers because they waited for runtime filters or because merged counters accumulate across many tiny scanners. Before promoting a dimension scan over a fact branch, compare rows/bytes, branch role, output impact, and whether the timer is mostly RF wait. If the fact branches dominate row volume or downstream aggregation, report the dimension scan as supporting/filtering work instead of the main bottleneck. + +## 7. Runtime Filter Pass + +If runtime filters appear, always answer: + +- Which join/build side produced the filter. +- Which scan/probe side applied it. +- Whether it filtered rows, was always true, arrived too late, or forced meaningful wait. +- Whether a long wait is justified by saved scan work. + +See `runtime-filters.md`. + +## 8. Join Order Pass + +For every multi-join profile, always answer: + +- Which side each important join builds on, and which side probes. +- Whether the build side is unexpectedly large compared with the probe/other available side. +- Whether runtime filters flow from a selective side to a large scan early enough to save work. +- Whether paired profiles, hints, rewritten SQL, memo estimates, or actual rows prove a better order. +- Whether memo contains an unchosen reversed join expression, or schema/distribution keys show that the current RF source should instead be an RF target. + +If the active bottleneck is a scan or hash build that would disappear under a better join order, report the scan/build as the runtime bottleneck and the join order/RF direction as the likely root cause. + +When only a slow profile is available, still make a join-order judgment. Use "likely" when the evidence is circumstantial, but do not answer "not proven" merely because there is no paired fast profile. Empty probes, empty final results after huge build/scan work, huge intermediates later eliminated, and RF-source inversion are enough to call the order likely bad when build/probe/RF sides and row counts are visible. + +Use this red-flag pass before writing the conclusion: + +- Expensive RF source emits empty/tiny RF and all targets skip: join order/RF direction is likely bad, not proven good. +- Huge build/shuffle before zero/tiny probe or result: build/probe order or build scheduling is likely bad, even if another predicate explains why the probe is empty. +- Huge intermediate before a contradictory or highly selective inner/non-equi join: join order/predicate placement is bad. +- Tiny semi/subquery key set is applied after a large fact scan/build: join order/RF timing is likely bad. + +If one of those red flags matches, the conclusion must say `bad` or `likely bad`. Use `not proven` only for the exact alternate legal plan or repair, not for whether the observed order is wrong. + +If your reasoning says "plan shape", "predicate placement", "empty probe", "late contradictory predicate", or "expensive RF source", translate that into an explicit join-order diagnosis. The required output is not complete until it says whether the join order/build-probe/RF direction is good, suspicious, or bad. + +See `join-order-diagnosis.md`. + +## 9. Output Standard + +A good profile explanation is evidence-bounded: + +- Quote exact operator and counter names. +- State whether each quoted timer is active, wait, or accumulated. +- Tie scan/join/exchange metrics to table or plan node when possible. +- For joins, include a short build/probe/RF-direction judgment, even when the direct bottleneck is a scan. +- Avoid generic tuning advice unless the profile proves the need. +- End with missing evidence if the profile is incomplete. diff --git a/skills/doris-profile-reader/references/runtime-filters.md b/skills/doris-profile-reader/references/runtime-filters.md new file mode 100644 index 0000000..03ad91d --- /dev/null +++ b/skills/doris-profile-reader/references/runtime-filters.md @@ -0,0 +1,92 @@ +# Runtime Filter Interpretation + +## Direction + +Doris runtime filters are generated from join build-side expressions and applied on target/probe-side scan expressions when the planner can map the slots. + +Plan notation: + +- `RF000[min_max] <- build_expr`: the filter is produced by this join/build side. +- `RF000[min_max] -> target_expr`: the filter is applied to this scan/target side. +- `RFs=[RF0 build->target]`: physical plan join-side relationship. +- `JRFs= RF0` on a scan: the scan has joined runtime filters to apply. + +Do not invert these. The arrow into the RF (`<-`) is source/build. The arrow out of the RF (`->`) is target/scan. + +## Where to Look in Profiles + +Build/source side: + +- Join sink or build operator `RuntimeFilterInfo`. +- `PublishTime` when publishing filters is visible. +- Build-side row count, join type, hash table size, and build time. + +Target/scan side: + +- Scan `WaitForRuntimeFilter` or `AcquireRuntimeFilter`. +- `RuntimeFilterInfo` entries such as `RF0 InputRows`, `RF0 FilterRows`, `RF0 AlwaysTrueFilterRows`, `RF0 WaitTime`. +- Top-N or dynamic filter evidence such as `TOPN OPT`, `TopNFilterSourceNodeIds`, and `SharedPredicate`. These can add `WaitForRuntimeFilter` even when individual merged `RFx WaitTime` counters are zero or the filter state is `TIMEOUT`. +- Scan rows/bytes and row-filter counters before deciding whether the RF helped. +- If scan `ExecTime` is approximately equal to RF wait counters, the scan was mostly waiting for runtime filters. Do not report that as scanner CPU or storage I/O without supporting `ScannerCpuTime`, `IOTimer`, `DecompressorTimer`, or `ScanBytes` evidence. +- `RFx WaitTime`, scan-level `WaitForRuntimeFilter`, `AcquireRuntimeFilter`, and scanner child timers are not guaranteed to be additive. They can overlap, be per-filter, or be accumulated at different scopes. Use them to classify latency and causality, not to compute exact residual CPU by subtraction. +- `RFx InputRows`, `RFx FilterRows`, and `RFx AlwaysTrueFilterRows` are per-filter/parallel counters. Do not add all filters together as if each row were counted once. +- Storage-pushed RF pruning may be visible as `RowsKeyRangeFiltered`, `RowsStatsFiltered`, `RowsZoneMapRuntimePredicateFiltered`, `RowsBloomFilterFiltered`, or related scan-detail counters. A zero `RFx FilterRows` does not by itself prove the RF was useless. + +Exchange/sink side: + +- Runtime filters may be carried through data stream sinks between fragments. Use the plan to connect source and target. +- Multicast/local-exchange sources can expose runtime-filter waits while distributing data to multiple consumers. Classify those waits as consumer/filter synchronization first; then identify which downstream scan or branch actually benefited. + +## Effectiveness Rules + +Useful filter: + +- `FilterRows` is large relative to `InputRows`. +- Scan `RowsRead`/downstream rows are much lower than `ScanRows`. +- Scan-detail pruning counters such as key-range, zone-map, bloom, stats, or condition-filter rows are large. +- Filter wait is small compared with saved scan/join work. +- The target table is large enough that pruning matters. +- The filter is produced by a small/selective build side early enough to avoid large target-side scan or join-build work. + +Weak or useless filter: + +- `AlwaysTrueFilterRows` dominates. +- `FilterRows = 0` or tiny relative to `InputRows`. +- `WaitTime` is visible but no meaningful rows are filtered. +- Filter arrives after most scan work, or scan had no large table to prune. + +Potentially harmful filter: + +- Long `WaitForRuntimeFilter`/`AcquireRuntimeFilter` and little filtering. +- Scan `ExecTime` dominated by `RFx WaitTime`, especially on a small dimension table or a branch whose target-side filtering is weak. +- Broadcast or build side is slow, causing scan-side idle time without saving I/O. +- Many filters add predicate overhead while scan rows are already small. +- The filter source is available only after the large table has already been scanned or built into a hash table; this usually points to a join-order/RF-direction problem rather than a scan implementation problem. +- The expensive scan produces the RF, while other large scans wait for it and then skip. If the RF source had to scan huge data to produce an empty/tiny filter, question the source/target choice; the source side may be the side that should have received an earlier RF instead. +- Empty/tiny filter output after a massive source scan is a red flag, not a success signal. Classify the target-side skip as saved downstream work, but classify the source-side massive work as likely bad RF direction unless concrete plan evidence rules out an earlier pruning direction. + +## Filter Types + +- Bloom/in filter: good for equality join pruning; check false positive and target cardinality indirectly by rows filtered. +- Min/max filter: cheap range pruning; useful only when build-side min/max is selective for the target. +- Bitmap filter: specialized nested-loop/bitmap path; check build bitmap construction and target filtering. +- IN/IN_OR_BLOOM variants: planner/runtime chooses representation based on cardinality and settings. + +The filter type alone does not prove benefit. Always use target-side rows and wait. + +## Common Mistakes + +- Mistaking source/build side for target/scan side because of arrows. +- Treating `WaitForRuntimeFilter` as scan CPU. +- Calling a runtime filter effective from its existence only. +- Treating an active large scan as a pure scan problem without asking whether an earlier RF from a different join order should have pruned it. +- Calling an empty RF useful without asking whether the plan paid too much to produce that empty RF. +- Calling a target-side skip proof of good order while ignoring that source-side scan/build dominated the query. +- Ignoring `AlwaysTrueFilterRows`, which often means the filter was not selective. +- Ignoring that filter counters may be accumulated across parallel scan instances. +- Calling a filter useless only because one `RFx FilterRows` is zero while scan-detail key-range, zone-map, bloom, or stats pruning removed rows. +- Subtracting RF wait counters mechanically from `ExecTime` or from each other despite overlapping scopes. + +## Example Pattern + +In one calibration profile, the plan showed min/max filters like `RF000[min_max] <- build_key` at the build side and `RF000[min_max] -> scan_key` at a scan. The profile had `RuntimeFilterInfo` with RF wait/filter counters, but the filters waited only tens of milliseconds and filtered no meaningful rows. The real evidence was large scan volume and active aggregation/exchange work, not runtime filter wait. diff --git a/skills/doris-profile-reader/scripts/extract_source_profile_inventory.py b/skills/doris-profile-reader/scripts/extract_source_profile_inventory.py new file mode 100755 index 0000000..55e9097 --- /dev/null +++ b/skills/doris-profile-reader/scripts/extract_source_profile_inventory.py @@ -0,0 +1,158 @@ +#!/usr/bin/env python3 + +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +"""Extract Doris profile operator/counter names from the current source tree.""" + +from __future__ import annotations + +import argparse +import re +from collections import defaultdict +from pathlib import Path + + +PROFILE_CONST_RE = re.compile(r'inline constexpr char\s+([A-Z0-9_]+)\[\]\s*=\s*"([^"]+)"') +MACRO_RE = re.compile( + r"\b(ADD_(?:CHILD_)?(?:TIMER|COUNTER)(?:_WITH_LEVEL)?|add_nonzero_counter|add_info_string|add_derived_counter)\s*\(" +) +STRING_RE = re.compile(r'"([^"]+)"') +NAME_ASSIGN_RE = re.compile(r'\b(?:_name|_op_name)\s*=\s*"([^"]+)"') +CLASS_RE = re.compile(r"\bclass\s+([A-Za-z0-9_]*OperatorX|[A-Za-z0-9_]*LocalState)\b") +FACTORY_RE = re.compile(r"std::make_shared<([A-Za-z0-9_]+OperatorX)>") +CASE_RE = re.compile(r"case\s+(T(?:PlanNode|DataSink)Type::[A-Z0-9_]+)") + + +def rel(path: Path, root: Path) -> str: + return str(path.relative_to(root)) + + +def load_profile_constants(root: Path) -> dict[str, str]: + header = root / "be/src/runtime/runtime_profile_counter_names.h" + constants = {} + for match in PROFILE_CONST_RE.finditer(header.read_text(encoding="utf-8")): + constants[f"profile::{match.group(1)}"] = match.group(2) + return constants + + +def extract_macro_args(line: str, start: int) -> str: + depth = 0 + out = [] + for char in line[start:]: + if char == "(": + depth += 1 + elif char == ")": + depth -= 1 + if depth == 0: + out.append(char) + break + out.append(char) + return "".join(out) + + +def counter_name_from_args(args: str, constants: dict[str, str]) -> str | None: + strings = STRING_RE.findall(args) + if strings: + # For ADD_CHILD_TIMER(profile, child, parent), the first string is the counter name. + return strings[0] + for const_name, value in constants.items(): + if const_name in args: + return value + return None + + +def main() -> int: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--repo", default="/mnt/disk3/zhaochangle/doris") + parser.add_argument("--output", default="/tmp/doris-profile-reader/source-profile-inventory.md") + args = parser.parse_args() + + root = Path(args.repo) + constants = load_profile_constants(root) + source_roots = [ + root / "be/src/exec/operator", + root / "be/src/exec/pipeline", + root / "be/src/exec/scan", + root / "be/src/exec/exchange", + root / "be/src/exec/sink", + root / "be/src/storage", + root / "be/src/runtime", + root / "fe/fe-core/src/main/java/org/apache/doris/common/profile", + root / "fe/fe-core/src/main/java/org/apache/doris/nereids/processor/post", + root / "fe/fe-core/src/main/java/org/apache/doris/planner", + ] + files = [] + for source_root in source_roots: + if source_root.exists(): + files.extend(p for p in source_root.rglob("*") if p.suffix in {".cpp", ".h", ".java"}) + + counters: dict[str, list[str]] = defaultdict(list) + names: dict[str, list[str]] = defaultdict(list) + classes: dict[str, list[str]] = defaultdict(list) + factories: dict[str, list[str]] = defaultdict(list) + cases: dict[str, list[str]] = defaultdict(list) + + for path in sorted(files): + text = path.read_text(encoding="utf-8", errors="ignore") + relative = rel(path, root) + for line_no, line in enumerate(text.splitlines(), 1): + for match in MACRO_RE.finditer(line): + args_text = extract_macro_args(line, match.end() - 1) + counter_name = counter_name_from_args(args_text, constants) + if counter_name: + counters[counter_name].append(f"{relative}:{line_no}") + for match in NAME_ASSIGN_RE.finditer(line): + names[match.group(1)].append(f"{relative}:{line_no}") + for match in CLASS_RE.finditer(line): + classes[match.group(1)].append(f"{relative}:{line_no}") + for match in FACTORY_RE.finditer(line): + factories[match.group(1)].append(f"{relative}:{line_no}") + for match in CASE_RE.finditer(line): + cases[match.group(1)].append(f"{relative}:{line_no}") + + lines = ["# Doris Source Profile Inventory", ""] + lines.append("## Factory Cases") + for key in sorted(cases): + lines.append(f"- `{key}`: {', '.join(cases[key][:6])}") + lines.append("") + + lines.append("## Operator Name Assignments") + for key in sorted(names): + lines.append(f"- `{key}`: {', '.join(names[key][:6])}") + lines.append("") + + lines.append("## Operator Classes In Factory") + for key in sorted(factories): + lines.append(f"- `{key}`: {', '.join(factories[key][:6])}") + lines.append("") + + lines.append("## Counter And Info Names") + for key in sorted(counters): + refs = counters[key] + lines.append(f"- `{key}` ({len(refs)} registrations): {', '.join(refs[:8])}") + lines.append("") + + output = Path(args.output) + output.parent.mkdir(parents=True, exist_ok=True) + output.write_text("\n".join(lines), encoding="utf-8") + print(f"wrote {output}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
