peterylh opened a new pull request, #57575:
URL: https://github.com/apache/doris/pull/57575
# Profile Archive Feature
## Summary
Add an automatic profile archiving feature that preserves query profiles beyond
the current memory and disk limits. When profile storage reaches capacity, old
profiles are archived to compressed ZIP files instead of being deleted,
enabling long-term profile retention for troubleshooting and analysis.
## Problem
Currently, Doris has strict limits on profile storage:
- **Memory profiles**: Maximum 500 profiles (`max_query_profile_num`)
- **Spilled profiles**: Maximum 500 profiles (`max_spilled_profile_num`)
- **Storage size**: Maximum 1GB (`spilled_profile_storage_limit_bytes`)
When these limits are exceeded, old profiles are **permanently deleted**,
making it impossible to analyze historical slow queries beyond the retention
window. This is problematic for:
- Post-incident analysis of production issues
- Long-term performance trend analysis
- Debugging intermittent query problems
## Solution
Implement an automatic profile archiving system that:
1. **Moves** outdated profiles to an archive directory instead of deleting them
2. **Batches** profiles into compressed ZIP files for efficient storage
3. **Retains** profiles for a configurable period (default 7 days)
4. **Provides** predictable file naming for easy profile location
### Key Features
- **Pending Buffer Strategy**: Profiles are staged in `archive/pending/` before archiving to ensure optimal batch sizes
- **Dual Trigger Mechanism**:
  - Archive when batch size reaches configured limit (default 100 profiles)
  - Archive when oldest pending file exceeds timeout (default 24 hours)
- **Automatic Cleanup**: Remove archives older than retention period (configurable)
- **Graceful Degradation**: Falls back to direct deletion if archiving fails
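
The dual trigger described above can be sketched as a single predicate. This is a minimal illustration only, not the code in `ProfileArchiveManager.java`: the class name `ArchiveTriggerSketch`, the method `shouldArchive`, and its parameters are hypothetical, and the choice of a strict "greater than" comparison for the timeout is an assumption.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper for illustration; not part of the PR. */
final class ArchiveTriggerSketch {
    private ArchiveTriggerSketch() {}

    /**
     * Returns true when the pending profiles should be packed into an archive ZIP:
     * either the batch is full, or the oldest pending file has waited longer
     * than the pending timeout.
     */
    static boolean shouldArchive(int pendingCount, Instant oldestPending, Instant now,
                                 int batchSize, Duration pendingTimeout) {
        if (pendingCount <= 0) {
            return false; // nothing staged, nothing to archive
        }
        if (pendingCount >= batchSize) {
            return true; // size trigger (profile_archive_batch_size)
        }
        // Timeout trigger (profile_archive_pending_timeout_seconds).
        Duration waited = Duration.between(oldestPending, now);
        return waited.compareTo(pendingTimeout) > 0;
    }
}
```

With the defaults from this PR, a call would pass `100` as `batchSize` and `Duration.ofSeconds(86400)` as `pendingTimeout`.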
## Implementation Details
### Directory Structure
```
${LOG_DIR}/profile/
├── {timestamp}_{queryid}.zip          # Active spilled profiles
└── archive/                           # Archive root
    ├── pending/                       # Staging area for batching
    │   └── {timestamp}_{queryid}.zip
    └── profiles_20240101_000000_20240101_235959.zip  # Archived batches
```
### Archive File Naming
Archive ZIPs follow the naming pattern:
`profiles_{start_timestamp}_{end_timestamp}.zip`
- `start_timestamp`: Earliest profile in the batch (YYYYMMDD_HHMMSS)
- `end_timestamp`: Latest profile in the batch (YYYYMMDD_HHMMSS)
This enables quick location of profiles by query time.
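
As a rough sketch of the naming pattern above, the archive name can be derived from the earliest and latest profile times in a batch. The class `ArchiveNameSketch` and method `archiveName` are hypothetical names for illustration; the PR description does not say which time zone the timestamps use, so this sketch takes zone-less `LocalDateTime` values.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper for illustration; not part of the PR. */
final class ArchiveNameSketch {
    // YYYYMMDD_HHMMSS, as described for start_timestamp and end_timestamp.
    private static final DateTimeFormatter TS = DateTimeFormatter.ofPattern("yyyyMMdd_HHmmss");

    private ArchiveNameSketch() {}

    /** Builds profiles_{start_timestamp}_{end_timestamp}.zip for one batch. */
    static String archiveName(LocalDateTime earliest, LocalDateTime latest) {
        return "profiles_" + TS.format(earliest) + "_" + TS.format(latest) + ".zip";
    }
}
```

For a batch spanning all of 1 January 2024 this produces the example name shown in the directory structure, `profiles_20240101_000000_20240101_235959.zip`.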
### Workflow
```
Query Profile Creation
↓
Memory Storage (max 500)
↓
Spilled to Disk (when memory full)
↓
Periodic Cleanup (every 1s)
↓
Move to archive/pending/ (when limits exceeded)
↓
Archive to ZIP (batch size reached OR timeout exceeded)
↓
Delete Pending Files
↓
Cleanup Old Archives (every 24h, default retention 7 days)
```
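
The "Archive to ZIP" and "Delete Pending Files" steps could look roughly like the following, using only `java.util.zip` and `java.nio.file` from the JDK. This is a sketch under assumptions, not the PR's implementation: `BatchZipSketch` and `archiveBatch` are hypothetical names, and writing to a temporary file before renaming is a design choice of this example rather than something the description states.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper for illustration; not part of the PR. */
final class BatchZipSketch {
    private BatchZipSketch() {}

    /** Packs the pending files into one archive ZIP, then removes the pending files. */
    static void archiveBatch(List<Path> pendingFiles, Path archiveZip) throws IOException {
        // Write under a temporary name so a failure never leaves a partial
        // archive under the final name.
        Path tmp = archiveZip.resolveSibling(archiveZip.getFileName() + ".tmp");
        try (OutputStream out = Files.newOutputStream(tmp);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Path file : pendingFiles) {
                zip.putNextEntry(new ZipEntry(file.getFileName().toString()));
                Files.copy(file, zip); // stream the pending file into the entry
                zip.closeEntry();
            }
        }
        Files.move(tmp, archiveZip, StandardCopyOption.REPLACE_EXISTING);
        // Delete pending files only after the archive exists under its final name.
        for (Path file : pendingFiles) {
            Files.deleteIfExists(file);
        }
    }
}
```

Deleting the pending files last means a failure part-way through leaves them in place, so a caller can still fall back to direct deletion as described under Graceful Degradation.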
### Code Changes
**New Files:**
- `ProfileArchiveManager.java` (682 lines) - Core archiving logic
**Modified Files:**
- `Config.java` (+35 lines) - Configuration parameters
- `ProfileManager.java` (+157/-17 lines) - Integration with archive system
**Test Files:**
- `ProfileArchiveManagerTest.java` (+1111 lines) - 26 comprehensive test cases
- `ProfileManagerTest.java` (+227 lines) - Integration tests
## Configuration
All parameters have sensible defaults and can be tuned via FE configuration:
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `enable_profile_archive` | boolean | `true` | Enable/disable profile archiving |
| `profile_archive_batch_size` | int | `100` | Number of profiles per ZIP file |
| `profile_archive_path` | String | `""` | Custom archive path (empty = use default `${spilled_profile_storage_path}/archive`) |
| `profile_archive_retention_seconds` | int | `604800` | Archive retention period in seconds (7 days). Set to `-1` for unlimited retention, `0` to disable archiving |
| `profile_archive_pending_timeout_seconds` | int | `86400` | Maximum wait time for pending files in seconds (24 hours). Force archive even if batch is not full |
### Configuration Examples
```properties
# Increase batch size for larger archives (reduces file count)
profile_archive_batch_size = 1000
# Keep archives for 30 days
profile_archive_retention_seconds = 2592000
# Use custom archive path (e.g., mounted network storage)
profile_archive_path = /mnt/nfs/doris-profiles/archive
# Force archive after 12 hours instead of 24
profile_archive_pending_timeout_seconds = 43200
# Disable archiving (keep current behavior)
enable_profile_archive = false
```
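
The special values of `profile_archive_retention_seconds` in the table above (`-1` for unlimited retention, `0` to disable archiving) can be read as the following checks. This is only one interpretation of the table: `RetentionSketch` and its methods are hypothetical, and whether the real code measures an archive's age from file modification time or from the timestamps in its name is not stated here, so the sketch takes the archive time as a parameter.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper for illustration; not part of the PR. */
final class RetentionSketch {
    private RetentionSketch() {}

    /** Archiving runs only if the switch is on and retention is not 0. */
    static boolean archivingEnabled(boolean enableProfileArchive, long retentionSeconds) {
        return enableProfileArchive && retentionSeconds != 0;
    }

    /** An archive expires once it is older than the retention period. */
    static boolean isExpired(Instant archiveTime, Instant now, long retentionSeconds) {
        if (retentionSeconds < 0) {
            return false; // -1: unlimited retention, never expire
        }
        return Duration.between(archiveTime, now).getSeconds() > retentionSeconds;
    }
}
```

Under the default of `604800` seconds, an archive that is eight days old would be removed by the daily cleanup, while one that is six days old would be kept.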
## Usage
### For System Administrators
**Step 1: Locate Slow Query**
```sql
SELECT query_id, time, frontend_ip, query_time
FROM __internal_schema.audit_log
WHERE time >= NOW() - INTERVAL 1 DAY
AND query_time > 10000
ORDER BY query_time DESC;
```
**Step 2: Find Archive File**
```bash
ssh user@<frontend_ip>
cd ${LOG_DIR}/profile/archive
ls -lh profiles_*.zip
```
**Step 3: Extract and Analyze**
```bash
unzip profiles_20240101_120000_20240101_130000.zip -d /tmp/analysis/
ls /tmp/analysis/ | grep <query_id>
vim /tmp/analysis/<timestamp>_<query_id>.profile
```
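
Because the archive name encodes a time range, Step 2 can be narrowed to the right file programmatically instead of by eye. The following is a small sketch, assuming the query time from the audit log is comparable to the timestamps in the archive name (same clock and time zone); `ArchiveLookupSketch` and `covers` are hypothetical names.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the PR. */
final class ArchiveLookupSketch {
    private static final DateTimeFormatter TS = DateTimeFormatter.ofPattern("yyyyMMdd_HHmmss");
    // profiles_{start_timestamp}_{end_timestamp}.zip
    private static final Pattern NAME =
            Pattern.compile("profiles_(\\d{8}_\\d{6})_(\\d{8}_\\d{6})\\.zip");

    private ArchiveLookupSketch() {}

    /** Returns true if the archive's time range includes the given query time. */
    static boolean covers(String archiveFileName, LocalDateTime queryTime) {
        Matcher m = NAME.matcher(archiveFileName);
        if (!m.matches()) {
            return false; // not an archive batch file
        }
        LocalDateTime start = LocalDateTime.parse(m.group(1), TS);
        LocalDateTime end = LocalDateTime.parse(m.group(2), TS);
        return !queryTime.isBefore(start) && !queryTime.isAfter(end);
    }
}
```

Applied to every file name in the archive directory, this filter leaves only the ZIPs whose range contains the slow query's time.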
### Space Management
```bash
# Check archive storage usage
du -sh ${LOG_DIR}/profile/archive
# Manual cleanup (if needed beyond automatic retention)
find ${LOG_DIR}/profile/archive -name "profiles_*.zip" -mtime +90 -delete
```
## Backward Compatibility
- **Fully backward compatible** - existing profile storage continues to work
- **Default enabled** - archives are created automatically
- **Can be disabled** - set `enable_profile_archive = false` to restore old behavior
- **No schema changes** - no database migration required
## Check List (For Author)
- Test <!-- At least one of them must be included. -->
    - [ ] Regression test
    - [X] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason? -->
- Behavior changed:
    - [ ] No.
    - [ ] Yes. <!-- Explain the behavior change -->
- Does this need documentation?
    - [ ] No.
    - [ ] Yes. <!-- Add document PR link here. eg: https://github.com/apache/doris-website/pull/1214 -->
### Check List (For Reviewer who merge this PR)
- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->