churromorales opened a new pull request, #15620: URL: https://github.com/apache/lucene/pull/15620
### Description
This PR introduces TemporalMergePolicy, a new merge policy designed for
time-series workloads where documents contain a timestamp field. The policy
groups segments into time windows and merges segments within the same window,
but never merges segments across different time windows. This preserves
temporal locality and improves query performance for time-range queries.
Relates to #15412.
### How it works
#### Time Bucketing
Segments are assigned to time windows based on their maximum timestamp:
- Exponential bucketing (default): Recent data uses small windows (e.g., 1
hour), older data uses progressively larger windows (4 hours, 16 hours, etc.)
- Fixed bucketing: All time windows have the same size
- Old data bucket: Segments older than maxAgeSeconds are placed in a
special bucket and not merged
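
To make the exponential scheme concrete, here is a minimal sketch of how a window size could be derived from a segment's age. It is not the code in this PR: the class and method names, the 1-hour base window, and the 4x growth factor are assumptions taken from the "1 hour, 4 hours, 16 hours" example above, and the real policy may tier windows differently.

```java
import java.time.Duration;

/** Hypothetical sketch of exponential time bucketing; not the PR's actual code. */
final class ExponentialBuckets {
  // Assumed values: a 1-hour base window that grows 4x per tier (1h, 4h, 16h, ...).
  static final long BASE_WINDOW_SECONDS = Duration.ofHours(1).getSeconds();
  static final long GROWTH_FACTOR = 4;

  /** Returns the window size for a segment of the given age, or -1 for the old data bucket. */
  static long windowSizeSeconds(long ageSeconds, long maxAgeSeconds) {
    if (ageSeconds > maxAgeSeconds) {
      return -1; // old data bucket: excluded from merging
    }
    long window = BASE_WINDOW_SECONDS;
    // Move to the next tier while the segment is at least as old as that tier's window size.
    while (ageSeconds >= window * GROWTH_FACTOR) {
      window *= GROWTH_FACTOR;
    }
    return window;
  }

  /** Returns the start of the window that contains the segment's maximum timestamp. */
  static long windowStartSeconds(long maxTimestampSeconds, long windowSizeSeconds) {
    return Math.floorDiv(maxTimestampSeconds, windowSizeSeconds) * windowSizeSeconds;
  }
}
```

Under these assumptions a 30-minute-old segment falls in a 1-hour window, a 5-hour-old segment in a 4-hour window, and a 20-hour-old segment in a 16-hour window. Fixed bucketing would correspond to skipping the loop and always returning the base window size.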
### Merge Triggers
Merges are triggered when a time window meets two conditions:
1. Contains at least minThreshold segments (default: 4)
2. Total document count exceeds largestSegment * compactionRatio (default:
1.2)
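
The two conditions can be expressed as a single predicate. The sketch below follows the description above literally; the class name, method name, and the use of a plain array of per-segment document counts are illustrative assumptions, not the PR's API.

```java
/** Hypothetical sketch of the merge trigger described above; not the PR's actual code. */
final class MergeTrigger {
  /**
   * Returns true when a window holds at least minThreshold segments and its total
   * document count exceeds largestSegment * compactionRatio.
   */
  static boolean shouldMerge(long[] segmentDocCounts, int minThreshold, double compactionRatio) {
    if (segmentDocCounts.length < minThreshold) {
      return false; // condition 1 not met: too few segments in the window
    }
    long total = 0;
    long largest = 0;
    for (long count : segmentDocCounts) {
      total += count;
      largest = Math.max(largest, count);
    }
    // Condition 2: the smaller segments must add enough documents to justify a rewrite.
    return total > largest * compactionRatio;
  }
}
```

With the defaults (4 and 1.2), four segments of 100 documents each qualify, while one 1000-document segment plus three 10-document segments do not, because 1030 does not exceed 1200.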
### Key Constraints
- Never merge across time windows: Even forceMerge(1) respects bucket
boundaries
- Old data protection: Very old segments (configurable via maxAgeSeconds)
are excluded from merging
- Concurrency safety: Properly checks MergeContext.getMergingSegments() to
avoid "segment already merging" errors
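
As a rough illustration of the last two constraints, the sketch below filters merge candidates. It deliberately avoids Lucene types so that it stands alone: segment names stand in for `SegmentCommitInfo`, and the `mergingNames` set stands in for the result of `MergeContext.getMergingSegments()`. All class, field, and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of candidate filtering; not the PR's actual code. */
final class MergeEligibility {
  /** Minimal stand-in for a segment: a name and the maximum timestamp of its documents. */
  static final class Seg {
    final String name;
    final long maxTimestampSeconds;

    Seg(String name, long maxTimestampSeconds) {
      this.name = name;
      this.maxTimestampSeconds = maxTimestampSeconds;
    }
  }

  /** Returns the names of segments that are neither already merging nor older than maxAgeSeconds. */
  static List<String> eligible(
      List<Seg> segments, Set<String> mergingNames, long nowSeconds, long maxAgeSeconds) {
    List<String> result = new ArrayList<>();
    for (Seg seg : segments) {
      if (mergingNames.contains(seg.name)) {
        continue; // already part of a running merge
      }
      long age = Math.max(0L, nowSeconds - seg.maxTimestampSeconds);
      if (age > maxAgeSeconds) {
        continue; // old data: excluded from merging
      }
      result.add(seg.name);
    }
    return result;
  }
}
```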
### Handling Late-Arriving and Out-of-Order Data
Time-series data rarely arrives perfectly in order. TemporalMergePolicy
handles various timing scenarios:
#### Late-Arriving Data
When data with older timestamps arrives after newer data has been indexed:
- Each segment is assigned to a time window based on its **maximum
timestamp**
- A segment containing mostly recent data with a few old records will be
placed in the recent bucket
- A segment containing only old data will be placed in the appropriate
older bucket
- Segments with mixed timestamps (spanning multiple windows) are assigned
based on their max timestamp
Example:
Segment A: timestamps [2024-01-01 to 2024-01-02] → Jan 2024 bucket
Segment B: timestamps [2024-02-01 to 2024-02-02] → Feb 2024 bucket
Segment C: timestamps [2024-01-15 to 2024-01-16] → Jan 2024 bucket (late arrival)
Result: Segments A and C can merge together (same bucket), but never with B
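
The example can be reproduced with a small grouping sketch. Monthly buckets are used here only because the example labels its buckets by month; the class and method names are hypothetical, and the actual policy works with its own configured window sizes.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of bucketing by maximum timestamp; not the PR's actual code. */
final class MaxTimestampBucketing {
  /** Minimal stand-in for a segment with the date range of its documents. */
  static final class Segment {
    final String name;
    final LocalDate minDate;
    final LocalDate maxDate;

    Segment(String name, LocalDate minDate, LocalDate maxDate) {
      this.name = name;
      this.minDate = minDate;
      this.maxDate = maxDate;
    }
  }

  /** Groups segment names by the month of their maximum date; minDate is ignored. */
  static Map<YearMonth, List<String>> groupByMonth(List<Segment> segments) {
    Map<YearMonth, List<String>> buckets = new TreeMap<>();
    for (Segment segment : segments) {
      YearMonth bucket = YearMonth.from(segment.maxDate);
      buckets.computeIfAbsent(bucket, key -> new ArrayList<>()).add(segment.name);
    }
    return buckets;
  }
}
```

Feeding in segments A, B, and C from the example, in that order, puts A and C together in the 2024-01 bucket and B alone in 2024-02, so only A and C are merge candidates for each other.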
#### Future Data
Data with timestamps in the future (beyond current time):
- Treated as age = 0 (most recent)
- Placed in the smallest (most recent) time window
- Prevents errors from clock skew or timestamp bugs
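
A minimal sketch of the clamping behavior described above, assuming ages are computed in seconds from the segment's maximum timestamp (the class and method names are hypothetical):

```java
/** Hypothetical sketch of future-timestamp handling; not the PR's actual code. */
final class SegmentAge {
  /** Returns the segment's age in seconds, treating future timestamps as age 0. */
  static long ageSeconds(long nowSeconds, long maxTimestampSeconds) {
    // A negative difference means the timestamp is ahead of the clock; clamp it to 0
    // so the segment lands in the most recent window instead of causing an error.
    return Math.max(0L, nowSeconds - maxTimestampSeconds);
  }
}
```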
#### Out-of-Order Writes Within a Segment
If a single segment contains documents spanning multiple time windows:
- The segment is bucketed by its **max timestamp only**
- This prevents pathological cases where a single document with a
far-future timestamp would prevent merging
- Trade-off: Some temporal mixing can occur within individual segments
before merging
I have not contributed to Lucene before, so I may be handling the logging
incorrectly. I added some logging to help others understand how the merging
works, but I'm happy to follow whatever guidelines the project has.
