churromorales opened a new pull request, #15620:
URL: https://github.com/apache/lucene/pull/15620

   
   ### Description
   This PR introduces TemporalMergePolicy, a new merge policy designed for time-series workloads where documents contain a timestamp field. The policy groups segments into time windows and merges segments within the same window, but never merges segments across different time windows. This preserves temporal locality and improves query performance for time-range queries. Relates to #15412.
   
   ### How it works
   Time Bucketing

     - Segments are assigned to time windows based on their maximum timestamp:
     - Exponential bucketing (default): Recent data uses small windows (e.g., 1 hour), older data uses progressively larger windows (4 hours, 16 hours, etc.)
     - Fixed bucketing: All time windows have the same size
     - Old data bucket: Segments older than maxAgeSeconds are placed in a special bucket and not merged
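
One way to picture the exponential scheme is sketched below. It is only an illustration: the growth factor of 4 is inferred from the 1 hour, 4 hours, 16 hours example, the rule "use the smallest window larger than the segment's age" is an assumption, and every class, method, and parameter name other than maxAgeSeconds is hypothetical rather than taken from the PR.

```java
/** Hypothetical sketch of exponential time bucketing; not the PR's actual code. */
final class ExponentialBucketingSketch {
  /** Marker returned for segments that fall into the "old data" bucket. */
  static final long OLD_DATA = -1L;

  /**
   * Returns the window size in seconds for a segment of the given age,
   * or OLD_DATA if the segment is older than maxAgeSeconds.
   * Assumption: windows grow as base, base*4, base*16, ... and a segment
   * uses the smallest window that is larger than its (non-negative) age.
   */
  static long windowSizeSeconds(long ageSeconds, long baseWindowSeconds, long maxAgeSeconds) {
    if (ageSeconds > maxAgeSeconds) {
      return OLD_DATA;
    }
    long window = baseWindowSeconds;
    while (window <= ageSeconds) {
      window *= 4; // growth factor 4 matches the 1h, 4h, 16h example
    }
    return window;
  }

  /** Index of the window that contains the segment's maximum timestamp. */
  static long windowIndex(long maxTimestampSeconds, long windowSeconds) {
    return Math.floorDiv(maxTimestampSeconds, windowSeconds);
  }
}
```

With a 1-hour base window, a segment whose newest document is 10 hours old would land in a 16-hour window under these assumptions. Fixed bucketing corresponds to skipping the loop and always using the base window.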
   
   ### Merge Triggers
   Merges are triggered when a time window meets two conditions:

     1. Contains at least minThreshold segments (default: 4)
     2. Total document count exceeds largestSegment * compactionRatio (default: 1.2)
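
The two conditions can be sketched as a single predicate. This is a minimal illustration, assuming that largestSegment refers to the document count of the largest segment in the window; the class and method names are hypothetical and do not come from the PR.

```java
import java.util.List;

/** Hypothetical sketch of the merge trigger for one time window. */
final class MergeTriggerSketch {
  /**
   * Returns true when the window has at least minThreshold segments and the
   * total document count exceeds largestSegment * compactionRatio.
   * Assumption: largestSegment is the doc count of the biggest segment.
   */
  static boolean shouldMerge(List<Integer> segmentDocCounts, int minThreshold, double compactionRatio) {
    if (segmentDocCounts.size() < minThreshold) {
      return false;
    }
    long total = 0;
    long largest = 0;
    for (int docCount : segmentDocCounts) {
      total += docCount;
      largest = Math.max(largest, docCount);
    }
    return total > largest * compactionRatio;
  }
}
```

Under this reading, four segments of 100 documents each qualify (400 > 120), while one segment of 1000 documents plus three of 10 does not (1030 is not greater than 1200), so a window dominated by one large segment is left alone.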
   
   ### Key Constraints

     - Never merge across time windows: Even forceMerge(1) respects bucket boundaries
     - Old data protection: Very old segments (configurable via maxAgeSeconds) are excluded from merging
     - Concurrency safety: Properly checks MergeContext.getMergingSegments() to avoid "segment already merging" errors
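
The concurrency constraint amounts to filtering merge candidates against the set of segments that are already being merged. The sketch below uses plain generics instead of Lucene types so that it runs on its own; in the real policy the set would presumably come from MergeContext.getMergingSegments(), and the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: drop candidates that already take part in a merge. */
final class MergingFilterSketch {
  /**
   * Returns the candidates that are not in the merging set, preserving order.
   * T stands in for Lucene's segment type; this is an assumption for illustration.
   */
  static <T> List<T> eligible(List<T> candidates, Set<T> merging) {
    List<T> result = new ArrayList<>();
    for (T segment : candidates) {
      if (!merging.contains(segment)) {
        result.add(segment);
      }
    }
    return result;
  }
}
```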
   
   ### Handling Late-Arriving and Out-of-Order Data

   Time-series data rarely arrives perfectly in order. TemporalMergePolicy handles various timing scenarios:
      
   #### Late-Arriving Data

   When data with older timestamps arrives after newer data has been indexed:

     - Each segment is assigned to a time window based on its **maximum timestamp**
     - A segment containing mostly recent data with a few old records will be placed in the recent bucket
     - A segment containing only old data will be placed in the appropriate older bucket
     - Segments with mixed timestamps (spanning multiple windows) are assigned based on their max timestamp

   Example:

         Segment A: timestamps [2024-01-01 to 2024-01-02] → Jan 2024 bucket
         Segment B: timestamps [2024-02-01 to 2024-02-02] → Feb 2024 bucket
         Segment C: timestamps [2024-01-15 to 2024-01-16] → Jan 2024 bucket (late arrival)

   Result: Segments A and C can merge together (same bucket), but never with B
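
The example can be reproduced with a small grouping routine. This is a sketch only: calendar-month buckets are used because the example labels its buckets by month, whereas the policy's real windows are sized in hours, and all names below are hypothetical.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: bucket segments by the month of their maximum timestamp. */
final class LateArrivalSketch {
  /** Groups segment names by the calendar month of their maximum timestamp. */
  static Map<YearMonth, List<String>> groupByBucket(Map<String, LocalDate> maxTimestampBySegment) {
    Map<YearMonth, List<String>> buckets = new TreeMap<>();
    for (Map.Entry<String, LocalDate> entry : maxTimestampBySegment.entrySet()) {
      YearMonth bucket = YearMonth.from(entry.getValue());
      buckets.computeIfAbsent(bucket, k -> new ArrayList<>()).add(entry.getKey());
    }
    return buckets;
  }

  public static void main(String[] args) {
    // Segments in arrival order; only the maximum timestamp matters for bucketing.
    Map<String, LocalDate> segments = new LinkedHashMap<>();
    segments.put("A", LocalDate.of(2024, 1, 2));
    segments.put("B", LocalDate.of(2024, 2, 2));
    segments.put("C", LocalDate.of(2024, 1, 16)); // late arrival
    System.out.println(groupByBucket(segments)); // prints {2024-01=[A, C], 2024-02=[B]}
  }
}
```

Merge candidates would then be chosen within each list separately, which is why A and C can be combined while B stays on its own.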
      
   #### Future Data

   Data with timestamps in the future (beyond current time):

     - Treated as age = 0 (most recent)
     - Placed in the smallest (most recent) time window
     - Prevents errors from clock skew or timestamp bugs
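
Treating future timestamps as age 0 can be expressed as a clamp when the age is computed. The helper below is a minimal sketch with hypothetical names; it assumes timestamps in epoch seconds, which the PR description does not specify.

```java
/** Hypothetical sketch: compute a segment's age, clamping future timestamps to 0. */
final class FutureTimestampSketch {
  /**
   * Age of a segment in seconds, based on its maximum timestamp.
   * A maximum timestamp later than "now" yields 0 rather than a negative age.
   */
  static long ageSeconds(long maxTimestampSeconds, long nowSeconds) {
    return Math.max(0L, nowSeconds - maxTimestampSeconds);
  }
}
```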
      
   #### Out-of-Order Writes Within a Segment

   If a single segment contains documents spanning multiple time windows:

     - The segment is bucketed by its **max timestamp only**
     - This avoids pathological cases where a single document with a far-future timestamp would block merging
     - Trade-off: Some temporal mixing can occur within individual segments before merging
   
   I have never committed to Lucene before, so I might be doing the logging wrong. I added some logging to help others understand how the merging works, but I am happy to follow whatever guidelines you have for the project.
   

