lintingbin commented on issue #4055:
URL: https://github.com/apache/amoro/issues/4055#issuecomment-3797707009

   Thanks @vaquarkhan for the detailed analysis! You've correctly identified 
the "noisy neighbor" problem where active partitions prevent quiet ones from 
being optimized.
   
   However, I think the per-partition tracking approach might be 
**overengineered** for this specific issue. Here are my concerns:
   
   ## Issues with Per-Partition Tracking:
   
   1. **State management overhead**: For tables with many historical partitions 
(e.g., daily partitions over years), we'd need to maintain a large map. Most of 
these entries would be used only once or very rarely, which is wasteful.
   
   2. **Memory/storage overhead**: The `Map<String, Long>` would grow 
indefinitely unless we implement cleanup logic, adding more complexity.
   
   ## Alternative Lightweight Solution:
   
   I propose a **table-level approach** with a simple tweak to the original 
issue's suggestion:
   
   ```java
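   protected boolean reachMinorInterval() {
       // A negative interval disables interval-based minor optimizing.
       if (config.getMinorLeastInterval() < 0) {
           return false;
       }

       // Time elapsed since the last minor optimizing at table level.
       long interval = planTime - lastMinorOptimizingTime;

       if (interval > config.getMinorLeastInterval()) {
           return true;
       }

       // Ensure minor optimization runs at least once per day
       // (isDifferentDay is a new helper that would need to be added).
       return isDifferentDay(lastMinorOptimizingTime, planTime);
   }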
   ```
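
   The snippet above relies on an `isDifferentDay()` helper that is not shown. A minimal sketch of one possible implementation follows, assuming both timestamps are epoch milliseconds and that the day boundary is taken in an explicitly passed time zone. The class name `DayBoundary` and the extra `zone` parameter are hypothetical and not part of the Amoro code base.

   ```java
   import java.time.Instant;
   import java.time.LocalDate;
   import java.time.ZoneId;

   public class DayBoundary {

       // Returns true when the two epoch-millisecond timestamps fall on
       // different calendar days in the given zone (assumed inputs, see lead-in).
       static boolean isDifferentDay(long earlierMillis, long laterMillis, ZoneId zone) {
           LocalDate earlier = Instant.ofEpochMilli(earlierMillis).atZone(zone).toLocalDate();
           LocalDate later = Instant.ofEpochMilli(laterMillis).atZone(zone).toLocalDate();
           return !earlier.equals(later);
       }

       public static void main(String[] args) {
           ZoneId utc = ZoneId.of("UTC");
           long beforeMidnight = Instant.parse("2024-01-01T23:59:00Z").toEpochMilli();
           long afterMidnight = Instant.parse("2024-01-02T00:01:00Z").toEpochMilli();
           // Two minutes apart, but on different calendar days in UTC.
           System.out.println(isDifferentDay(beforeMidnight, afterMidnight, utc));
       }
   }
   ```

   Which zone to use is a design choice: the server's default zone keeps the day boundary aligned with operators' expectations, while UTC keeps behavior identical across deployments.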
   
   **Key advantages:**
   - ✅ Simple: No new state storage required
   - ✅ Effective: Ensures quiet partitions get optimized at least once daily
   - ✅ Backward compatible: No schema changes needed
   - ✅ Low overhead: Minimal logic change
   
   **How it solves the problem:**
   Even if active partitions constantly reset `lastMinorOptimizingTime`, the 
first plan after a day boundary still sees a timestamp from the previous day, 
so `isDifferentDay()` returns true. This ensures that **at least once per 
day**, partitions with just 2-3 small files have a chance to be optimized.
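
   To illustrate the claim, here is a small standalone simulation, not Amoro code. It assumes a plan runs every 10 minutes, that an active partition resets the last-optimizing timestamp after every plan, and that the day check compares UTC epoch-day numbers. The class `NoisyNeighborSim`, its method signatures, and the timing values are all hypothetical.

   ```java
   public class NoisyNeighborSim {
       static final long MINUTE = 60_000L;
       static final long DAY = 24 * 60 * MINUTE;

       // Hypothetical stand-in for isDifferentDay(): compares UTC epoch-day numbers.
       static boolean isDifferentDay(long a, long b) {
           return Math.floorDiv(a, DAY) != Math.floorDiv(b, DAY);
       }

       // Same logic as the proposed check, with state passed in as parameters.
       static boolean reachMinorInterval(long planTime, long last, long minInterval) {
           if (minInterval < 0) {
               return false;
           }
           if (planTime - last > minInterval) {
               return true;
           }
           return isDifferentDay(last, planTime);
       }

       // Counts how many plans pass the check when the timestamp is reset after every plan.
       static int countTriggers(long start, int plans, long stepMs, long minInterval) {
           long last = start;
           int triggers = 0;
           for (int i = 1; i <= plans; i++) {
               long planTime = start + i * stepMs;
               if (reachMinorInterval(planTime, last, minInterval)) {
                   triggers++;
               }
               last = planTime; // an active partition was optimized, resetting the timestamp
           }
           return triggers;
       }

       public static void main(String[] args) {
           // 432 plans at 10-minute steps cover 3 days; the interval is 1 hour.
           System.out.println(countTriggers(0L, 432, 10 * MINUTE, 60 * MINUTE)); // prints 3
       }
   }
   ```

   In this setup the 1-hour interval is never exceeded, so every trigger comes from the day check: one per midnight crossed.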
   
   For most use cases, daily optimization of quiet partitions should be 
sufficient. If needed, the interval can still be configured via 
`minorLeastInterval` for more frequent optimizations.
   
   What do you think? @lintingbin, would this simpler approach work for your 
use case?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
