rohityadav1993 opened a new pull request, #17754:
URL: https://github.com/apache/pinot/pull/17754

   - `feature`
   - `release-notes`
   - `bugfix`
   
   ## Description
   
   This change introduces automated repair for partially offline replicas in 
realtime segment consumption. It addresses scenarios (issue #11314) where 
some replicas fail during initialization (e.g., KafkaConsumer errors) and mark 
themselves OFFLINE while other replicas continue consuming normally.
   
   ### Changes
   
   - Added new configuration flag 
`controller.realtime.segment.partialOfflineReplicaRepairEnabled` (defaults to 
`false`)
   - Enhanced `PinotLLCRealtimeSegmentManager` validation to detect and repair 
mixed CONSUMING/OFFLINE replica states
   - When enabled, the controller automatically resets OFFLINE replicas back to 
CONSUMING state for IN_PROGRESS segments, allowing them to retry consumption
   
   ### Implementation Details
   
   **Configuration:**
   - New property: 
`controller.realtime.segment.partialOfflineReplicaRepairEnabled` in 
`ControllerConf`
   - Default: `false` (opt-in for backward compatibility)
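
A minimal sketch of the opt-in default described above, using plain `java.util.Properties`. The class `RepairFlagSketch` and the method `isRepairEnabled` are hypothetical names for illustration only; they are not the accessor this PR adds to `ControllerConf`.

```java
import java.util.Properties;

// Hypothetical helper, not part of Pinot: shows a boolean flag that stays
// disabled unless the property is explicitly set to "true".
class RepairFlagSketch {
  static final String KEY =
      "controller.realtime.segment.partialOfflineReplicaRepairEnabled";

  static boolean isRepairEnabled(Properties props) {
    // A missing key falls back to "false", so existing deployments are unaffected.
    return Boolean.parseBoolean(props.getProperty(KEY, "false"));
  }
}
```

With this shape, a controller that never sets the property keeps its current behavior.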
   
   **Repair Logic:**
   - Detects segments with mixed CONSUMING/OFFLINE replica states during 
validation
   - Logs repair actions with details (segment name, offline count, instance 
list)
   - Resets identified OFFLINE replicas to CONSUMING state
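
The repair steps above can be sketched roughly as follows. This is an illustration, not the code in `PinotLLCRealtimeSegmentManager`: the class `PartialOfflineRepairSketch`, its methods, and the plain `Map` of instance name to state are all assumptions. It also assumes the caller has already confirmed that the segment is IN_PROGRESS, and it leaves a segment untouched when no replica is CONSUMING.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the mixed CONSUMING/OFFLINE repair; names are illustrative.
class PartialOfflineRepairSketch {
  static final String CONSUMING = "CONSUMING";
  static final String OFFLINE = "OFFLINE";

  // Returns the OFFLINE instances only if at least one other replica is CONSUMING.
  static List<String> findOfflineReplicas(Map<String, String> instanceStates) {
    List<String> offline = new ArrayList<>();
    boolean anyConsuming = false;
    for (Map.Entry<String, String> entry : instanceStates.entrySet()) {
      if (OFFLINE.equals(entry.getValue())) {
        offline.add(entry.getKey());
      } else if (CONSUMING.equals(entry.getValue())) {
        anyConsuming = true;
      }
    }
    return anyConsuming ? offline : new ArrayList<>();
  }

  // Returns a copy of the state map with partially offline replicas reset to CONSUMING.
  static Map<String, String> repair(String segmentName,
      Map<String, String> instanceStates, boolean repairEnabled) {
    Map<String, String> result = new TreeMap<>(instanceStates);
    if (!repairEnabled) {
      return result; // feature disabled: no-op
    }
    List<String> offline = findOfflineReplicas(instanceStates);
    if (offline.isEmpty()) {
      return result; // not a mixed CONSUMING/OFFLINE segment
    }
    // Log the repair with segment name, offline count, and instance list.
    System.out.printf("Repairing segment %s: %d OFFLINE replica(s) %s%n",
        segmentName, offline.size(), offline);
    for (String instance : offline) {
      result.put(instance, CONSUMING);
    }
    return result;
  }
}
```

Returning a copy rather than mutating the input keeps the disabled path a strict no-op, which mirrors the two unit-test scenarios listed under Testing.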
   
   **Testing:**
   Unit Tests:
   - Added unit tests for enabled scenario (verifies OFFLINE→CONSUMING 
transition)
   - Added unit tests for disabled scenario (verifies no-op behavior)
   
   Local cluster test plan:
   - Set up a realtime table (Kafka) with two servers and two replicas, with 
partialOfflineReplicaRepairEnabled = true
   - Mangle the DNS config on server-1: 
`echo "nameserver 0.0.0.0" > /etc/resolv.conf`
   - Force commit; the new consuming segment on server-1 comes up in an error 
state and moves to OFFLINE, while server-2 is in CONSUMING state
   - Run the controller validation job: RealtimeSegmentValidationManager
   - The replica on server-1 becomes healthy again in CONSUMING state.
   
   
   ## Upgrade Notes
   
   This feature is disabled by default. To enable, set:
   `controller.realtime.segment.partialOfflineReplicaRepairEnabled=true`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

