somandal commented on issue #3014:
URL: https://github.com/apache/helix/issues/3014#issuecomment-2810128880

   @junkaixue @zpinto More interesting findings based on my last comment. I 
performed the following steps on a CUSTOMIZED resource:
   
   - Forced some partitions into the ERROR state in the EV by throwing an 
exception in the StateTransition callback
   - Updated the IdealState for an ONLINE segment to set it to OFFLINE, to 
trigger a rebalance loop via an IS state change
   - All the ERROR partitions were dropped; they did not come back as 
ERROR in the EV even after waiting a while
   - All the instances these ERROR partitions were assigned to are still up and 
running correctly
   
   Maybe I'm misunderstanding the expected behavior, but why are ERROR segments 
dropped in this scenario, and what does this imply for how we should identify 
and handle such ERROR cases? If we find a partition missing from the EV, does 
this now mean that it might:
   
   1. Be OFFLINE (confirm with IS that OFFLINE is the expected state - treat 
this as a no-action)
   2. Be ERROR (IS is non-OFFLINE)
   3. Not yet be added to the cluster, or be in the middle of processing the 
state transition (IS is non-OFFLINE)
   
   In particular, how can we differentiate between 2 and 3 above?
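
To make the ambiguity concrete, here is a minimal sketch of the check described in the list above. It does not use the Helix API: the IdealState and ExternalView are stood in for by plain `{partition: {instance: state}}` dicts, and every name in it (`MissingReason`, `classify_replica`, `segment_0`, `Server_A`) is hypothetical. The `NOT_IN_IDEAL_STATE` branch is an extra case I added for completeness. With only the IS and EV as input, cases 2 and 3 fall into the same branch.

```python
from enum import Enum


class MissingReason(Enum):
    PRESENT = "present in the ExternalView"
    NOT_IN_IDEAL_STATE = "replica is not assigned in the IdealState"
    EXPECTED_OFFLINE = "IdealState says OFFLINE, no action needed"
    # Cases 2 and 3 from the list: indistinguishable from IS and EV alone.
    ERROR_OR_PENDING = "dropped from ERROR, or transition not finished"


def classify_replica(partition, instance, ideal_state, external_view):
    """Classify one replica using only IS and EV snapshots (plain dicts)."""
    if external_view.get(partition, {}).get(instance) is not None:
        return MissingReason.PRESENT
    target = ideal_state.get(partition, {}).get(instance)
    if target is None:
        return MissingReason.NOT_IN_IDEAL_STATE
    if target == "OFFLINE":
        return MissingReason.EXPECTED_OFFLINE
    return MissingReason.ERROR_OR_PENDING


# Hypothetical snapshots: the EV is empty, so every replica is "missing".
ideal_state = {
    "segment_0": {"Server_A": "ONLINE"},
    "segment_1": {"Server_A": "OFFLINE"},
}
external_view = {}
print(classify_replica("segment_0", "Server_A", ideal_state, external_view))
```

Separating the last branch into two would need a signal from outside the IS and EV, which is exactly what I am asking about.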
   
   Example state transition message received for DROPPING the ERROR segment:
   ```
   2025/04/16 09:22:36.878 ERROR [38_7050 - SegmentOnlineOfflineStateModel] 
[HelixTaskExecutor-message_handle_thread_85] 
SegmentOnlineOfflineStateModel.onBecomeDroppedFromError() : 
ZnRecord=abfa46ea-f2d4-4d96-a2a4-734fc8b8e2b5, {CREATE_TIMESTAMP=1744820556842, 
ClusterEventName=IdealStateChange, EXECUTE_START_TIMESTAMP=1744820556877, 
EXE_SESSION_ID=10052d688ae0018, FROM_STATE=ERROR, 
MSG_ID=abfa46ea-f2d4-4d96-a2a4-734fc8b8e2b5, MSG_STATE=read, 
MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=airlineStats_OFFLINE_16081_16081_0, 
READ_TIMESTAMP=1744820556855, RESOURCE_NAME=airlineStats_OFFLINE, 
RESOURCE_TAG=airlineStats_OFFLINE, RETRY_COUNT=3, SRC_NAME=100.79.216.38_9000, 
SRC_SESSION_ID=10052d688ae0005, STATE_MODEL_DEF=SegmentOnlineOfflineStateModel, 
STATE_MODEL_FACTORY_NAME=DEFAULT, TGT_NAME=Server_100.79.216.38_7050, 
TGT_SESSION_ID=10052d688ae0018, TO_STATE=DROPPED}{}{}, Stat=Stat {_version=0, 
_creationTime=1744820556844, _modifiedTime=1744820556844, _ephemeralOwner=0}
   ```
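
One possible way to spot these drops on the participant side is to scan the server log for ERROR to DROPPED transition messages. The sketch below is only an illustration: it assumes the log line keeps the `ZnRecord=<id>, {KEY=VALUE, ...}` layout shown above and that values contain no commas. `parse_transition_fields` and `is_error_drop` are hypothetical helper names, and `SAMPLE` is an abbreviated copy of the message above with most fields omitted.

```python
import re

# Captures the first {...} group that follows "ZnRecord=<id>, " in the line.
_SIMPLE_FIELDS = re.compile(r"ZnRecord=[^,]*,\s*\{([^}]*)\}")


def parse_transition_fields(log_line):
    """Return the KEY=VALUE pairs of the logged message as a dict."""
    match = _SIMPLE_FIELDS.search(log_line)
    if match is None:
        return {}
    fields = {}
    for pair in match.group(1).split(","):
        key, sep, value = pair.strip().partition("=")
        if sep:  # skip fragments that have no "="
            fields[key] = value
    return fields


def is_error_drop(fields):
    """True if the message is a state transition from ERROR to DROPPED."""
    return (
        fields.get("MSG_TYPE") == "STATE_TRANSITION"
        and fields.get("FROM_STATE") == "ERROR"
        and fields.get("TO_STATE") == "DROPPED"
    )


# Abbreviated copy of the log message above (most fields omitted).
SAMPLE = (
    "SegmentOnlineOfflineStateModel.onBecomeDroppedFromError() : "
    "ZnRecord=abfa46ea-f2d4-4d96-a2a4-734fc8b8e2b5, {FROM_STATE=ERROR, "
    "MSG_TYPE=STATE_TRANSITION, "
    "PARTITION_NAME=airlineStats_OFFLINE_16081_16081_0, "
    "TGT_NAME=Server_100.79.216.38_7050, TO_STATE=DROPPED}{}{}, "
    "Stat=Stat {_version=0}"
)

fields = parse_transition_fields(SAMPLE)
if is_error_drop(fields):
    print(fields["PARTITION_NAME"], "dropped from ERROR on", fields["TGT_NAME"])
```

This only helps after the fact and depends on log retention, so it is a workaround rather than an answer to the question above.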
   
   cc @Jackie-Jiang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

