somandal commented on issue #3014:
URL: https://github.com/apache/helix/issues/3014#issuecomment-2810128880
@junkaixue @zpinto More interesting findings based on my last comment. I
performed the following steps on a CUSTOMIZED resource:
- Forced some partitions into the ERROR state in the EV by throwing an
exception in the StateTransition callback
- Updated the IdealState for an ONLINE segment, setting it to OFFLINE, to
trigger a rebalance via an IS change
- All the ERROR partitions were dropped, and they did not come back as
ERROR in the EV even after waiting a while
- All the instances these ERROR partitions were assigned to are still up and
running correctly
Maybe I'm misunderstanding the expected behavior, but why are ERROR segments
dropped in this scenario, and what are the implications for how we should
identify and handle such ERROR cases? If we find a partition missing in the EV,
does this now mean that it might:
1. Be OFFLINE (confirm with IS that OFFLINE is the expected state - treat
this as a no-action)
2. Be ERROR (IS is non-OFFLINE)
3. Not yet be added to the cluster / in the middle of processing the state
transition (IS is non-OFFLINE)
In particular, how can we differentiate between 2 and 3 above?
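To make the three cases concrete, here is a minimal sketch of the triage we would have to do. It is only an illustration and does not use any Helix API: the class, enum, and method names are hypothetical, and the two maps are assumed to hold the per-partition state for a single instance, copied out of the IS and the EV. It shows the problem directly: with only the IS and EV as inputs, cases 2 and 3 land in the same bucket.

```java
import java.util.Map;

// Hypothetical helper, not part of Helix or Pinot.
class MissingPartitionTriage {

    // Labels for what we can conclude about one partition on one instance.
    enum Diagnosis {
        PRESENT_IN_EV,          // EV has an entry, nothing to infer
        NOT_IN_IDEAL_STATE,     // partition is not assigned in the IS at all
        EXPECTED_OFFLINE,       // case 1: IS says OFFLINE, treat as no-action
        ERROR_OR_IN_TRANSITION  // cases 2 and 3: IS is non-OFFLINE, EV is empty
    }

    // idealStates / externalViewStates: partition name -> state for ONE instance
    // (assumed to be extracted from the IdealState and ExternalView beforehand).
    static Diagnosis diagnose(String partition,
                              Map<String, String> idealStates,
                              Map<String, String> externalViewStates) {
        if (externalViewStates.containsKey(partition)) {
            return Diagnosis.PRESENT_IN_EV;
        }
        String target = idealStates.get(partition);
        if (target == null) {
            return Diagnosis.NOT_IN_IDEAL_STATE;
        }
        if ("OFFLINE".equals(target)) {
            return Diagnosis.EXPECTED_OFFLINE;
        }
        // Missing from the EV while the IS expects a non-OFFLINE state:
        // the IS and EV alone cannot tell a dropped ERROR partition apart
        // from one whose state transition has not completed yet.
        return Diagnosis.ERROR_OR_IN_TRANSITION;
    }

    public static void main(String[] args) {
        Map<String, String> is = Map.of("seg_0", "ONLINE", "seg_1", "OFFLINE");
        Map<String, String> ev = Map.of(); // both partitions missing from the EV
        System.out.println("seg_0: " + diagnose("seg_0", is, ev));
        System.out.println("seg_1: " + diagnose("seg_1", is, ev));
    }
}
```

Any way of splitting `ERROR_OR_IN_TRANSITION` further would need a signal beyond the IS and EV, which is what I am asking about.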
Example state transition message received for DROPPING the ERROR segment:
```
2025/04/16 09:22:36.878 ERROR [38_7050 - SegmentOnlineOfflineStateModel]
[HelixTaskExecutor-message_handle_thread_85]
SegmentOnlineOfflineStateModel.onBecomeDroppedFromError() :
ZnRecord=abfa46ea-f2d4-4d96-a2a4-734fc8b8e2b5, {CREATE_TIMESTAMP=1744820556842,
ClusterEventName=IdealStateChange, EXECUTE_START_TIMESTAMP=1744820556877,
EXE_SESSION_ID=10052d688ae0018, FROM_STATE=ERROR,
MSG_ID=abfa46ea-f2d4-4d96-a2a4-734fc8b8e2b5, MSG_STATE=read,
MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=airlineStats_OFFLINE_16081_16081_0,
READ_TIMESTAMP=1744820556855, RESOURCE_NAME=airlineStats_OFFLINE,
RESOURCE_TAG=airlineStats_OFFLINE, RETRY_COUNT=3, SRC_NAME=100.79.216.38_9000,
SRC_SESSION_ID=10052d688ae0005, STATE_MODEL_DEF=SegmentOnlineOfflineStateModel,
STATE_MODEL_FACTORY_NAME=DEFAULT, TGT_NAME=Server_100.79.216.38_7050,
TGT_SESSION_ID=10052d688ae0018, TO_STATE=DROPPED}{}{}, Stat=Stat {_version=0,
_creationTime=1744820556844, _modifiedTime=1744820556844, _ephemeralOwner=0}
```
cc @Jackie-Jiang