ankitsultana opened a new issue, #10342:
URL: https://github.com/apache/pinot/issues/10342
I have seen this behavior quite often in our systems: a segment goes into ERROR state on one of the servers, and a reload doesn't fix the issue. However, if we restart the server, the segment becomes healthy again. When checking the logs, I often see something like this:
```
INFO org.apache.helix.messaging.handling.HelixTaskExecutor - Scheduling
message db272de0-949c-4d9f-87dd-f4f912f19a38:
my_table_REALTIME:my_table__79__0__20230224T2228Z, null->null
2023-02-26 07:04:32.598 [HelixTaskExecutor-message_handle_thread] INFO
my_table_REALTIME-SegmentReloadMessageHandler - Handling message:
ZnRecord=db272de0-949c-4d9f-87dd-f4f912f19a38, {CREATE_TIMESTAMP=1677395071480,
EXECUTE_START_TIMESTAMP=1677395072598,
MSG_ID=db272de0-949c-4d9f-87dd-f4f912f19a38, MSG_STATE=new,
MSG_SUBTYPE=RELOAD_SEGMENT, MSG_TYPE=USER_DEFINE_MSG,
PARTITION_NAME=my_table__79__0__20230224T2228Z,
RESOURCE_NAME=my_table_REALTIME, RETRY_COUNT=0, SRC_CLUSTER=...,
SRC_INSTANCE_TYPE=PARTICIPANT, SRC_NAME=Controller_.., TGT_NAME=61f..,
TGT_SESSION_ID=.., TIMEOUT=-1, forceDownload=false}{}{}, Stat=Stat {_version=0,
_creationTime=1677395072613, _modifiedTime=1677395072613, _ephemeralOwner=0}
2023-02-26 07:04:32.598 [HelixTaskExecutor-message_handle_thread] INFO
my_table_REALTIME-SegmentReloadMessageHandler - Waiting for lock to refresh :
my_table__79__0__20230224T2228Z, queue-length: 0
2023-02-26 07:04:32.598 [HelixTaskExecutor-message_handle_thread] INFO
my_table_REALTIME-SegmentReloadMessageHandler - Acquired lock to refresh
segment: my_table__79__0__20230224T2228Z (lock-time=0ms, queue-length=0)
2023-02-26 07:04:32.598 [HelixTaskExecutor-message_handle_thread] INFO
o.apache.pinot.server.starter.helix.HelixInstanceDataManager - Reloading
single segment: my_table__79__0__20230224T2228Z in table: my_table_REALTIME
2023-02-26 07:04:32.598 [HelixTaskExecutor-message_handle_thread] INFO
o.apache.pinot.server.starter.helix.HelixInstanceDataManager - Segment
metadata is null. Skip reloading segment: my_table__79__0__20230224T2228Z in
table: my_table_REALTIME
2023-02-26 07:04:32.598 [HelixTaskExecutor-message_handle_thread] INFO
org.apache.helix.messaging.handling.HelixTask - Message
db272de0-949c-4d9f-87dd-f4f912f19a38 completed.
2023-02-26 07:04:32.600 [HelixTaskExecutor-message_handle_thread] INFO
org.apache.helix.messaging.handling.HelixTask - Delete message
db272de0-949c-4d9f-87dd-f4f912f19a38 from zk!
2023-02-26 07:04:32.600 [HelixTaskExecutor-message_handle_thread] INFO
org.apache.helix.messaging.handling.HelixTaskExecutor - message finished:
db272de0-949c-4d9f-87dd-f4f912f19a38, took 2
```
This is the corresponding code:
https://github.com/apache/pinot/blob/3772b55dc4c35673762a182b2ee650469560aa97/pinot-server/src/main/java/org/apache/pinot/server/starter/helix/HelixInstanceDataManager.java#L277
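For context, here is a minimal, self-contained paraphrase of what the log suggests that code path does; the class and helper names below are mine, not the actual Pinot source. The point is that when the local segment metadata can't be found, the reload is skipped, the Helix message still completes, and the segment stays in ERROR:
```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical, simplified sketch of the reload path around the linked line.
// Names and the metadata lookup are assumptions made for illustration only.
public class ReloadSketch {
  private static final Logger LOGGER = LoggerFactory.getLogger(ReloadSketch.class);

  // Stand-in for the segment metadata the data manager reads from the local segment directory.
  static final class SegmentMetadata {
  }

  void reloadSegment(String tableNameWithType, String segmentName) {
    LOGGER.info("Reloading single segment: {} in table: {}", segmentName, tableNameWithType);
    SegmentMetadata segmentMetadata = getLocalSegmentMetadata(tableNameWithType, segmentName);
    if (segmentMetadata == null) {
      // This is the branch hit in the log above: no local metadata, so the reload is
      // skipped and the segment never leaves the ERROR state.
      LOGGER.info("Segment metadata is null. Skip reloading segment: {} in table: {}",
          segmentName, tableNameWithType);
      return;
    }
    // ... the actual reload would happen here ...
  }

  // Hypothetical helper: in the server this comes from the local segment directory,
  // and returns null when that directory/metadata is missing.
  private SegmentMetadata getLocalSegmentMetadata(String tableNameWithType, String segmentName) {
    return null;
  }
}
```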
I was wondering: if we can't find the segment metadata locally, can we fetch it from ZK? Also, is there a way for the server to auto-recover from such a situation?
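Roughly, the fallback I have in mind looks like the sketch below. Treat the exact classes and method signatures as assumptions on my part; this is only meant to illustrate looking up the segment's ZK metadata instead of giving up when the local metadata is missing:
```java
import org.apache.helix.store.zk.ZkHelixPropertyStore;
import org.apache.helix.zookeeper.datamodel.ZNRecord;
import org.apache.pinot.common.metadata.ZKMetadataProvider;
import org.apache.pinot.common.metadata.segment.SegmentZKMetadata;

// Sketch of a possible ZK fallback; imports and the ZKMetadataProvider call are
// my assumptions about the current API, not a tested patch.
public class ZkFallbackSketch {
  SegmentZKMetadata fetchFromZkIfMissingLocally(ZkHelixPropertyStore<ZNRecord> propertyStore,
      String tableNameWithType, String segmentName) {
    // When local segment metadata is null, consult the segment's ZK metadata so the
    // server could re-download (or re-consume) and recover without a restart.
    return ZKMetadataProvider.getSegmentZKMetadata(propertyStore, tableNameWithType, segmentName);
  }
}
```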
One of the cases where I have seen this issue is when there's a server restart and an in-flight `onBecomeConsumingFromOffline` transition is killed. When the server comes back up, the only sign of the problem is a `ServiceStatus` log entry saying that the segment is in ERROR state.