ankitsultana opened a new issue, #10342:
URL: https://github.com/apache/pinot/issues/10342

   I have seen this behavior quite often in our systems: a segment goes into ERROR state on one of the servers and a reload doesn't fix it, but restarting the server makes the segment healthy again. When checking the logs, I often see something like this:
   
   ```
   INFO  org.apache.helix.messaging.handling.HelixTaskExecutor  - Scheduling 
message db272de0-949c-4d9f-87dd-f4f912f19a38: 
my_table_REALTIME:my_table__79__0__20230224T2228Z, null->null
   2023-02-26 07:04:32.598 [HelixTaskExecutor-message_handle_thread] INFO  
my_table_REALTIME-SegmentReloadMessageHandler  - Handling message: 
ZnRecord=db272de0-949c-4d9f-87dd-f4f912f19a38, {CREATE_TIMESTAMP=1677395071480, 
EXECUTE_START_TIMESTAMP=1677395072598, 
MSG_ID=db272de0-949c-4d9f-87dd-f4f912f19a38, MSG_STATE=new, 
MSG_SUBTYPE=RELOAD_SEGMENT, MSG_TYPE=USER_DEFINE_MSG, 
PARTITION_NAME=my_table__79__0__20230224T2228Z, 
RESOURCE_NAME=my_table_REALTIME, RETRY_COUNT=0, SRC_CLUSTER=..., 
SRC_INSTANCE_TYPE=PARTICIPANT, SRC_NAME=Controller_.., TGT_NAME=61f.., 
TGT_SESSION_ID=.., TIMEOUT=-1, forceDownload=false}{}{}, Stat=Stat {_version=0, 
_creationTime=1677395072613, _modifiedTime=1677395072613, _ephemeralOwner=0}
   2023-02-26 07:04:32.598 [HelixTaskExecutor-message_handle_thread] INFO  
my_table_REALTIME-SegmentReloadMessageHandler  - Waiting for lock to refresh : 
my_table__79__0__20230224T2228Z, queue-length: 0
   2023-02-26 07:04:32.598 [HelixTaskExecutor-message_handle_thread] INFO  
my_table_REALTIME-SegmentReloadMessageHandler  - Acquired lock to refresh 
segment: my_table__79__0__20230224T2228Z (lock-time=0ms, queue-length=0)
   2023-02-26 07:04:32.598 [HelixTaskExecutor-message_handle_thread] INFO  
o.apache.pinot.server.starter.helix.HelixInstanceDataManager  - Reloading 
single segment: my_table__79__0__20230224T2228Z in table: my_table_REALTIME
   2023-02-26 07:04:32.598 [HelixTaskExecutor-message_handle_thread] INFO  
o.apache.pinot.server.starter.helix.HelixInstanceDataManager  - Segment 
metadata is null. Skip reloading segment: my_table__79__0__20230224T2228Z in 
table: my_table_REALTIME
   2023-02-26 07:04:32.598 [HelixTaskExecutor-message_handle_thread] INFO  
org.apache.helix.messaging.handling.HelixTask  - Message 
db272de0-949c-4d9f-87dd-f4f912f19a38 completed.
   2023-02-26 07:04:32.600 [HelixTaskExecutor-message_handle_thread] INFO  
org.apache.helix.messaging.handling.HelixTask  - Delete message 
db272de0-949c-4d9f-87dd-f4f912f19a38 from zk!
   2023-02-26 07:04:32.600 [HelixTaskExecutor-message_handle_thread] INFO  
org.apache.helix.messaging.handling.HelixTaskExecutor  - message finished: 
db272de0-949c-4d9f-87dd-f4f912f19a38, took 2
   ```
   
   This is the corresponding code:
   
   
https://github.com/apache/pinot/blob/3772b55dc4c35673762a182b2ee650469560aa97/pinot-server/src/main/java/org/apache/pinot/server/starter/helix/HelixInstanceDataManager.java#L277
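   
   For context, the check there looks roughly like this (paraphrased from the linked line, so details may differ slightly from the actual source):
   
   ```java
   // HelixInstanceDataManager#reloadSegment (paraphrased from the linked line)
   SegmentMetadata segmentMetadata = getSegmentMetadata(tableNameWithType, segmentName);
   if (segmentMetadata == null) {
     // This produces the log line shown above; the reload silently becomes a no-op.
     LOGGER.info("Segment metadata is null. Skip reloading segment: {} in table: {}", segmentName,
         tableNameWithType);
     return;
   }
   ```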
   
   I was wondering: if we can't find the segment metadata locally, can we fetch it from ZK instead? Also, is there a way for the server to auto-recover from such a situation?
   
   One of the cases where I have seen this issue is when a server restart kills an in-flight `onBecomeConsumingFromOffline` transition. When the server comes back up, the only sign of the problem is a `ServiceStatus` log saying that this segment is in ERROR state.

