dang-stripe opened a new issue, #10902: URL: https://github.com/apache/pinot/issues/10902
due to https://github.com/apache/pinot/issues/10900, we've noticed failed queries when segments go into ERROR state. we think the following happens for a realtime segment:

- replica 1 for the segment goes from OFFLINE -> ERROR state immediately because it fails to create a Kafka consumer
- brokers update their routing tables with an entry for this segment when the callback is received and treat the segment as unavailable; any queries routed at this time will fail
- replica 2 for the segment successfully goes from OFFLINE -> CONSUMING
- brokers update the routing table with the healthy server

desired behavior:

- brokers allow some buffer time before treating a segment as unavailable when only 1 replica is in ERROR state, so the other replica can potentially succeed, limiting the impact window of failed queries
- if all replicas are in ERROR state, treat the segment as unavailable immediately

example logs

```
# server 1 processes state transition for test_table_segment__50__1234__20230609T2053Z
[2023-06-09 20:53:42.415851] INFO [HelixStateTransitionHandler] [HelixTaskExecutor-message_handle_thread_39:25] handling message: b70d80ca-0a44-493d-9bcd-08b5b9d4f4d2 transit TEST_TABLE.test_table_segment__50__1234__20230609T2053Z|[] from:OFFLINE to:CONSUMING, relayedFrom: null

# server 2 processes state transition for test_table_segment__50__1234__20230609T2053Z
[2023-06-09 20:53:42.421356] INFO [HelixStateTransitionHandler] [HelixTaskExecutor-message_handle_thread_10:25] handling message: 168541ec-e639-4203-9d24-ab4f7c2c5cd1 transit TEST_TABLE.test_table_segment__50__1234__20230609T2053Z|[] from:OFFLINE to:CONSUMING, relayedFrom: null

# server 1 fails state transition due to transient networking issue
[2023-06-09 20:53:42.441666] INFO [HelixTask] [HelixTaskExecutor-message_handle_thread_39:25] Message: b70d80ca-0a44-493d-9bcd-08b5b9d4f4d2 (parent: null) handling task for TEST_TABLE:test_table_segment__50__1234__20230609T2053Z completed at: 1686344022441, results: false. FrameworkTime: 1 ms; HandlerTime: 25 ms.

# server 2 succeeds state transition ~8 seconds later
[2023-06-09 20:53:50.153309] INFO [HelixTask] [HelixTaskExecutor-message_handle_thread_10:25] Message: 168541ec-e639-4203-9d24-ab4f7c2c5cd1 (parent: null) handling task for TEST_TABLE:test_table_segment__50__1234__20230609T2053Z completed at: 1686344030153, results: true. FrameworkTime: 3 ms; HandlerTime: 7729 ms.

# brokers process routing table update for server 1's ERROR segment transition and treat segment as unavailable
[2023-06-09 20:53:55.120647] WARN [BaseInstanceSelector] [ClusterChangeHandlingThread:25] Failed to find servers hosting segment: test_table_segment__50__1234__20230609T2053Z for table: TEST_TABLE (all candidate instances: [] are disabled, counting segment as unavailable)
[2023-06-09 20:53:55.122153] INFO [BrokerRoutingManager] [ClusterChangeHandlingThread:25] Processed segment assignment change in 84ms (fetch ideal state and external view stats for 4 tables: 2ms, update routing entry for 1 tables ([TEST_TABLE]): 82ms)
[2023-06-09 20:53:55.120647] WARN [BaseInstanceSelector] [ClusterChangeHandlingThread:25] Failed to find servers hosting segment: test_table_segment__50__1234__20230609T2053Z for table: TEST_TABLE (all candidate instances: [] are disabled, counting segment as unavailable)

# observe query failures around 20:53:55

# brokers process routing table update for server 2's segment
[2023-06-09 20:54:08.084514] INFO [BrokerRoutingManager] [ClusterChangeHandlingThread:25] Processed segment assignment change in 82ms (fetch ideal state and external view stats for 4 tables: 3ms, update routing entry for 1 tables ([point_transaction_REALTIME]): 79ms)
```

cc @Jackie-Jiang @navina @jadami10

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
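For illustration only, the desired behavior listed in the issue body could be sketched roughly as below. This is not Pinot code: `ErrorGraceTracker`, `Decision`, `grace_period_ms`, and `expected_replicas` are hypothetical names, and it is written in Python for brevity even though Pinot itself is Java. It assumes the broker can see the per-replica states (OFFLINE, CONSUMING, ONLINE, ERROR) for a segment and knows how many replicas to expect.

```python
from enum import Enum


class Decision(Enum):
    ROUTABLE = "routable"        # at least one replica can serve queries
    WAIT = "wait"                # hold off: another replica may still come up
    UNAVAILABLE = "unavailable"  # count the segment as unavailable


SERVING_STATES = {"ONLINE", "CONSUMING"}


class ErrorGraceTracker:
    """Hypothetical helper (not a Pinot class) that delays marking a segment unavailable."""

    def __init__(self, grace_period_ms):
        self.grace_period_ms = grace_period_ms
        self._first_error_ms = {}  # segment -> time the first ERROR replica was seen

    def decide(self, segment, replica_states, expected_replicas, now_ms):
        """replica_states maps instance name -> state for replicas reported so far."""
        states = list(replica_states.values())
        if any(s in SERVING_STATES for s in states):
            self._first_error_ms.pop(segment, None)
            return Decision.ROUTABLE
        error_count = sum(1 for s in states if s == "ERROR")
        if error_count == 0:
            # Assumption: nothing has failed yet, replicas are still transitioning.
            return Decision.WAIT
        if error_count >= expected_replicas:
            # Every replica failed, so there is no reason to wait.
            return Decision.UNAVAILABLE
        first_seen = self._first_error_ms.setdefault(segment, now_ms)
        if now_ms - first_seen < self.grace_period_ms:
            return Decision.WAIT
        return Decision.UNAVAILABLE


# Usage with made-up relative timestamps (milliseconds) and a 10 s grace period.
tracker = ErrorGraceTracker(grace_period_ms=10_000)
one_error = tracker.decide("seg_a", {"server1": "ERROR"}, 2, now_ms=0)
recovered = tracker.decide("seg_a", {"server1": "ERROR", "server2": "CONSUMING"}, 2, now_ms=8_000)
all_error = tracker.decide("seg_b", {"server1": "ERROR", "server2": "ERROR"}, 2, now_ms=0)
```

With these inputs, a single ERROR replica yields `WAIT`, the later CONSUMING replica yields `ROUTABLE`, and two ERROR replicas yield `UNAVAILABLE` right away. How long the grace period should be, and what the broker does with a segment in the `WAIT` state, are left open here.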
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
