mcvsubbu commented on issue #4484: Pinot query timeout due to the broker 
waiting for a single non-responsive server
URL: 
https://github.com/apache/incubator-pinot/issues/4484#issuecomment-528440340
 
 
   As currently designed, the broker (after pruning segments) comes up with a 
set of segments that need to be covered in order to respond to the query. The 
broker has pre-constructed routing entries that determine which servers need 
to be reached in order to cover these segments. The broker then forwards the 
request and waits for responses until a timeout. If not all servers have 
responded by then, the response is flagged as partial in the metadata.
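
   To make the flow above concrete, here is a minimal sketch in Python. It is 
not Pinot's actual broker code: `scatter_gather`, `query_server`, and the 
`partial` / `servers_not_responded` keys are hypothetical names chosen for 
illustration, and the routing entry is assumed to be a plain dict from server 
to the segments it should cover.

```python
import concurrent.futures as cf


def scatter_gather(routing, query_server, timeout_s):
    """Send the query to every server in `routing` and wait up to `timeout_s`.

    routing      -- dict: server name -> list of segments (hypothetical shape)
    query_server -- callable(server, segments) -> per-server result (hypothetical)
    """
    pool = cf.ThreadPoolExecutor(max_workers=max(1, len(routing)))
    futures = {
        pool.submit(query_server, server, segments): server
        for server, segments in routing.items()
    }

    # Wait for all servers, but never longer than the timeout.
    done, _not_done = cf.wait(futures, timeout=timeout_s)

    # Keep only servers that answered in time without raising.
    results = {
        futures[f]: f.result() for f in done if f.exception() is None
    }

    # Do not block on stragglers; they finish (or fail) in the background.
    pool.shutdown(wait=False)

    missing = sorted(set(routing) - set(results))
    return {
        "results": results,
        "partial": bool(missing),          # flag surfaced in response metadata
        "servers_not_responded": missing,
    }
```

   With one slow server in the routing entry, the call returns after roughly 
`timeout_s` with `partial` set and the slow server listed as not responded, 
which is the situation described in this issue.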
   
   There are multiple metrics on servers and brokers (system, JVM and Pinot 
level) that administrators can set alerts on. Monitoring systems may also 
auto-restart any of these entities on specific alerts. We expect that a 
large-scale site-facing system (one that cannot tolerate more than a small 
number of such partial/failed responses) has such monitoring and automated 
repair systems.
   
   That being said, there is scope for improvement here.  
   
   One simple improvement is for the request to specify the timeout that it 
can tolerate. The broker waits for at most that long and then returns a 
partial response. Alerts may be set on the partial response flag to indicate 
that some administrator intervention is needed.
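
   A rough sketch of what this could look like, in Python. The option name 
`timeoutMs`, the broker-side cap, and the 1% alert threshold are all 
assumptions made for illustration; the proposal above does not fix any of 
them.

```python
def effective_timeout_ms(request_options, broker_max_ms=10_000):
    """Resolve how long the broker should wait for this request.

    `request_options` is a hypothetical dict of per-request options.
    Capping at `broker_max_ms` is an assumption, so that a request
    cannot make the broker wait indefinitely.
    """
    requested = request_options.get("timeoutMs")
    if requested is None:
        return broker_max_ms
    requested = int(requested)
    if requested <= 0:
        raise ValueError("timeoutMs must be positive")
    return min(requested, broker_max_ms)


def should_alert(partial_flags, max_partial_ratio=0.01):
    """Return True if too many recent responses were flagged partial.

    `partial_flags` is a sequence of booleans, one per recent response.
    """
    if not partial_flags:
        return False
    return sum(partial_flags) / len(partial_flags) > max_partial_ratio
```

   A request carrying `{"timeoutMs": 500}` would then be answered (possibly 
partially) after at most 500 ms, and a monitoring job could call something 
like `should_alert` over a sliding window of responses.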
   
   The broker may also support some (fancy) algorithms to back off routing to 
specific servers that have responded late (or not at all), and then include 
them slowly over time, only to back off again if the repair has not been done 
yet. Think of this as a score attached to a server in the routing entries: 
the score keeps improving as the server responds faster, but decreases when 
it times out or misses a response. The broker tends to favor servers with 
higher scores over those with lower ones (perhaps by giving fewer segments to 
servers with lower scores).
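
   One possible shape for such a score, sketched in Python. Everything here 
is an assumption for illustration (the class name, the multiplicative penalty, 
the gain and floor values, and weighted random selection among replicas); it 
is one way to get the "back off, then slowly re-include" behavior, not a 
design that exists in Pinot.

```python
import random


class ServerScores:
    """Per-server score in [floor, 1.0]; higher means more traffic."""

    def __init__(self, initial=1.0, floor=0.05, gain=0.1, penalty=0.5):
        self.initial = initial
        self.floor = floor      # never drop to 0, so the server is retried
        self.gain = gain        # how quickly a healthy server recovers
        self.penalty = penalty  # multiplicative drop on a timeout/miss
        self.scores = {}

    def score(self, server):
        return self.scores.get(server, self.initial)

    def on_response(self, server, latency_ms, timeout_ms):
        # Faster responses (relative to the timeout) recover more score.
        speed = max(0.0, 1.0 - latency_ms / timeout_ms)
        s = self.score(server)
        self.scores[server] = min(1.0, s + self.gain * speed * (1.0 - s))

    def on_timeout(self, server):
        self.scores[server] = max(self.floor, self.score(server) * self.penalty)

    def pick(self, replicas, rng=random):
        # Weighted random choice: low scorers still get a little traffic.
        weights = [self.score(r) for r in replicas]
        return rng.choices(replicas, weights=weights, k=1)[0]


def assign_segments(segment_replicas, scores, rng=random):
    """Build a routing entry: server -> segments, favoring high scores.

    segment_replicas -- dict: segment -> list of servers hosting it
    """
    routing = {}
    for segment, replicas in segment_replicas.items():
        routing.setdefault(scores.pick(replicas, rng), []).append(segment)
    return routing
```

   The floor keeps a small trickle of segments going to a penalized server, 
so the broker notices when it has been repaired; if it still times out, its 
score is pushed back down again.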
   
   Neither of these (nor any other that I can think of) auto-fixes a permanent 
problem on the server without some external intervention 
(restarts/resets/hardware replacement/whatever). Therefore, any system with 
stringent requirements should have appropriate alerts set that trigger these 
operations and, if possible, some auto-remediation applied for the most 
common causes.

