[I] Pinot graceful node replacement for large scale production usage [pinot]

via GitHub Tue, 03 Dec 2024 18:12:01 -0800


lnbest0707-uber opened a new issue, #14592:
URL: https://github.com/apache/pinot/issues/14592


   On a real-world, cloud-based stateful platform, the hosts underlying Pinot 
containers are in a constantly changing state, and host/node replacement is 
very frequent. Ideally, such operations should be fully transparent to users 
and should not even require the Pinot admins' attention.
   However, Pinot does not currently have a sufficiently graceful way to 
handle node replacement. Although tables usually have multiple replicas, 
running in an under-replicated state puts the system under stress and at 
risk. For example, if a table has 2 replicas, then during node replacement 
we have to live with the following:
   
   - Many segments are served by only 1 replica, so the query load on that 
replica doubles.
   - For segments running with 1 replica, there is a very high risk of data 
loss or query failures if another node goes down due to hardware or network 
issues.
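
   As a rough illustration of the first point, the sketch below computes how 
much the per-replica query load grows when some replicas are offline. It 
assumes, purely for illustration, that queries for a segment are spread evenly 
across its online replicas; `load_multiplier` is a hypothetical helper, not a 
Pinot API.

```python
def load_multiplier(replicas: int, offline: int) -> float:
    """Per-replica query load relative to the fully replicated state.

    Assumption (illustrative only): queries for a segment are spread
    evenly across the replicas that are currently online.
    """
    if replicas <= 0 or offline < 0:
        raise ValueError("replicas must be positive and offline non-negative")
    online = replicas - offline
    if online <= 0:
        raise ValueError("no online replica left; the segment is unavailable")
    return replicas / online


# The 2-replica case from the text: one replica down doubles the load.
print(load_multiplier(2, 1))  # 2.0
# With 3 replicas, losing one raises the load on the others by 50%.
print(load_multiplier(3, 1))  # 1.5
```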
   
   Although the same issue arises during a node restart, node replacement 
is far slower than a restart, especially with high data volume. For example, 
for a large node with 5+ TB of data, replacing a single node might take 
5+ hours to complete, whereas a node restart might take only minutes. The 
longer the node replacement takes, the longer the node downtime we have to 
endure and the higher the risk.
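
   For a back-of-the-envelope feel for these numbers, the sketch below 
estimates replacement time from data size and a steady effective throughput. 
The 5 TB and 5 hour figures come from the example above; the decimal units, 
the constant rate, and the helper name `replacement_hours` are assumptions for 
illustration.

```python
def replacement_hours(data_tb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to fetch and load `data_tb` terabytes at a steady rate.

    Assumptions (illustrative only): decimal units (1 TB = 1e12 bytes,
    1 MB = 1e6 bytes) and a constant effective throughput that already
    includes download and load time.
    """
    if data_tb < 0 or throughput_mb_per_s <= 0:
        raise ValueError("data size must be >= 0 and throughput > 0")
    seconds = (data_tb * 1e12) / (throughput_mb_per_s * 1e6)
    return seconds / 3600


# 5 TB in about 5 hours corresponds to roughly 278 MB/s effective throughput.
print(round(replacement_hours(5, 278), 2))
```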
   
   Hence, **reducing node replacement downtime** is crucial for smooth large 
scale production maintenance.
   
   During the downtime, we observe:
   
   - Helix pending messages slowly decrease to 0
   - SEGMENTS_WITH_LESS_REPLICAS (introduced in 
https://github.com/apache/pinot/pull/12336) slowly decrease
   
   This is far slower than a node restart because the node has to download 
the missing segment data from either the deep store or peers before loading it 
into memory.
   Therefore, a straightforward and effective way to reduce the downtime is to 
**pre-download all required segments on the new node** (NN) before bringing 
down the old node (ON). Afterwards, bringing down the ON and starting up the NN 
would be the same as a lightweight node restart.
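
   A minimal sketch of the planning step behind this idea is shown below: 
given a table's segment assignment, work out which segments the NN still needs 
to fetch before the swap. It assumes the assignment is available as a mapping 
from segment name to `{instance: state}`, and that only segments in the 
`ONLINE` state on the ON are worth pre-downloading. The function name, the 
instance names, and the input shapes are hypothetical and not an existing 
Pinot API.

```python
from typing import Dict, Iterable, List


def segments_to_predownload(
    ideal_state: Dict[str, Dict[str, str]],
    old_instance: str,
    already_local: Iterable[str],
) -> List[str]:
    """Segments the new node should fetch before the old node is stopped.

    Assumptions (illustrative only):
    - `ideal_state` maps segment name -> {instance name: state}.
    - Only segments that are ONLINE on the old node are pre-downloaded;
      segments in other states (e.g. CONSUMING) are skipped.
    - `already_local` lists segments already present on the new node.
    """
    local = set(already_local)
    return sorted(
        segment
        for segment, assignment in ideal_state.items()
        if assignment.get(old_instance) == "ONLINE" and segment not in local
    )


# Hypothetical assignment: the old node hosts seg_0, seg_2 and seg_3.
ideal_state = {
    "seg_0": {"Server_old": "ONLINE", "Server_b": "ONLINE"},
    "seg_1": {"Server_b": "ONLINE", "Server_c": "ONLINE"},
    "seg_2": {"Server_old": "ONLINE", "Server_c": "ONLINE"},
    "seg_3": {"Server_old": "CONSUMING", "Server_b": "CONSUMING"},
}

# seg_2 is already on the new node, so only seg_0 remains to be fetched.
print(segments_to_predownload(ideal_state, "Server_old", ["seg_2"]))
```

   Once this list is empty, stopping the ON and starting the NN should only 
need to load local data, which is what makes the swap comparable to a restart.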
   


