[akka-user] upgrading production cluster (sharded) system

Peter Wed, 14 Jan 2015 09:45:07 -0800

Hi

I wonder if anyone has experience or thoughts to share about upgrading 
production cluster systems?


I would ideally like to:

   - upgrade the cluster without downtime or a scheduled outage
   - not mutate infrastructure; in other words, deploy a new set of nodes 
   with the new version
   - do a staged upgrade, starting with a single node that takes as little 
   production traffic as possible - a canary in the coal mine


A little bit more about my specific environment:

   - the cluster runs as a single EC2 autoscale group
   - no Akka roles (I'm looking into these as a way to gain independence 
   between functional areas within the application and to facilitate 
   independent upgrades, something akin to microservices, to use the 
   buzzword du jour)
   - I don't use Akka Persistence, but each sharded actor is backed by my 
   own distributed persistence mechanism based on DynamoDB
   - there is some tolerance for stale reads, but there could be some cases 
   where they are not acceptable

My understanding is that the number of cluster shards should be kept 
constant irrespective of the number of cluster nodes, so that shard 
resolution also remains stable, as in the example in the documentation. 
It sounds like the bundled rebalancing strategy 
(LeastShardAllocationStrategy) should do the trick when adding the first 
node (the canary). I'm wondering whether there are any suggestions for 
doing the rest?
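
To make the stable-resolution point concrete, here is a minimal sketch in 
plain Java. It is not Akka API: the class name ShardResolver and the method 
shardId are hypothetical, and hashing the entity id modulo a fixed shard 
count is only an assumption about how the resolution could be done.

```java
// Hypothetical helper, not part of Akka: maps an entity id to a shard id.
final class ShardResolver {
    private ShardResolver() {}

    // The shard id depends only on the entity id and the (fixed) shard
    // count, never on how many nodes are currently in the cluster.
    static int shardId(String entityId, int numberOfShards) {
        if (numberOfShards <= 0) {
            throw new IllegalArgumentException("numberOfShards must be positive");
        }
        // floorMod keeps the result in [0, numberOfShards) even when
        // hashCode() is negative.
        return Math.floorMod(entityId.hashCode(), numberOfShards);
    }
}
```

As long as numberOfShards is the same in the old and the new version, both 
resolve a given entity id to the same shard, so only the shard-to-node 
placement changes during the upgrade.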


   - start all of the remaining new cluster nodes 
      - at what point does rebalancing get kicked off? Is there a specific 
      event that triggers a rebalance? Is it possible to delay it until all 
      the new nodes (or X nodes) have joined, or Y time has passed, to 
      minimize disruption (a single rebalance vs. a rebalance for every node)?
   - wait for period X to ensure rebalancing is complete and all messages 
   buffered during rebalancing have been processed 
      - is it possible to determine this programmatically?
   - stop all the old version nodes 
      - one by one with a period in between, or all at once?
      - at this point, messages in flight are lost, so I need to fall back 
      on clients to retry
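
The "delay until X nodes have joined or Y time has passed" idea from the 
list above can be sketched as a small gate. This is plain Java with 
hypothetical names (RebalanceGate, nodeJoined, shouldRebalance), not Akka 
API. Whether and how such a gate can be wired into the sharding allocation 
strategy is exactly the open question, so treat the wiring as an assumption.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical gate, not part of Akka: opens once enough new nodes have
// joined or a maximum wait has elapsed, whichever comes first.
final class RebalanceGate {
    private final int expectedNewNodes;
    private final Instant deadline;
    private int joined = 0;

    RebalanceGate(int expectedNewNodes, Duration maxWait, Instant startedAt) {
        this.expectedNewNodes = expectedNewNodes;
        this.deadline = startedAt.plus(maxWait);
    }

    // Call once for every new-version node seen joining the cluster.
    void nodeJoined() {
        joined++;
    }

    // True when rebalancing may start: X nodes have joined or Y time passed.
    boolean shouldRebalance(Instant now) {
        return joined >= expectedNewNodes || !now.isBefore(deadline);
    }
}
```

The current time is passed in explicitly so the gate can be tested without 
waiting. Code that decides on rebalancing would simply do nothing while 
shouldRebalance returns false.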
   
It gets progressively more hand-wavy towards the end as I'm still thinking 
about the details. I would love some input and feedback!

Thanks
Peter
