Reading the start of this thread again, ignoring the proposed solution. When B crashes it will take some time for the failure detector to notice and for the cluster to remove the node. During this time messages will still be routed to B, i.e. lost. Then, when the coordinator is started on A, it takes some additional time for it to become aware that B has terminated. This means that messages will be lost in the case of a crash, no matter whether we tighten something in the coordinator startup.
Cluster sharding is not designed to prevent message loss. For rebalancing and passivation it tries hard not to drop messages, but the crash case is out of scope for cluster sharding. Therefore you need to use some other mechanism on top to ensure that your messages are delivered, e.g. AtLeastOnceDelivery in Akka Persistence.

/Patrik

On Tue, Aug 5, 2014 at 7:07 PM, Patrik Nordwall <[email protected]> wrote:

> On Aug 5, 2014, at 16:32, delasoul <[email protected]> wrote:
>
> It is then a trade-off between losing messages and sharding with
> incomplete information (for a certain period).
> If the Coordinator moves, the ShardRegion on C has to re-register with
> the new Coordinator on A.
> If the filter in AfterRecover removed the ShardRegion from C (because it
> is not in the current cluster state of A yet), it will be added and
> watched again when C re-registers.
> In the meantime shards will only get allocated on A, but no messages are
> lost and the sharding imbalance is fixed by a later rebalance...
>
> A shard MUST only live at one place at a time. In my example it lives on
> C, but is declared terminated based on wrong information by the
> coordinator and can thereby be allocated on some other node as well.
>
> An alternative to filtering the Coordinator state would be to not call
> requestShardBufferHomes when the ShardRegion receives the RegisterAck
> from the Coordinator, so that setting the retryInterval to a higher
> period would have an effect, but I don't know if this would be a better
> trade-off?
>
> I understand your concern, but I have to think more about how serious it
> is and how to improve it. That will take some time, so I suggest that
> you create an issue at https://github.com/akka/akka
> /Patrik
>
> thanks,
> michael
>
> On Tuesday, 5 August 2014 14:15:47 UTC+2, Patrik Nordwall wrote:
>>
>> It is not possible to base this on currentMembers (or cluster
>> membership events). Let me describe a counter example.
>>
>> A and B, as you describe. Node C is added and B allocates a shard to C.
>> B crashes, and A starts a new coordinator. At this point it is not
>> guaranteed that A has seen that node C was added, i.e. it should try to
>> use it (watch it) anyway.
>>
>> /Patrik
>>
>> On Tue, Jul 29, 2014 at 1:56 PM, delasoul <[email protected]> wrote:
>>
>>> Hello Patrik,
>>>
>>> thank you for looking into this.
>>> But I think I was wrong about the retry interval:
>>> The Retry message handler first checks whether the Coordinator is Some,
>>> and only then is requestShardBufferHomes called; therefore the
>>> remaining ShardRegion on node A first has to receive a MemberRemoved
>>> cluster event to update the oldest-member information and send the
>>> Register msg to the correct cluster node.
>>> When the ShardRegion receives the RegisterAck from the Coordinator,
>>> requestShardBufferHomes is called - so a longer retry interval will
>>> have no effect. (hope I got it right this time:)
>>> So to fix this, I think the Coordinator's state has to be updated when
>>> or after it is replayed (and it would reduce the cases where msgs can
>>> get lost by one:)
>>>
>>> thanks again,
>>>
>>> michael
>>>
>>> On Tuesday, 29 July 2014 11:36:04 UTC+2, Patrik Nordwall wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> I get your point. I'm not sure your workaround is correct for all
>>>> scenarios. I will follow up on this next week. Perhaps we can improve
>>>> this, but there are other cases when messages will be lost, so
>>>> reliable delivery must anyway be added on top when it is needed.
>>>>
>>>> /Patrik
>>>>
>>>> On 29 Jul 2014, at 10:24, delasoul <[email protected]> wrote:
>>>>
>>>> Hello,
>>>>
>>>> we have, for example, a 2-node cluster; the ShardCoordinator runs on
>>>> node B, sharded actors run on node A and node B.
>>>> When node B goes down, the coordinator is started on node A and gets
>>>> replayed, which means that its state still holds outdated information
>>>> about the ShardRegion (with its ShardIds) of the removed node B.
>>>> In the meantime the ShardRegion on node A buffers all incoming
>>>> messages until it is able to re-register with the new Coordinator.
>>>> Then it requests the ShardHomes for the ShardIds formerly housed on
>>>> node B, but as the Coordinator still has the information that these
>>>> ShardIds should live on node B, it returns this information and the
>>>> ShardRegion on node A forwards the messages to node B, which will
>>>> fail...
>>>> When the Coordinator finishes replaying it watches all ShardRegions,
>>>> including the removed ShardRegion on node B, hence it will get a
>>>> Terminated message for this actor - but unfortunately that is received
>>>> too late if the ShardRegion's retry interval is set too low.
>>>> To make this more "predictable" we added a filter in the
>>>> ShardCoordinator's AfterRecover message handler:
>>>>
>>>> case AfterRecover ⇒
>>>>   // Only watch regions whose node is still part of the current
>>>>   // cluster state; declare the rest terminated right away.
>>>>   val currentMembers = Cluster(context.system).state.members
>>>>   persistentState.regions.foreach { case (region, _) ⇒
>>>>     if (currentMembers.exists(_.address == region.path.address))
>>>>       context.watch(region)
>>>>     else {
>>>>       persist(ShardRegionTerminated(region)) { evt ⇒
>>>>         persistentState = persistentState.updated(evt)
>>>>       }
>>>>     }
>>>>   }
>>>>
>>>> Could this (or a better solution) be added to the ShardCoordinator?
>>>>
>>>> michael
>>>>
>>>> --
>>>> >>>>>>>>>> Read the docs: http://akka.io/docs/
>>>> >>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
>>>> >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
>>>> ---
>>>> You received this message because you are subscribed to the Google
>>>> Groups "Akka User List" group.
>>>> To unsubscribe from this group and stop receiving emails from it,
>>>> send an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at http://groups.google.com/group/akka-user.
>>>> For more options, visit https://groups.google.com/d/optout.
>>
>> --
>> Patrik Nordwall
>> Typesafe <http://typesafe.com/> - Reactive apps on the JVM
>> Twitter: @patriknw
--
Patrik Nordwall
Typesafe <http://typesafe.com/> - Reactive apps on the JVM
Twitter: @patriknw
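To illustrate Patrik's recommendation of layering AtLeastOnceDelivery from Akka Persistence on top of sharding: a minimal sketch along the lines of the classic Akka Persistence documentation pattern. The `Msg`/`Confirm` envelope names, the events, and the `persistenceId` are made up for illustration; in a sharding setup the `destination` would typically be the local ShardRegion's path, and the sharded entity must send back `Confirm` for redelivery to stop.

```scala
import akka.actor.ActorPath
import akka.persistence.{ AtLeastOnceDelivery, PersistentActor }

// Hypothetical envelope and confirmation messages (not from the Akka API).
case class Msg(deliveryId: Long, payload: String)
case class Confirm(deliveryId: Long)

sealed trait Evt
case class MsgSent(payload: String) extends Evt
case class MsgConfirmed(deliveryId: Long) extends Evt

class ReliableSender(destination: ActorPath)
    extends PersistentActor with AtLeastOnceDelivery {

  override def persistenceId: String = "reliable-sender"

  override def receiveCommand: Receive = {
    case payload: String =>
      // Persist the intent first, then deliver; redelivery continues
      // until confirmDelivery is called for the deliveryId.
      persist(MsgSent(payload))(updateState)
    case Confirm(deliveryId) =>
      persist(MsgConfirmed(deliveryId))(updateState)
  }

  override def receiveRecover: Receive = {
    case evt: Evt => updateState(evt)
  }

  def updateState(evt: Evt): Unit = evt match {
    case MsgSent(payload) =>
      deliver(destination)(deliveryId => Msg(deliveryId, payload))
    case MsgConfirmed(deliveryId) =>
      confirmDelivery(deliveryId)
  }
}
```

Because both the send intent and the confirmation are persisted, unconfirmed messages are redelivered after a crash of the sending side, which covers the window discussed above in which sharding itself drops messages.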
