On 11.12.2014 15:20, Simo Sorce wrote:
> On Thu, 11 Dec 2014 14:18:36 +0100
> Ludwig Krispenz <lkris...@redhat.com> wrote:
>> On 12/05/2014 04:50 PM, Simo Sorce wrote:
>>> On Thu, 04 Dec 2014 14:33:09 +0100
>>> Ludwig Krispenz <lkris...@redhat.com> wrote:
>>>> I just have another (hopefully this will end soon) issue I want to
>>>> get your input on. (Please read to the end first.)
>>>> To recap the conditions:
>>>> - the topology plugin manages the connections between servers as
>>>> segments in the shared tree
>>>> - it is authoritative for managed servers, e.g. it controls all
>>>> connections between servers listed under cn=masters;
>>>> it is permissive for connections to other servers
>>>> - it rejects any removal of a segment, which would disconnect the
>>>> - a change in topology can be applied to any server in the
>>>> topology, it will reach the respective servers and the plugin will
>>>> act upon it
>>>> Now there is a special case, causing a bit of trouble. If a replica
>>>> is to be removed from the topology, this means that
>>>> the replication agreements from and to this replica should be
>>>> removed, and the server should be removed from the managed servers.
>>>> The problem is that:
>>>> - if you remove the server first, the server becomes unmanaged and
>>>> removal of the segment will not trigger a removal of the
>>>> replication agreement
>>> Can you explain what exactly you mean by "if you remove the server
>>> first"? What LDAP operation will be performed by the management
>>> tools?
>> as far as the plugin is concerned a removal of a replica triggers two
>> - removal of the host from the servers in cn=masters, so the server
>> is no longer considered as managed
>> - removal of the segment(s) connecting the to-be-removed replica to
>> other still managed servers, which should remove the corresponding
>> replication agreements.
>> It was the order of these two operations I was talking
> We can define a correct order, the plugin can refuse to do any other
> order for direct operations (we need to be careful not to refuse
> replication operations I think).
>>>> - if you remove the segments first, one segment will be the last
>>>> one connecting this replica to the topology and removal will be
>>> We should never remove the segments first indeed.
>> if we can fully control that only specific management tools can be
>> used, we can define the order, but an admin could apply individual
>> operations and still it would be good if nothing breaks
> I think we had a plan to return UNWILLING_TO_PERFORM if the admin tries
> to remove the last segment first. So we would have no problem really,
> the admin can try and fail. If he wants to remove a master he'll have
> to remove it from the masters group, and this will trigger the removal
> of all segments.
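A minimal sketch of the ordering rule described above, as a toy in-memory model rather than the plugin's actual code. The class and method names are hypothetical, and the check is simplified to "is this the last segment of a managed server" instead of a full connectivity test of the topology. Only direct operations are modelled; replicated operations, which should not be refused, are left out.

```python
SUCCESS = 0
NO_SUCH_OBJECT = 32        # LDAP result code noSuchObject (RFC 4511)
UNWILLING_TO_PERFORM = 53  # LDAP result code unwillingToPerform (RFC 4511)


class Topology:
    """Toy model: managed masters are nodes, segments are undirected edges."""

    def __init__(self):
        self.masters = set()
        self.segments = set()  # each segment is a frozenset of two host names

    def add_master(self, name):
        self.masters.add(name)

    def add_segment(self, a, b):
        self.segments.add(frozenset((a, b)))

    def _segments_of(self, name):
        return {seg for seg in self.segments if name in seg}

    def delete_segment(self, a, b):
        """Direct delete of a segment; refused if it is the last one
        connecting a still-managed server."""
        seg = frozenset((a, b))
        if seg not in self.segments:
            return NO_SUCH_OBJECT
        for end in seg:
            if end in self.masters and self._segments_of(end) == {seg}:
                return UNWILLING_TO_PERFORM
        self.segments.discard(seg)
        return SUCCESS

    def delete_master(self, name):
        """Removing a master is the primary trigger: all of its segments
        go with it. Returns the removed segments as sorted host pairs."""
        self.masters.discard(name)
        removed = self._segments_of(name)
        self.segments -= removed
        return sorted(tuple(sorted(seg)) for seg in removed)
```

With masters a, b, c connected as a-b and b-c, a direct delete of b-c is refused because it is c's last segment, while removing c from the masters removes b-c as a side effect.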
>>>> Now, with some effort this can be resolved, e.g.
>>>> if the server is removed, keep it internally as a removed server and
>>>> for segments connecting this server trigger removal of replication
>>>> agreements, or mark the last segment, when its removal is attempted,
>>>> as pending and, once the server is removed, also remove the
>>>> corresponding repl agreements
>>> Why should we "keep it internally" ?
>>> If you mark the agreements as managed by setting an attribute on
>>> them, then you will never have any issue recognizing a "managed"
>>> agreement in cn=config, and you will also immediately find out it
>>> is "old" as it is not backed by a segment so you will safely remove
>> I didn't want to add new flags/fields to the replication agreements
>> as long as anything can be handled by the data in the shared tree.
> We have to. I think it is a must or we will find numerous corner cases.
> Is there a specific reason why you do not want to add flags to
> replication agreements in cn=config ?
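A small sketch of the "mark agreements as managed" idea discussed above, assuming agreements are represented as plain dictionaries. The `managed` key stands in for whatever attribute would be set on the agreement in cn=config; the thread does not name it, so it is hypothetical, as are the function and field names.

```python
def stale_agreements(local_host, agreements, segments):
    """Return the names of managed agreements on local_host that are
    no longer backed by a segment in the shared tree.

    agreements: list of dicts with keys "name", "remote" and the
                hypothetical boolean flag "managed"
    segments:   iterable of (host, host) pairs from the shared tree
    """
    # Hosts that local_host is still connected to by a segment.
    peers = set()
    for left, right in segments:
        if left == local_host:
            peers.add(right)
        elif right == local_host:
            peers.add(left)

    # Unmanaged agreements are never touched: the plugin is permissive
    # for connections it does not manage.
    return [
        ag["name"]
        for ag in agreements
        if ag.get("managed", False) and ag["remote"] not in peers
    ]
```

For example, on server s1 with managed agreements to r1 and s2, an unmanaged agreement to some other host, and only the segment (s1, s2) left in the shared tree, the function reports just the agreement to r1 as stale.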
>> "internally" was probably misleading, but I will think about it again
> Ok, it is important we both understand what issues we see with any of
> the possible approaches so we can agree on the best one.
>>> Segments (and their agreements) should be removed as trigger on the
>>> master entry getting removed. This should be done even if it causes
>>> a split brain, because if the server is removed, no matter how much
>>> we wish to keep topology integrity we effectively are in a split
>>> brain situation, keeping topology agreements alive w/o the server
>>> entry doesn't help.
>> If we can agree that presence/removal of masters is the
>> primary trigger, that's fine.
> Yes I think we can definitely agree that this is the primary trigger
> for server removal/addition.
>> I was thinking of situations where a server was removed,
>> but not uninstalled.
> Understood, but even then it makes no real difference, once the server
> is removed from the group of masters it will not be able to replicate
> outbound anymore as the other masters' ACIs will not recognize this
> server's credentials as valid replicator creds.
>> Just taking it out of the topology, but it could still be reached
> It can be reached, and that may be a problem for clients. But in the
> long term this should be true only for clients manually configured to
> reach that server. Clients that use SRV records would see it drop off,
> and switch to another one.
> We may consider whether we want some automatism that causes the server
> to shut itself down if it can't replicate (or receives replication data
> to the effect it realizes it is out of the topology). But this may be a
> little too drastic.
>>>> But there is a problem, which I think is much harder and I am not
>>>> sure how much effort I should put in resolving it.
>>>> If we want to have the replication agreements cleaned up after
>>>> removal of a replica without direct modification of cn=config, we
>>>> need to follow the path above,
>>>> but this also means that the last change needs to reach both the
>>>> removed replica (R) and the last server(S) it is connected to.
>>> It would be nice if the change reached the replica, indeed, but
>>> not a big deal if it doesn't, if you are removing the replica it
>>> means you are decommissioning it, so it is not really that
>>> important that it receives updates, it will be destroyed shortly.
>> That's what I was not sure about, couldn't there be cases where it is
>> not destroyed, just isolated.
> Why would you isolate a server ? Is there a legitimate case an admin
> would want to do that ?
I know about one use case: Upgrade testing.
- Install new replica.
- Connect it to existing topology and suck in all the data.
- Disconnect the new replica from rest of topology.
- Do upgrade experiments.
- Destroy the new/'experimental' replica.
>>> And if it is not destroyed for whatever reason, it will be removed
>>> from the masters group anyway so it will have no permission to
>>> replicate back, and no harm is done to the overall domain.
>>>> The bad thing is that if this change triggers a
>>>> removal of the replication agreement on S it could be that the
>>>> change is not replicated to R before the agreement is removed and
>>>> is lost. There is no way (or no easy way) for the plugin to know
>>>> if a change was received by another server,
>>> There is an easy way, contact the other server and see if the change
>>> happened in its LDAP tree :)
>>> But this is not really necessary, as explained above.
>>>> I was also thinking about some kind
>>>> of acknowledge mechanism by doing a ping pong of changes, but the
>>>> problem always is the same that one server does not know if the
>>>> other has received it.
>>>> And even if this would theoretically work, we cannot be sure that R
>>>> is not shut down while only the remaining topology is being
>>>> cleaned up, so S would wait forever.
>>> We should not care, if you are deleting a replica it doesn't matter
>>> what's on the replica side IMO.
>>>> My suggestion to resolve this (in most cases) is to define a wait
>>>> interval, after the final combination of removal of a server and
>>>> its connecting segment is received, wait for some time and then
>>>> remove the corresponding replication agreements.
>>> Why ?
>>>> So I'm asking you if this would be acceptable or if you have a
>>>> better solution.
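The wait interval proposed above (receive the final removal, wait some time, then remove the replication agreements) could be sketched roughly as follows. This is an illustration only: the class, its methods and the use of plain numeric timestamps are assumptions, and as noted in the proposal it covers most cases at best, since waiting cannot prove that R actually received the change.

```python
class DeferredRemoval:
    """Hold agreement removals back for a fixed wait interval."""

    def __init__(self, wait_seconds):
        self.wait_seconds = wait_seconds
        self._due = {}  # agreement name -> time at which removal is due

    def schedule(self, agreement, now):
        # Keep the earliest deadline if the same removal arrives twice.
        self._due.setdefault(agreement, now + self.wait_seconds)

    def cancel(self, agreement):
        # E.g. the server was put back into the masters group meanwhile.
        self._due.pop(agreement, None)

    def collect_due(self, now):
        """Return (and forget) the agreements whose wait has elapsed."""
        ready = sorted(name for name, due in self._due.items() if due <= now)
        for name in ready:
            del self._due[name]
        return ready
```

The caller passes the current time explicitly, which keeps the sketch free of threads and timers; a real plugin would drive `collect_due` from whatever periodic task mechanism the server offers.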
>>> I am trying to understand why we have a problem, actually, I do not
>>> really see one, why do you think it is important to update a replica
>>> that is being killed ?
>> because I had scenarios in mind where it would not be killed, just
>> removed from the topology
> Ok, but I do not see what would be a legitimate action to cause a
> server to get out. But even if that happens the server won't be able to
> replicate back to the domain until the admin takes the step of putting
> the server back into the masters group (causing replication to be
> restored both ways), so I see no harm.
Freeipa-devel mailing list