[jira] [Commented] (SLING-3432) pseudo network partition causes job deserialization issue in a cluster (when reading while job is being reassigned)

Stefan Egli (JIRA) Mon, 13 Apr 2015 08:17:03 -0700

    [ 
https://issues.apache.org/jira/browse/SLING-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492494#comment-14492494
 ]


Stefan Egli commented on SLING-3432:
------------------------------------

Here's one improvement we should follow up on, discussed offline today with 
[~cziegeler] and [~mmarth]: 

* we should avoid the 'isolated mode' completely. It was the wrong thing to do 
from the beginning. When a node is in 'isolated mode' it is essentially 
cut off and should not take part in any 'leader/topology-dependent' action *at 
all*. 
* every pseudo-network-partitioning consisted of one or more nodes that 
were in isolated mode. And it always had to do with the fact that the isolated 
mode created a leader (this was the mistake made at the beginning: every 
'cluster' has to have a leader, thus if a node is in an isolated cluster, it 
is automatically leader: *wrong*). And once you let the isolated node become a 
leader you have multiple leaders and the trouble begins.
* therefore avoid the isolated mode entirely 

...but instead:

# when any change is detected: immediately fire a TOPOLOGY_CHANGING event. A 
change is detected by a timed-out heartbeat, by a new 'ongoingVoting' of 
another node (which detected the heartbeat timeout first), or even by an 
'establishedView' (if that was not preceded by an 'ongoingVoting', it means 
the local node was too slow in reacting for the others). 
#* that one [as the docu 
says|https://github.com/apache/sling/blob/trunk/bundles/extensions/discovery/api/src/main/java/org/apache/sling/discovery/TopologyEvent.java]
  {quote}Informs the service about the fact that a state change was detected in 
the topology/cluster and that the new state is in the process of being 
discovered.{quote}
# even if the instance cannot finish a vote, cannot take part in the vote, or 
is thrown out of the establishedView by others: it should *not* send a 
TOPOLOGY_CHANGED event *until it has successfully been included in an 
establishedView*.
#* between these two events there can now be a long delay, be it minutes or 
even hours, if for some reason the instance is cut off from the 
repository/voting-mechanism.
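
The two rules above could be sketched roughly as follows. This is only an illustration, not the discovery.impl code: the class {{TopologyEventSender}}, its methods and the {{viewId}} parameter are made-up names, the event type enum merely mirrors the two event names used in this comment, and TOPOLOGY_INIT handling is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the proposed sending rules; not the discovery.impl code. */
public class TopologyEventSender {

    /** Stand-in for the two event types discussed in this comment. */
    public enum Type { TOPOLOGY_CHANGING, TOPOLOGY_CHANGED }

    private final List<Type> sent = new ArrayList<>();
    private String currentViewId;      // last view that included the local instance
    private boolean changing = false;  // true between CHANGING and the next CHANGED

    /** Call on a heartbeat timeout or on a new 'ongoingVoting' of another node. */
    public synchronized void onChangeDetected() {
        if (!changing) {
            changing = true;
            sent.add(Type.TOPOLOGY_CHANGING);
        }
    }

    /** Call whenever an 'establishedView' becomes visible to the local instance. */
    public synchronized void onEstablishedView(String viewId, boolean includesLocal) {
        if (!viewId.equals(currentViewId)) {
            // a view we did not see coming is itself a detected change
            onChangeDetected();
        }
        if (includesLocal && changing) {
            // only now is it safe to announce the new topology
            currentViewId = viewId;
            changing = false;
            sent.add(Type.TOPOLOGY_CHANGED);
        }
        // if the local instance is not included: stay silent, however long it takes
    }

    /** Copy of the events sent so far, in order. */
    public synchronized List<Type> sentEvents() {
        return new ArrayList<>(sent);
    }
}
```

Note there is no branch that makes the local instance its own leader: an instance that was thrown out simply stays in the 'changing' state.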

In other words: every component receiving a TOPOLOGY_CHANGING event should 
*back off and not do any leader or topology-dependent action until* a 
TOPOLOGY_CHANGED is received. This should be the recipe to handle 
pseudo-network-partitioning properly.
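
On the receiving side, the back-off could look something like the sketch below. Again an assumption-laden illustration: {{LeaderActionGuard}} and {{runIfLeader}} are hypothetical names, the enum is a local stand-in for the event types of the discovery API, and the {{localIsLeader}} flag is passed in directly, whereas a real listener would presumably read it from the new view's local instance.

```java
/** Hypothetical sketch of a component that backs off while the topology is changing. */
public class LeaderActionGuard {

    /** Local stand-in for the event types of the discovery API. */
    public enum Type { TOPOLOGY_INIT, TOPOLOGY_CHANGING, TOPOLOGY_CHANGED, PROPERTIES_CHANGED }

    private boolean stable = false;  // false until the first view and while a change is ongoing
    private boolean leader = false;  // only meaningful while stable

    /** Feed every topology event into the guard. */
    public synchronized void handleTopologyEvent(Type type, boolean localIsLeader) {
        switch (type) {
            case TOPOLOGY_CHANGING:
                // back off: the old view can no longer be trusted
                stable = false;
                break;
            case TOPOLOGY_INIT:
            case TOPOLOGY_CHANGED:
                // a new view including the local instance has been established
                leader = localIsLeader;
                stable = true;
                break;
            default:
                // property changes do not affect leadership
                break;
        }
    }

    /** Runs the action only if the topology is stable and the local instance is leader. */
    public synchronized boolean runIfLeader(Runnable action) {
        if (stable && leader) {
            action.run();
            return true;
        }
        return false;
    }
}
```

A component would wrap each leader or topology-dependent step (e.g. reassigning jobs) in {{runIfLeader}}, so that the step is skipped for as long as no TOPOLOGY_CHANGED has arrived.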

> pseudo network partition causes job deserialization issue in a cluster (when 
> reading while job is being reassigned)
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: SLING-3432
>                 URL: https://issues.apache.org/jira/browse/SLING-3432
>             Project: Sling
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: Discovery Impl 1.0.2
>            Reporter: Stefan Egli
>
> There is a race condition between two instances in a cluster (e.g. oak or crx): 
> Instance 1 is writing a job with a binary property, instance 2 is reading the 
> job (likely triggered by discovery sending it a TOPOLOGY_CHANGED event). It 
> looks like instance 2 is reading the job just while instance 1 is still 
> in the process of writing the job, or at least the binary, 
> resulting in the following exception:
> 04.03.2014 06:55:39.667 *WARN* [Apache Sling Job Background Loader] 
> org.apache.sling.event.impl.jobs.JobManagerImpl Unable to read job from 
> /var/eventing/jobs/assigned/e4337f8f-47d2-41df-b3ab-0d40b1b2acd4/slingevent:eventadmin/2014/3/3/8/45/cq.wcm.msm.job.pageEvent_9718d7db-85b4-4930-a2ba-11a80d772970_172
> java.lang.Exception: Unable to deserialize property 'pageEvent'
>         at 
> org.apache.sling.event.impl.support.ResourceHelper.cloneValueMap(ResourceHelper.java:213)
>         at 
> org.apache.sling.event.impl.jobs.JobManagerImpl.readJob(JobManagerImpl.java:538)
>         at 
> org.apache.sling.event.impl.jobs.BackgroundLoader.loadJobInTheBackground(BackgroundLoader.java:318)
>         at 
> org.apache.sling.event.impl.jobs.BackgroundLoader.loadJobsInTheBackground(BackgroundLoader.java:294)
>         at 
> org.apache.sling.event.impl.jobs.BackgroundLoader.run(BackgroundLoader.java:203)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.EOFException: null
>         at 
> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2280)
>         at 
> java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2749)
>         at 
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:779)
>         at java.io.ObjectInputStream.<init>(ObjectInputStream.java:279)
>         at 
> org.apache.sling.event.impl.support.ResourceHelper.cloneValueMap(ResourceHelper.java:208)
>         ... 5 common frames omitted
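
The 'Caused by' part of the trace above is consistent with the reader seeing a binary that has no content yet: {{java.io.ObjectInputStream}} reads the stream header in its constructor and throws an {{EOFException}} if the stream ends first. A minimal stand-alone sketch of that behaviour, using plain byte arrays instead of the repository (the class and method names are made up for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

/** Hypothetical demo: deserializing a binary that is empty vs. one that is complete. */
public class TruncatedBinaryDemo {

    /** Serializes a value the same way a binary job property might be written. */
    static byte[] serialize(Object value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    /** Tries to deserialize and reports what happened. */
    static String tryRead(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return "read " + in.readObject();
        } catch (EOFException e) {
            // stream ended before the header or the object was complete
            return "EOFException";
        } catch (IOException | ClassNotFoundException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tryRead(new byte[0]));            // binary not yet written
        System.out.println(tryRead(serialize("pageEvent"))); // binary fully written
    }
}
```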



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
