[jira] [Commented] (SLING-3432) job deserialization issue in a cluster when reading while/before job is fully written

Stefan Egli (JIRA) Wed, 05 Mar 2014 02:36:07 -0800

    [ 
https://issues.apache.org/jira/browse/SLING-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920726#comment-13920726
 ]


Stefan Egli commented on SLING-3432:
------------------------------------

Here's the likely order of events that caused this, this time:

 * instance 1 (9718d7db) is leader in a 4 node cluster
 * for some reason instance 1 runs into a high CPU load situation and cannot 
issue heartbeats 'in time' (ie within >30sec), or the repository 
synchronization is slow to propagate those heartbeats to the rest of the cluster
 * after the other 3 nodes don't "hear" from instance 1 for over 60sec anymore, 
they decide it has died and establish a new discovery view (marking instance 3 
(2ccd5d38) the new leader)
 * instance 3 - the new leader - now runs the job handler's BackgroundLoader, 
realizes that instance 1 is not in the cluster anymore (from its point of view) 
and reassigns above job to instance 2 (e4337f8f) along a bunch of other jobs it 
reassigns
 * instance 2 at some point - also via the BackgroundLoader - grabs the list of 
jobs it has assigned
 * in this test case the list of jobs is rather large, which is results in the 
following WARN:
{code}
04.03.2014 06:55:39.343 *WARN* [Apache Sling Job Background Loader] 
org.apache.jackrabbit.oak.plugins.index.property.strategy.ContentMirrorStoreStrategy
 Traversed 1000 nodes using index jcr:primaryType with filter 
Filter(query=select [jcr:path], [jcr:score], * from [slingevent:Job] as a where 
[event.job.application] = 'e4337f8f-47d2-41df-b3ab-0d40b1b2acd4' and 
[slingevent:created] < cast('2014-03-04T06:55:09.143-05:00' as date) order by 
[slingevent:created] /* xpath: 
//element(*,slingevent:Job)[@event.job.application = 
'e4337f8f-47d2-41df-b3ab-0d40b1b2acd4' and @slingevent:created < 
xs:dateTime('2014-03-04T06:55:09.143-05:00')] order by @slingevent:created 
ascending */, path=*, 
property=[event.job.application=e4337f8f-47d2-41df-b3ab-0d40b1b2acd4, 
slingevent:created=..2014-03-04T06:55:09.143-05:00)])
{code}
 * which is why instance 2 takes a while when reading its jobs until it reaches 
above job 
(slingevent:eventadmin/2014/3/3/8/45/cq.wcm.msm.job.pageEvent_9718d7db-85b4-4930-a2ba-11a80d772970_172)
 * meanwhile, instance 1 thinks that it is all alone. It gets told so via the 
discovery API via a topology changed event which only contains itself - plus it 
is the leader (someone has to be)
 * as instance 1 got this topology changed (informing it about the other 3 
leaving the cluster), it has to go through the BackgroundLoader to verify 
whether the jobs are all correctly assigned
 * this races with instance 3 reassigning jobs away from instance 1
 * at some point instance 2 is reading mentioned job which instance 1 thinks: 
oh, that's assigned to a crashed instance, let me reassign it to myself. This 
reassignment creates the job under instance 1's path and deletes it under 
instance 2's path. At which point - when instance 2 wants to read the binary - 
fails with a NullPointerException

> job deserialization issue in a cluster when reading while/before job is fully 
> written
> -------------------------------------------------------------------------------------
>
>                 Key: SLING-3432
>                 URL: https://issues.apache.org/jira/browse/SLING-3432
>             Project: Sling
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: Event 3.3.4
>            Reporter: Stefan Egli
>
> There is a race condition between two instances in a cluster (eg oak or crx): 
> Instance 1 is writing a job with a binary property, instance 2 is reading the 
> job (likely triggered by discovery sending it a topologychangedevent). It 
> looks like instance 2 is reading the job just about while instance 1 is still 
> in the process or completely writing the job, or at least the binary. 
> Resulting in the following exception:
> 04.03.2014 06:55:39.667 *WARN* [Apache Sling Job Background Loader] 
> org.apache.sling.event.impl.jobs.JobManagerImpl Unable to read job from 
> /var/eventing/jobs/assigned/e4337f8f-47d2-41df-b3ab-0d40b1b2acd4/slingevent:eventadmin/2014/3/3/8/45/cq.wcm.msm.job.pageEvent_9718d7db-85b4-4930-a2ba-11a80d772970_172
> java.lang.Exception: Unable to deserialize property 'pageEvent'
>         at 
> org.apache.sling.event.impl.support.ResourceHelper.cloneValueMap(ResourceHelper.java:213)
>         at 
> org.apache.sling.event.impl.jobs.JobManagerImpl.readJob(JobManagerImpl.java:538)
>         at 
> org.apache.sling.event.impl.jobs.BackgroundLoader.loadJobInTheBackground(BackgroundLoader.java:318)
>         at 
> org.apache.sling.event.impl.jobs.BackgroundLoader.loadJobsInTheBackground(BackgroundLoader.java:294)
>         at 
> org.apache.sling.event.impl.jobs.BackgroundLoader.run(BackgroundLoader.java:203)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.EOFException: null
>         at 
> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2280)
>         at 
> java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2749)
>         at 
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:779)
>         at java.io.ObjectInputStream.<init>(ObjectInputStream.java:279)
>         at 
> org.apache.sling.event.impl.support.ResourceHelper.cloneValueMap(ResourceHelper.java:208)
>         ... 5 common frames omitted



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SLING-3432) job deserialization issue in a cluster when reading while/before job is fully written

Reply via email to