[ 
https://issues.apache.org/jira/browse/OAK-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Dulceanu updated OAK-6678:
---------------------------------
    Attachment: OAK-6678.patch

The behaviour described above was caused by the TarMK flush thread not kicking in 
before the actual sync between the standby and the primary started. Here are the 
key improvements in the attached patch:
* make the client more resilient to errors by only logging the error when the 
persisted remote head is not (yet) available
* make the server more resilient to the same situation by employing a "read 
persisted head with retry" logic in {{DefaultStandbyHeadReader}}, as is already 
available for reading segments
* add a unit test in {{DefaultStandbyHeadReaderTest}} to verify the "read persisted 
head with retry" logic
* add {{DataStoreTestBase#testResilientSync}}, in which I tried to reproduce 
the situation in the description of the issue. With the above improvements, the 
sync can finally happen (in the second run) and overall the cold standby proves to 
be more resilient.
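
To illustrate the "read persisted head with retry" idea from the list above, here is a minimal sketch. It is not the code from the patch: the class {{PersistedHeadRetry}}, the method {{readPersistedHeadWithRetry}}, the parameters {{maxAttempts}} and {{delayMillis}}, and the use of a plain {{String}} for the head record id are all assumptions made for illustration only.

```java
import java.util.function.Supplier;

// Hypothetical helper, NOT the actual DefaultStandbyHeadReader implementation.
final class PersistedHeadRetry {

    private PersistedHeadRetry() {
    }

    /**
     * Polls the supplier until it returns a non-null head or the attempts run out.
     *
     * @param readPersistedHead assumed to return null while nothing has been flushed yet
     * @param maxAttempts       assumed upper bound on the number of reads
     * @param delayMillis       assumed pause between two consecutive reads
     * @return the persisted head, or null if it never became available
     */
    static String readPersistedHeadWithRetry(Supplier<String> readPersistedHead,
                                             int maxAttempts,
                                             long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String head = readPersistedHead.get();
            if (head != null) {
                return head;
            }
            // Only sleep if another attempt will follow.
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and give up early.
                    Thread.currentThread().interrupt();
                    return null;
                }
            }
        }
        // The caller must handle null instead of dereferencing it.
        return null;
    }
}
```

In this sketch a caller that gets {{null}} back would report "head not (yet) available" rather than run into the {{NullPointerException}} shown in the issue description.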

[~frm], could you take a look at the patch, please?

> Syncing big blobs fails since StandbyServer sends persisted head
> ----------------------------------------------------------------
>
>                 Key: OAK-6678
>                 URL: https://issues.apache.org/jira/browse/OAK-6678
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: segment-tar, tarmk-standby
>            Reporter: Andrei Dulceanu
>            Assignee: Andrei Dulceanu
>              Labels: cold-standby, resilience
>             Fix For: 1.7.8
>
>         Attachments: OAK-6678.patch
>
>
> With changes for OAK-6653 in place, 
> {{ExternalPrivateStoreIT#testSyncBigBlob}} and sometimes 
> {{ExternalSharedStoreIT#testSyncBigBlob}} are failing on CI:
> {noformat}
> org.apache.jackrabbit.oak.segment.standby.ExternalSharedStoreIT
> testSyncBigBlob(org.apache.jackrabbit.oak.segment.standby.ExternalSharedStoreIT)
>   Time elapsed: 96.82 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<{ root = { ... } }> but was:<{ root : { } 
> }>
> ...
> testSyncBigBlob(org.apache.jackrabbit.oak.segment.standby.ExternalPrivateStoreIT)
>   Time elapsed: 95.254 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<{ root = { ... } }> but was:<{ root : { } 
> }>
> {noformat}
> Partial stacktrace:
> {noformat}
> 14:09:08.355 DEBUG [main] StandbyServer.java:242            Binding was 
> successful
> 14:09:08.358 DEBUG [standby-1] GetHeadRequestEncoder.java:33 Sending request 
> from client Bar for current head
> 14:09:08.359 DEBUG [primary-1] ClientFilterHandler.java:53  Client 
> /127.0.0.1:52988 is allowed
> 14:09:08.360 DEBUG [primary-1] RequestDecoder.java:42       Parsed 'get head' 
> message
> 14:09:08.360 DEBUG [primary-1] CommunicationObserver.java:79 Message 'get 
> head' received from client Bar
> 14:09:08.362 DEBUG [primary-1] GetHeadRequestHandler.java:43 Reading head for 
> client Bar
> 14:09:08.363 WARN  [primary-1] ExceptionHandler.java:31     Exception caught 
> on the server
> java.lang.NullPointerException: null
>       at 
> org.apache.jackrabbit.oak.segment.standby.server.DefaultStandbyHeadReader.readHeadRecordId(DefaultStandbyHeadReader.java:32)
>  ~[oak-segment-tar-1.8-SNAPSHOT.jar:1.8-SNAPSHOT]
>       at 
> org.apache.jackrabbit.oak.segment.standby.server.GetHeadRequestHandler.channelRead0(GetHeadRequestHandler.java:45)
>  ~[oak-segment-tar-1.8-SNAPSHOT.jar:1.8-SNAPSHOT]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
