[ 
https://issues.apache.org/jira/browse/OAK-11817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Yuan updated OAK-11817:
-----------------------------
    Description: 
In a properly configured TarMK cold standby setup (Apache Jackrabbit Oak 
segment-tar with an external BlobStore), the standby occasionally ends up with 
{*}randomly missing blobs under heavy load{*}. After investigating the Oak source 
code, it appears that the current logic in _RemoteBlobProcessor.java_ assumes 
that if a _SegmentBlob_ has a _blobId_ and _blob.getReference()_ is not null, 
then the blob is physically present and readable.

However, in real-world scenarios, especially with eventual or out-of-band blob 
synchronization (e.g., via rsync), this assumption can be incorrect. The blob 
may be:
 * Not yet copied
 * Deleted by GC
 * Corrupted or unreadable

This leads to runtime errors when the standby node tries to read missing blobs 
that were assumed present.

+*Proposal:*+
Introduce a new *OSGi configuration property* in _StandbyStoreService_ called 
_verifyBlobFileOnSync_. When enabled:
 * {{RemoteBlobProcessor}} will *attempt to open and read a few bytes* from the 
blob to verify it is {*}physically present and readable{*}.

 * If the check fails, the blob is {*}re-fetched from the primary node{*}.

This adds a safeguard against false positives from checking only that a 
reference exists, and makes Cold Standby more robust in environments with 
non-instantaneous blob synchronization. Administrators can toggle strict blob 
verification depending on their setup (e.g. dev vs production).
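
The readability probe described in the proposal could look roughly like the 
following. This is a minimal sketch, not the actual patch: the class 
_BlobReadabilityCheck_, the _StreamSource_ interface and the probe size are 
hypothetical names chosen for illustration. In Oak the stream would presumably 
come from the blob's _getNewStream()_.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not part of Oak): probes whether a blob can be opened and read.
final class BlobReadabilityCheck {

    // Assumed probe size; "a few bytes" is all the proposal asks for.
    private static final int PROBE_BYTES = 16;

    // Hypothetical abstraction over "open a fresh stream for this blob".
    interface StreamSource {
        InputStream open() throws IOException;
    }

    // Returns true only if the stream can be opened and a read attempt succeeds.
    static boolean isReadable(StreamSource source) {
        try (InputStream in = source.open()) {
            if (in == null) {
                return false; // no stream available: treat as missing
            }
            // A return value of -1 (empty blob) still counts as readable.
            in.read(new byte[PROBE_BYTES]);
            return true;
        } catch (IOException | RuntimeException e) {
            // Missing, deleted or unreadable file: signal that a re-fetch is needed.
            return false;
        }
    }
}
```

A failed probe covers all three cases listed above (not yet copied, deleted by 
GC, unreadable), since each surfaces as a failure to open or read the stream. 
A file whose content is corrupted beyond the first few bytes would not be 
detected by such a probe.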

+*Implementation Plan:*+
 # Add _verifyBlobFileOnSync_ to _StandbyStoreServiceConfiguration_
 # Read this flag in _StandbyStoreService_
 # Pass it to _RemoteBlobProcessor_
 # In {_}RemoteBlobProcessor.shouldFetchBinary(){_}, verify that the referenced 
blob is readable when _verifyBlobFileOnSync_ is enabled.
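
The decision in step 4 could be sketched as below. This is an illustration 
under assumptions, not the real _RemoteBlobProcessor_ code: the class 
_FetchDecision_, the method signature and the _readable_ callback are 
hypothetical, and the existing behavior is modeled only as far as this 
description states it (a non-null reference is trusted).

```java
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for the decision made in shouldFetchBinary().
final class FetchDecision {

    static boolean shouldFetch(String blobId,
                               String reference,
                               boolean verifyBlobFileOnSync,
                               BooleanSupplier readable) {
        if (blobId == null) {
            return false; // no external blob ID: nothing to fetch
        }
        if (reference == null) {
            return true; // no local reference: fetch from the primary
        }
        if (!verifyBlobFileOnSync) {
            return false; // flag off: trust the reference (existing behavior)
        }
        // Flag on: fetch again only if the local blob cannot be read.
        return !readable.getAsBoolean();
    }
}
```

With the flag disabled the readability callback is never invoked, so the extra 
I/O is only paid by deployments that opt in.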

+*Benefits:*+
 * Improves reliability of Cold Standby in environments with delayed or 
out-of-band blob sync

 * Prevents silent corruption or missing blobs

 * Configurable to preserve existing behavior for users who don’t need it

 

Example Configuration:
{noformat}
# org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService.cfg
verifyBlobFileOnSync=true
{noformat}
 

Please feel free to assign this to me; I will provide a PR shortly.

  was:
With a properly configured TarMK cold standby with Jackrabbit Oak based 
solution utilizing Apache Jackrabbit Oak segment-tar cold standby with an 
external BlobStore, the cold standby occasionally creates {*}random missing 
blobs under heavy load{*}. After investigating the Apache Jackrabbit Oak source 
code, it appears that the current logic in _RemoteBlobProcessor.java_ assumes 
that if a _SegmentBlob_ has a _blobId_ and _blob.getReference()_ is not null, 
then the blob is physically present and readable.

However, in real-world scenarios, especially with eventual or out-of-band blob 
synchronization (e.g., via rsync), this assumption can be incorrect. The blob 
may be:
 * Not yet copied
 * Deleted by GC
 * Corrupted or unreadable

This leads to runtime errors when the standby node tries to read missing blobs 
that were assumed present.

+*Proposal:*+
Introduce a new *OSGi configuration property* in _StandbyStoreService_ called 
_strictBlobVerify_ When enabled:
 * {{RemoteBlobProcessor}} will *attempt to open and read a few bytes* from the 
blob to verify it is {*}physically present and readable{*}.

 * If the check fails, the blob is {*}re-fetched from the primary node{*}.

This adds a safeguard against false positives from the reference existing check 
only approach and ensures Cold Standby is more robust in environments with 
non-instantaneous blob synchronization.  It allows administrators to toggle 
strict blob verification behavior depending on their setup (e.g. dev vs 
production).

+*Implementation Plan:*+
 # Add _verifyBlobFileOnSync_ to _StandbyStoreServiceConfiguration_
 # Read this flag in _StandbyStoreService_
 # Pass it to _RemoteBlobProcessor_
 # In {_}RemoteBlobProcessor.shouldFetchBinary(){_}, verify if reference 
readable if _verifyBlobFileOnSync_ has been specified.

+*Benefits:*+
 * Improves reliability of Cold Standby in environments with delayed or 
out-of-band blob sync

 * Prevents silent corruption or missing blobs

 * Configurable to preserve existing behavior for users who don’t need it

 

Example Configuration:
{noformat}
# 
org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService.cfg
verifyBlobFileOnSync=true{noformat}
 

Please feel free to assign this to me as I would be providing a PR shortly.


> Add configurable strict blob verification to RemoteBlobProcessor to prevent 
> missing blob files in Cold Standby
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: OAK-11817
>                 URL: https://issues.apache.org/jira/browse/OAK-11817
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: segment-tar
>    Affects Versions: 1.22.22
>         Environment: RHEL 9 + JDK 11+ Apache Sling 1.22
>            Reporter: Scott Yuan
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)