[ https://issues.apache.org/jira/browse/OAK-11817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Scott Yuan updated OAK-11817:
-----------------------------
    Description:
In a Jackrabbit Oak based solution that uses a properly configured Apache Jackrabbit Oak segment-tar (TarMK) cold standby with an external BlobStore, the cold standby occasionally ends up with {*}random missing blobs under heavy load{*}. After investigating the Apache Jackrabbit Oak source code, it appears that the current logic in _RemoteBlobProcessor.java_ assumes that if a _SegmentBlob_ has a _blobId_ and _blob.getReference()_ is not null, then the blob is physically present and readable.

However, in real-world scenarios, especially with eventual or out-of-band blob synchronization (e.g., via rsync), this assumption can be incorrect. The blob may be:
 * Not yet copied
 * Deleted by GC
 * Corrupted or unreadable

This leads to runtime errors when the standby node tries to read missing blobs that were assumed present.

+*Proposal:*+

Introduce a new *OSGi configuration property* in _StandbyStoreService_ called _verifyBlobFileOnSync_. When enabled:
 * {{RemoteBlobProcessor}} will *attempt to open and read a few bytes* from the blob to verify it is {*}physically present and readable{*}.
 * If the check fails, the blob is {*}re-fetched from the primary node{*}.

This adds a safeguard against false positives from the current check, which only tests that a reference exists, and makes Cold Standby more robust in environments with non-instantaneous blob synchronization. It allows administrators to toggle strict blob verification behavior depending on their setup (e.g. dev vs production).

+*Implementation Plan:*+
 # Add _verifyBlobFileOnSync_ to _StandbyStoreServiceConfiguration_
 # Read this flag in _StandbyStoreService_
 # Pass it to _RemoteBlobProcessor_
 # In {_}RemoteBlobProcessor.shouldFetchBinary(){_}, verify that the referenced blob is readable when _verifyBlobFileOnSync_ is enabled.
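
The check outlined in the implementation plan could look roughly like the Java sketch below. It is an illustration only, not the actual Oak code or the forthcoming PR: {{BlobLike}}, {{isReadable}} and {{PROBE_BYTES}} are hypothetical names, {{BlobLike}} stands in for the small part of the blob API the check would need, and the real {{shouldFetchBinary()}} may consider more cases than the single reference check modelled here.

```java
import java.io.IOException;
import java.io.InputStream;

public class BlobReadabilityCheck {

    /** Hypothetical stand-in for the blob type; not the real Oak interface. */
    public interface BlobLike {
        String getReference();
        InputStream getNewStream() throws IOException;
    }

    /** Assumed probe size; "a few bytes" is all the proposal asks for. */
    private static final int PROBE_BYTES = 8;

    /** Returns true if the blob can be opened and a short read succeeds. */
    public static boolean isReadable(BlobLike blob) {
        try (InputStream in = blob.getNewStream()) {
            if (in == null) {
                return false;
            }
            // A return value of -1 (empty blob) still means the read worked.
            in.read(new byte[PROBE_BYTES]);
            return true;
        } catch (IOException | RuntimeException e) {
            // Missing, deleted or corrupted blob: treat as not present.
            return false;
        }
    }

    /**
     * Simplified decision: fetch when there is no reference, or when strict
     * verification is enabled and the blob cannot actually be read.
     */
    public static boolean shouldFetchBinary(BlobLike blob, boolean verifyBlobFileOnSync) {
        if (blob.getReference() == null) {
            return true;
        }
        if (!verifyBlobFileOnSync) {
            return false; // existing behavior: trust the reference
        }
        return !isReadable(blob);
    }
}
```

With the flag disabled the sketch keeps the current behavior of trusting a non-null reference, so the extra read is only paid for by setups that opt in.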

+*Benefits:*+
 * Improves reliability of Cold Standby in environments with delayed or out-of-band blob sync
 * Prevents silent corruption or missing blobs
 * Configurable to preserve existing behavior for users who don’t need it

Example Configuration:
{noformat}
# org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService.cfg
verifyBlobFileOnSync=true
{noformat}

Please feel free to assign this to me, as I will be providing a PR shortly.

  was:
In a Jackrabbit Oak based solution that uses a properly configured Apache Jackrabbit Oak segment-tar (TarMK) cold standby with an external BlobStore, the cold standby occasionally ends up with {*}random missing blobs under heavy load{*}. After investigating the Apache Jackrabbit Oak source code, it appears that the current logic in _RemoteBlobProcessor.java_ assumes that if a _SegmentBlob_ has a _blobId_ and _blob.getReference()_ is not null, then the blob is physically present and readable.

However, in real-world scenarios, especially with eventual or out-of-band blob synchronization (e.g., via rsync), this assumption can be incorrect. The blob may be:
 * Not yet copied
 * Deleted by GC
 * Corrupted or unreadable

This leads to runtime errors when the standby node tries to read missing blobs that were assumed present.

+*Proposal:*+

Introduce a new *OSGi configuration property* in _StandbyStoreService_ called _strictBlobVerify_. When enabled:
 * {{RemoteBlobProcessor}} will *attempt to open and read a few bytes* from the blob to verify it is {*}physically present and readable{*}.
 * If the check fails, the blob is {*}re-fetched from the primary node{*}.

This adds a safeguard against false positives from the current check, which only tests that a reference exists, and makes Cold Standby more robust in environments with non-instantaneous blob synchronization. It allows administrators to toggle strict blob verification behavior depending on their setup (e.g. dev vs production).

+*Implementation Plan:*+
 # Add _verifyBlobFileOnSync_ to _StandbyStoreServiceConfiguration_
 # Read this flag in _StandbyStoreService_
 # Pass it to _RemoteBlobProcessor_
 # In {_}RemoteBlobProcessor.shouldFetchBinary(){_}, verify that the referenced blob is readable when _verifyBlobFileOnSync_ is enabled.

+*Benefits:*+
 * Improves reliability of Cold Standby in environments with delayed or out-of-band blob sync
 * Prevents silent corruption or missing blobs
 * Configurable to preserve existing behavior for users who don’t need it

Example Configuration:
{noformat}
# org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService.cfg
verifyBlobFileOnSync=true
{noformat}

Please feel free to assign this to me, as I will be providing a PR shortly.


> Add configurable strict blob verification to RemoteBlobProcessor to prevent missing blob files in Cold Standby
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: OAK-11817
>                 URL: https://issues.apache.org/jira/browse/OAK-11817
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: segment-tar
>    Affects Versions: 1.22.22
>         Environment: RHEL 9 + JDK 11+ Apache Sling 1.22
>            Reporter: Scott Yuan
>            Priority: Major
>
> In a Jackrabbit Oak based solution that uses a properly configured Apache
> Jackrabbit Oak segment-tar (TarMK) cold standby with an external BlobStore,
> the cold standby occasionally ends up with {*}random missing blobs under
> heavy load{*}. After investigating the Apache Jackrabbit Oak source code, it
> appears that the current logic in _RemoteBlobProcessor.java_ assumes that if
> a _SegmentBlob_ has a _blobId_ and _blob.getReference()_ is not null, then
> the blob is physically present and readable.
> However, in real-world scenarios, especially with eventual or out-of-band
> blob synchronization (e.g., via rsync), this assumption can be incorrect.
> The blob may be:
>  * Not yet copied
>  * Deleted by GC
>  * Corrupted or unreadable
> This leads to runtime errors when the standby node tries to read missing
> blobs that were assumed present.
> +*Proposal:*+
> Introduce a new *OSGi configuration property* in _StandbyStoreService_ called
> _verifyBlobFileOnSync_. When enabled:
>  * {{RemoteBlobProcessor}} will *attempt to open and read a few bytes* from
> the blob to verify it is {*}physically present and readable{*}.
>  * If the check fails, the blob is {*}re-fetched from the primary node{*}.
> This adds a safeguard against false positives from the current check, which
> only tests that a reference exists, and makes Cold Standby more robust in
> environments with non-instantaneous blob synchronization. It allows
> administrators to toggle strict blob verification behavior depending on
> their setup (e.g. dev vs production).
> +*Implementation Plan:*+
>  # Add _verifyBlobFileOnSync_ to _StandbyStoreServiceConfiguration_
>  # Read this flag in _StandbyStoreService_
>  # Pass it to _RemoteBlobProcessor_
>  # In {_}RemoteBlobProcessor.shouldFetchBinary(){_}, verify that the
> referenced blob is readable when _verifyBlobFileOnSync_ is enabled.
> +*Benefits:*+
>  * Improves reliability of Cold Standby in environments with delayed or
> out-of-band blob sync
>  * Prevents silent corruption or missing blobs
>  * Configurable to preserve existing behavior for users who don’t need it
>
> Example Configuration:
> {noformat}
> # org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService.cfg
> verifyBlobFileOnSync=true
> {noformat}
>
> Please feel free to assign this to me, as I will be providing a PR shortly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)