Scott Yuan created OAK-11817:
--------------------------------

             Summary: Add configurable strict blob verification to RemoteBlobProcessor to prevent missing blob files in Cold Standby
                 Key: OAK-11817
                 URL: https://issues.apache.org/jira/browse/OAK-11817
             Project: Jackrabbit Oak
          Issue Type: Bug
          Components: segment-tar
    Affects Versions: 1.22.22
         Environment: RHEL 9 + JDK 11+ Apache Sling 1.22
            Reporter: Scott Yuan

With a properly configured TarMK cold standby (e.g., Adobe Experience Manager 6.5.23) that uses Apache Jackrabbit Oak segment-tar with an external BlobStore, the cold standby occasionally ends up with {*}random missing blobs under heavy load{*}.

After investigating the Apache Jackrabbit Oak source code, it appears that the current logic in _RemoteBlobProcessor.java_ assumes that if a _SegmentBlob_ has a _blobId_ and _blob.getReference()_ is not null, then the blob is physically present and readable.

However, in real-world scenarios, especially with eventual or out-of-band blob synchronization (e.g., via rsync), this assumption can be incorrect. The blob may be:
 * Not yet copied
 * Deleted by GC
 * Corrupted or unreadable

This leads to runtime errors when the standby node tries to read blobs that were assumed to be present but are missing.

+*Proposal:*+

Introduce a new *OSGi configuration property* in _StandbyStoreService_ called _strictBlobVerify_.

When enabled:
 * {{RemoteBlobProcessor}} will *attempt to open and read a few bytes* from the blob to verify that it is {*}physically present and readable{*}.
 * If the check fails, the blob is {*}re-fetched from the primary node{*}.

This adds a safeguard against false positives from a check that only tests whether the reference exists, and makes Cold Standby more robust in environments with non-instantaneous blob synchronization. It also lets administrators toggle strict blob verification depending on their setup (e.g., dev vs. production).

+*Implementation Plan:*+
 # Add _strictBlobVerify_ to _StandbyStoreServiceConfiguration_
 # Read this flag in _StandbyStoreService_
 # Pass it to _RemoteBlobProcessor_
 # In {_}RemoteBlobProcessor.shouldFetchBinary(){_}, verify that the referenced blob is readable when _strictBlobVerify_ is enabled.

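To illustrate step 4, here is a minimal, self-contained sketch of the decision logic. It is not the actual Oak code: {{BlobHandle}} is a hypothetical stand-in for the blob type, {{isReadable}} is an illustrative helper name, the 16-byte probe size is arbitrary, and the non-strict branch mirrors the current behaviour only as described above.

```java
import java.io.IOException;
import java.io.InputStream;

public class StrictBlobVerifySketch {

    /** Hypothetical stand-in for the blob type; NOT Oak's actual SegmentBlob API. */
    public interface BlobHandle {
        /** Returns null when the blob store reports no reference for the blob. */
        String getReference();

        /** Opens the blob content; expected to throw when the file is missing or unreadable. */
        InputStream openStream() throws IOException;
    }

    /** Returns true when the blob can be opened and a small probe read completes without error. */
    public static boolean isReadable(BlobHandle blob) {
        try (InputStream in = blob.openStream()) {
            byte[] probe = new byte[16]; // arbitrary probe size, an assumption of this sketch
            in.read(probe);              // -1 (empty blob) still counts as readable
            return true;
        } catch (IOException | RuntimeException e) {
            // Missing file, I/O failure, or a null stream: treat the blob as not readable.
            return false;
        }
    }

    /** Decides whether the standby should fetch the binary from the primary. */
    public static boolean shouldFetchBinary(BlobHandle blob, boolean strictBlobVerify) {
        if (blob.getReference() == null) {
            return true;  // no reference at all: fetch from the primary
        }
        if (!strictBlobVerify) {
            return false; // behaviour described in this issue: trust the reference
        }
        return !isReadable(blob); // strict mode: re-fetch when the probe read fails
    }
}
```

With the flag disabled the sketch behaves as the reference-only check does today, so existing setups would see no change; the extra open and read per blob is only paid when strict mode is switched on.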
+*Benefits:*+
 * Improves reliability of Cold Standby in environments with delayed or out-of-band blob sync
 * Prevents silent corruption or missing blobs
 * Configurable to preserve existing behavior for users who don’t need it

Example Configuration:
{noformat}
# org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService.cfg
strictBlobVerify=true
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)