kerneltime commented on PR #10062:
URL: https://github.com/apache/ozone/pull/10062#issuecomment-4598872770

   ## Review
   
   Thanks for landing the upgrade gate ahead of the parts-table-split write 
path — gating first is the right sequencing. The layout-feature wiring looks 
correct (no upgrade action / MLV constant needed for a behavior-only feature, 
and `TestOMVersionManager`'s contiguity check still passes), and the gate's 
pre-finalization determinism holds: MLV is advanced only by the Ratis-logged 
`OMFinalizeUpgradeRequest`, so `isAllowed` resolves identically across replicas 
at a given apply index.
   
   I think there's one blocker, and it's not in the gate itself — it's in the 
`CommitPart` null-check reorder that came along with it.
   
   ### Blocker — `CommitPart`: moving the `multipartKeyInfo == null` check 
above `getOmKeyInfo()` crashes the OM on the commit-after-abort path
   
   To read `multipartKeyInfo.getSchemaVersion()` safely, the `multipartKeyInfo 
== null -> NO_SUCH_MULTIPART_UPLOAD_ERROR` check was moved up to right after 
`getMultipartInfoTable().get()`, ahead of `omKeyInfo = getOmKeyInfo(...)`. But 
the `NO_SUCH` path is special: 
`S3MultipartUploadCommitPartResponse.checkAndUpdateDB()` overrides the flush to 
GC the orphaned open part and unconditionally dereferences 
`openPartKeyInfoToBeDeleted` (the `omKeyInfo` handed in by the request) — 
`S3MultipartUploadCommitPartResponse.java:88-97` (`.getUpdateID()`, 
`.getObjectID()`).
   
   - On `master`, `omKeyInfo` is loaded and null-checked (-> `KEY_NOT_FOUND`) 
**before** the `NO_SUCH` throw, so a `NO_SUCH` response always carried a 
non-null part to delete.
   - With this PR, `NO_SUCH` is thrown while `omKeyInfo` is still `null`; the 
`catch` passes that null into the response. Error responses are enqueued 
unconditionally (`RequestHandler.handleWriteRequest`), and 
`OzoneManagerDoubleBuffer` calls `checkAndUpdateDB` for them and `terminate()`s 
the OM on any `Throwable`. So the flush NPEs -> `ExitUtils.terminate` -> OM 
exits, and followers re-applying the same entry exit identically.
   
   This is exactly the scenario the GC branch exists for — a part commit 
arriving after the upload was aborted. 
`testValidateAndUpdateCacheMultipartNotFound` already sets up the precondition 
(random `uploadId` + `addKeyToOpenKeyTable`); it stays green only because it 
asserts the response status and never flushes the double buffer, so 
`checkAndUpdateDB` is never run in the unit test. It also affects legacy 
(schemaVersion 0) uploads, so it's independent of the new feature.
   
   Suggested fix: leave the `multipartKeyInfo == null` check at its original 
position (after `omKeyInfo` is resolved) and put the new gate immediately after 
it — the gate only needs `multipartKeyInfo` non-null, which the original 
ordering already guaranteed there. A regression test that actually flushes the 
double buffer (or a MiniOzoneCluster commit-after-abort case) would lock it 
down.
   
   ### Concerns
   
   - **The allowed (post-finalization) branch is never tested.** The base test 
harness mocks `OMLayoutVersionManager`, so `isAllowed(MPU_PARTS_TABLE_SPLIT)` 
is always `false` (and `MPU_PARTS_TABLE_SPLIT` appears in no test). The 
predicate `!isAllowed && schemaVersion != 0` is only ever driven from the left 
operand — a future inversion (dropped `!`, wrong feature) would still pass CI. 
Please add an `isAllowed -> true` case per request (Abort can assert full 
success; for Commit/Complete at least 
`assertNotEquals(NOT_SUPPORTED_OPERATION_PRIOR_FINALIZATION, ...)`).
   - **Unrelated change bundled in.** The `existingKeyInfo` swap to 
`getOmKeyInfoFromKeyTable(...)` in `S3MultipartUploadCompleteRequest` is 
behaviorally inert today (base impl equals the old 
`getKeyTable(...).get(dbOzoneKey)`; the FSO override only adds `setKeyName`, 
which `validateAtomicRewrite`/`validateIfMatchETag` don't read) and is 
unrelated to finalization — and line 333 still uses the raw lookup for the same 
key. Suggest dropping it here or splitting into its own JIRA.
   - **In-apply `isAllowed` read departs from the established pattern.** Every 
other layout gate — including the EC `CLUSTER_NEEDS_FINALIZATION` validators in 
these same two files — runs as a leader-only PRE_PROCESS 
`@RequestFeatureValidator`. This is the first to read `isAllowed` inside the 
replicated `validateAndUpdateCache` path, which the method's Javadoc cautions 
against. It's safe today (MLV advances synchronously via the logged finalize), 
but that relies on the finalization executor staying synchronous (it's 
pluggable). Worth considering gating on the request-stamped `LayoutVersion` 
from `preExecute`, or at least a comment documenting the determinism assumption.
   - **Dropped FSO coverage.** Rewriting 
`testValidateAndUpdateCacheKeyNotFound` to initiate a real MPU removes the FSO 
`DIRECTORY_NOT_FOUND` branch the old assertion covered; that path is now 
unasserted in the commit-part suite. Could be restored as a small separate test.
   
   ### Probably for the write-side PR (flagging so they aren't silent 
regressions later)
   
   Not reachable until a sibling PR writes `schemaVersion != 0`, so not 
blockers here:
   
   - `KeyManagerImpl.listParts` reads the inline `getPartKeyInfoMap()` (empty 
for split entries) -> would return 200 with zero parts.
   - The background `S3ExpiredMultipartUploadsAbortRequest` isn't gated and 
derives quota from the empty inline map -> on a downgraded cluster it would 
delete the row releasing 0 quota while user-facing Abort is gated off.
   - `NOT_SUPPORTED_OPERATION_PRIOR_FINALIZATION` has no case in 
`S3ErrorTable.translateResultCode` -> falls to `default` -> HTTP 500 to S3 
clients.
   
   ### Nits
   
   - Complete gate message reads "...for commit part request." — should be the 
complete path.
   - Double blank line after the gate block in `S3MultipartUploadAbortRequest`.
   - `schemaVersion` is `uint32` on the wire but narrowed to `byte` on load; 
gate tests `!= 0` while the model special-cases `== 1`. Not reachable today, 
but a multiple-of-256 value would truncate to 0 (fail-open) — worth validating 
to {0,1} at the boundary.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to