Davis-Zhang-Onehouse opened a new pull request, #18817:
URL: https://github.com/apache/hudi/pull/18817

   ### Describe the issue this Pull Request addresses
   
   Follow-up to #18814, which introduced `FSUtils.normalizeBasePathForLocking` 
for the implicit-key lock providers. `StorageBasedLockProvider` derives 
`lockFilePath` directly from the raw \`hoodie.base.path\` config value:
   
   \`\`\`java
   this.basePath = config.getHudiTableBasePath();
   String lockFolderPath = StorageLockClient.getLockFolderPath(basePath);  // 
new StoragePath(basePath, ".locks")
   this.lockFilePath = new StoragePath(lockFolderPath, 
DEFAULT_TABLE_LOCK_FILE_NAME).toString();
   \`\`\`
   
   \`StoragePath\` absorbs trailing-slash drift, so \`s3://b/t\` and 
\`s3://b/t/\` already produce the same lock file. But two other classes of 
benign basePath drift do **not** get absorbed and route the writers to 
different lock files:
   
   | Drift | Same lock file today? | Reason |
   |---|---|---|
   | Trailing slash (\`s3://b/t\` vs \`s3://b/t/\`) | ✅ yes | \`StoragePath\` 
normalizes |
   | Scheme case (\`s3a://b/t\` vs \`s3://b/t\`) | ❌ **no** | Scheme is 
preserved verbatim through \`StoragePath\` |
   | Surrounding whitespace | ❌ **no** | Whitespace becomes part of the URI |
   
   Two writers that disagree on those benign formatting details acquire 
**different** lock files and lose mutual exclusion. Same class of bug as 
#18814, lower severity because the divergence is a visible second lock file 
rather than a silent hash split, but still a correctness issue.
   
   ### Summary and Changelog
   
   Route the raw basePath through \`FSUtils.normalizeBasePathForLocking\` (the 
helper introduced in #18814) before building \`lockFilePath\`. Using the same 
canonicalization API as the implicit-key LPs keeps the contract for "what 
counts as the same Hudi table" consistent across all lock providers.
   
   Also rename the stored field \`basePath\` → \`normalizedHudiTableBasePath\` 
so the name reflects what's actually stored.
   
   ### Impact
   
   - **Behavior**: \`StorageBasedLockProvider\` will produce the same 
\`lockFilePath\` for a Hudi table regardless of trailing-slash, surrounding 
whitespace, or \`s3a\`/\`s3\` scheme variations. No public API signature change.
   - **Performance**: Negligible — one extra \`trim()\` and a short scheme 
rewrite per LP construction.
   - **Compatibility (rollout caveat)**: Lock files keyed under the previous 
\`s3a://...\` form effectively orphan at deploy time, since the new code looks 
at \`s3://...\` for the same logical table. Operators upgrading should 
coordinate the deploy across all writers that share a Hudi table that uses this 
LP, or briefly quiesce writers while rolling out. Stale lock files can be 
cleaned up manually after the cutover. No rollout impact for callers that 
already pass \`s3://...\`.
   
   ### Risk Level
   
   low
   
   The change only affects how raw config strings are canonicalized before 
being fed to existing path-derivation code. Existing trailing-slash callers 
will see no change in derived lockFilePath (\`StoragePath\` already absorbed 
that). The only callers whose lock file moves are those that supplied an 
\`s3a://...\` basePath or surrounding whitespace.
   
   Mitigation:
   - Existing TestStorageBasedLockProvider suite continues to pass.
   - New tests cover the invariant directly: scheme drift, whitespace, and 
trailing-slash variants all produce the same lockFilePath.
   
   ### Documentation Update
   
   none — no user-facing config or website change.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable
   
   **Note**: This PR is stacked on top of #18814 — its diff currently includes 
the parent PR's changes until #18814 merges. The 
StorageBasedLockProvider-specific commit is the last one in the branch.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to