LuciferYang opened a new pull request, #64768:
URL: https://github.com/apache/doris/pull/64768

   ### What problem does this PR solve?
   
   Issue Number: close #64767
   
   Problem Summary:
   
   `PathUtils.equalsIgnoreSchemeIfOneIsS3(p1, p2)` (used by `HMSTransaction` to 
decide whether a Hive commit needs a rename) compared paths inconsistently 
across its two branches:
   
   - same scheme → full-string `equalsIgnoreCase` (trailing slash significant, 
case-insensitive);
   - cross-scheme with `s3` → normalized authority+path via `Objects.equals` 
(trailing slash stripped, case-sensitive).
   
   So the result for one URI depended on the *other* URI's scheme, and 
same-scheme comparisons wrongly ignored case (S3 keys are case-sensitive).
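
To make the inconsistency concrete, here is a minimal Java sketch reconstructed only from the two bullets above. It is not the actual Doris code: the class name `LegacyComparisonSketch`, the helper names, the exact-match check for `s3`, and the `false` result for unrelated schemes are all assumptions for illustration, and it expects well-formed `scheme://authority/path` inputs.

```java
import java.net.URI;
import java.util.Objects;

// Hypothetical reconstruction of the pre-PR behavior, based only on the PR description.
final class LegacyComparisonSketch {

    static boolean legacyEquals(String p1, String p2) {
        URI u1 = URI.create(p1);
        URI u2 = URI.create(p2);
        String s1 = u1.getScheme();
        String s2 = u2.getScheme();

        // Branch 1: same scheme -> whole-string, case-insensitive comparison.
        // A trailing slash changes the string, so it is significant here.
        if (s1 != null && s1.equalsIgnoreCase(s2)) {
            return p1.equalsIgnoreCase(p2);
        }

        // Branch 2: different schemes, one of them s3 -> compare the normalized
        // authority+path, case-sensitively, with trailing slashes stripped.
        if ("s3".equalsIgnoreCase(s1) || "s3".equalsIgnoreCase(s2)) {
            return Objects.equals(normalize(u1), normalize(u2));
        }

        // Assumption: any other scheme combination is treated as different.
        return false;
    }

    // Authority (bucket/host) followed by the raw path without trailing slashes.
    static String normalize(URI u) {
        return u.getRawAuthority() + stripTrailingSlashes(u.getRawPath());
    }

    static String stripTrailingSlashes(String path) {
        int end = path.length();
        while (end > 0 && path.charAt(end - 1) == '/') {
            end--;
        }
        return path.substring(0, end);
    }
}
```

Under these assumptions, `s3://bucket/t1` versus `s3://bucket/t1/` takes the first branch and differs on the trailing slash, while `s3://bucket/t1` versus `s3a://bucket/t1/` takes the second branch and matches, which is the asymmetry described above.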
   
   This PR unifies both branches into one rule: when the schemes are equal 
(case-insensitively, per RFC 3986 §3.1) **or** one side is `s3`, compare only 
the authority (bucket/host) and path: scheme ignored, **trailing slashes 
insignificant**, **case-sensitive** on the raw (percent-encoded) components; 
otherwise the locations are not equal. This matches the original `normalize()` 
intent and the caller's "no rename when the location is identical" comment, and 
applies the slash/case handling consistently regardless of whether the two 
schemes match.
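
The unified rule can be sketched in a few lines of Java. This is an illustration under stated assumptions, not the code in this PR: `UnifiedComparisonSketch` and `sameLocation` are hypothetical names, the `s3` check is assumed to be an exact case-insensitive match, and both inputs are assumed to be well-formed `scheme://authority/path` URIs.

```java
import java.net.URI;
import java.util.Objects;

// Hypothetical sketch of the unified rule described in the PR text.
final class UnifiedComparisonSketch {

    static boolean sameLocation(String p1, String p2) {
        URI u1 = URI.create(p1);
        URI u2 = URI.create(p2);
        String s1 = u1.getScheme();
        String s2 = u2.getScheme();

        // Schemes compare case-insensitively (RFC 3986 section 3.1).
        boolean sameScheme = s1 != null && s1.equalsIgnoreCase(s2);
        // Assumption: "one side is s3" means the scheme is exactly "s3".
        boolean oneIsS3 = "s3".equalsIgnoreCase(s1) || "s3".equalsIgnoreCase(s2);
        if (!sameScheme && !oneIsS3) {
            return false;
        }

        // Compare the raw (still percent-encoded) authority and path,
        // case-sensitively, ignoring trailing slashes on the path.
        return Objects.equals(u1.getRawAuthority(), u2.getRawAuthority())
                && stripTrailingSlashes(u1.getRawPath())
                        .equals(stripTrailingSlashes(u2.getRawPath()));
    }

    static String stripTrailingSlashes(String path) {
        int end = path.length();
        while (end > 0 && path.charAt(end - 1) == '/') {
            end--;
        }
        return path.substring(0, end);
    }
}
```

With this shape, `s3://bucket/t1` and `s3://bucket/t1/` compare equal whether the other side is `s3` or `s3a`, and `s3://bucket/Table` never equals `s3://bucket/table`, so the answer for one URI no longer depends on the scheme of the other.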
   
   Inputs that are malformed for object storage fall back to exact string 
comparison so they cannot spuriously match: opaque URIs (`s3:bucket/key`), 
scheme-with-null-authority triple-slash forms (`s3:///path`), 
authority-with-null-scheme network-path references (`//bucket/path`), and parse 
failures. Percent-encoded slashes (`%2F`) stay distinct from real path 
separators.
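
The malformed categories listed above map onto what `java.net.URI` reports after parsing. The sketch below shows one way to detect them; `MalformedLocationSketch` and `needsExactFallback` are hypothetical names, and whether the PR classifies inputs exactly this way is an assumption.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical classifier for inputs that should fall back to exact string comparison.
final class MalformedLocationSketch {

    static boolean needsExactFallback(String location) {
        final URI uri;
        try {
            uri = new URI(location);
        } catch (URISyntaxException e) {
            // Parse failure, e.g. an unencoded space in the path.
            return true;
        }
        // Opaque URI such as "s3:bucket/key": no authority or path component.
        if (uri.isOpaque()) {
            return true;
        }
        boolean hasScheme = uri.getScheme() != null;
        boolean hasAuthority = uri.getRawAuthority() != null;
        // "s3:///path" has a scheme but a null authority;
        // "//bucket/path" has an authority but a null scheme.
        return hasScheme != hasAuthority;
    }

    // Fallback comparison: exact, case-sensitive string equality.
    static boolean exactEquals(String p1, String p2) {
        return p1.equals(p2);
    }
}
```

Keeping the comparison on `getRawPath()` rather than `getPath()` is what keeps `%2F` distinct from `/`, because `getPath()` decodes percent-escapes before returning the path.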
   
   The change was hardened via a multi-persona adversarial review loop run to 
convergence (5 consecutive clean rounds); the extra rounds mainly added test 
coverage.
   
   ### Release note
   
   None
   
   ### Check List (For Author)
   
   - Test
       - [x] Unit Test (`PathUtilsTest`, 23 cases covering the consistency 
contract plus opaque/encoded/triple-slash/network-path/null/scheme-family edge 
cases)
   
   - Behavior changed:
       - [x] Yes. `equalsIgnoreSchemeIfOneIsS3` now treats trailing slashes as 
insignificant and the authority+path comparison as case-sensitive 
**consistently** for same-scheme and cross-scheme inputs. For realistic 
fully-qualified Hive S3/OSS locations the rename decision is unchanged; the 
difference only appears for trailing-slash-only or case-only differences and 
for malformed inputs.
   
   - Does this need documentation?
       - [x] No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

