laskoviymishka commented on code in PR #1104:
URL: https://github.com/apache/iceberg-go/pull/1104#discussion_r3288444941
##########
table/orphan_cleanup.go:
##########
@@ -547,7 +562,23 @@ func deleteFilesParallel(fs iceio.IO, orphanFiles
[]string, cfg *orphanCleanupCo
// 3. Path separators and casing may vary across different systems and
configurations
// 4. Without normalization, semantically identical paths would be treated as
different,
// leading to false positives in orphan detection
-func normalizeFilePath(path string, cfg *orphanCleanupConfig) string {
+//
+// normalizeFilePath normalizes a file path for comparison, handling schemes,
authorities, and separators.
+// It also aligns file:// URIs and bare local file system paths so they
normalize to the same format.
+func normalizeFilePath(path string) string {
Review Comment:
This silently eats the authority on RFC 8089 URIs like
`file://localhost/foo/bar.parquet`, so `localhost` gets folded into the path.
The downstream effect is that `getReferencedFiles` stores keys via this
stripper while `identifyOrphanFiles` looks them up via
`normalizeFilePathWithConfig` (which does not strip `file:`), so on a LocalFS
where `WalkDir` returns `file:///foo/x.parquet` and the referenced set holds
`/foo/x.parquet`, the file is classified as an orphan and deleted.
I'd parse with `url.Parse` for the `file:` scheme and take `parsed.Path`
directly, then point both call sites at the same function. `file:///foo` and
`file://localhost/foo` both land at `/foo` and we don't have to reason about
trim ordering.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]