zeroshade commented on code in PR #1185:
URL: https://github.com/apache/iceberg-go/pull/1185#discussion_r3399571129


##########
catalog/hadoop/hadoop.go:
##########
@@ -723,3 +773,26 @@ func (c *Catalog) LoadNamespaceProperties(_ 
context.Context, ns table.Identifier
 func (c *Catalog) UpdateNamespaceProperties(_ context.Context, _ 
table.Identifier, _ []string, _ iceberg.Properties) 
(catalog.PropertiesUpdateSummary, error) {
        return catalog.PropertiesUpdateSummary{}, errors.New("hadoop catalog: 
UpdateNamespaceProperties not yet implemented")
 }
+
+// CheckedMkdirAll is a helper function that checkes
+// all subdirectories of a given identifier path exist before creting the full 
path.
+// If errIfExists is true, it also checks that the full path does not already 
exist before creating it.
+func (c *Catalog) CheckedMkdirAll(id table.Identifier, errIfExists bool) error 
{

Review Comment:
   A few small things on this helper:
   - Consider unexporting (`checkedMkdirAll`) — internal helper, no need to 
widen `*Catalog`'s public API.
   - The `errIfExists=false` path has no caller and no test; drop the param or 
add coverage.
   - `Stat`-then-`MkdirAll` isn't atomic like the old single `os.Mkdir`, so the 
`ErrNamespaceAlreadyExists` guarantee is now best-effort under concurrent 
`CreateNamespace`. Likely fine, but worth a note.



##########
catalog/hadoop/hadoop.go:
##########
@@ -188,25 +188,39 @@ func (c *Catalog) defaultTableLocation(ident 
table.Identifier) string {
 func isTableDir(filesystem HadoopCatalogFS, path string) bool {
        metaDir := filepath.Join(path, "metadata")
 
-       entries, err := filesystem.ReadDir(metaDir)
-       if err != nil {
-               return false
-       }
+       found_metadata := false

Review Comment:
   Non-idiomatic snake_case — rename to `foundMetadata` (matches `foundEntries` 
added elsewhere in this PR).



##########
catalog/hadoop/hadoop.go:
##########
@@ -526,29 +549,38 @@ func (c *Catalog) ListTables(_ context.Context, ns 
table.Identifier) iter.Seq2[t
                        return
                }
 
-               entries, err := c.filesystem.ReadDir(nsPath)
-               if err != nil {
-                       yield(nil, fmt.Errorf("hadoop catalog: failed to read 
namespace directory: %w", err))
-
-                       return
-               }
+               err = c.filesystem.WalkDir(nsPath, func(path string, d 
fs.DirEntry, err error) error {
+                       if err != nil {
+                               return err
+                       }
 
-               for _, e := range entries {
-                       if !e.IsDir() {
-                               continue
+                       // Anything that is a file is not a namespace or table, 
so skip it.
+                       if !d.IsDir() {
+                               return nil
                        }
 
-                       child := filepath.Join(nsPath, e.Name())
-                       if !isTableDir(c.filesystem, child) {
-                               continue
+                       // Skip the namespace directory itself.
+                       if path == nsPath {
+                               return nil
                        }
 
+                       // Skip anything that is not a table directory.
+                       if !isTableDir(c.filesystem, path) {
+                               return nil
+                       }
                        ident := make(table.Identifier, len(ns)+1)
                        copy(ident, ns)
-                       ident[len(ns)] = e.Name()
+                       ident[len(ns)] = d.Name()
                        if !yield(ident, nil) {
-                               return
+                               return nil
                        }
+
+                       return nil

Review Comment:
   `ListTables`'s `WalkDir` callback returns `nil` in every branch, so it never 
stops descending (unlike `ListNamespaces` below, which uses `fs.SkipDir`). Both 
reproduced locally:
   
   - **Wrong results:** it recurses into child namespaces and yields nested 
tables under a flattened identifier — `ns/child/tbl` comes back as 
`["ns","tbl"]`.
   - **Panic on early `break`:** after `!yield` returns false, `return nil` 
keeps the walk going and calls `yield` again → `range function continued 
iteration after function for loop body returned false`.
   - **Cost:** walks each table's entire `data/`/`metadata/` subtree just to 
list names — expensive locally, much worse on the cloud backends this PR 
targets.
   
   Mirror `ListNamespaces`: return `fs.SkipDir` for the non-table dir (L569) 
and after yielding a table (L578), and `fs.SkipAll` when `!yield` (L575).



##########
catalog/hadoop/hadoop.go:
##########
@@ -188,25 +188,39 @@ func (c *Catalog) defaultTableLocation(ident 
table.Identifier) string {
 func isTableDir(filesystem HadoopCatalogFS, path string) bool {
        metaDir := filepath.Join(path, "metadata")
 
-       entries, err := filesystem.ReadDir(metaDir)
-       if err != nil {
-               return false
-       }
+       found_metadata := false
+       err := filesystem.WalkDir(metaDir, func(path string, d fs.DirEntry, err 
error) error {
+               if err != nil {
+                       return err
+               }
+
+               // Skip the root itself.
+               if path == metaDir {
+                       return nil
+               }
 
-       for _, e := range entries {
-               if e.IsDir() {
-                       continue
+               // Don't descend into subdirectories.
+               if d.IsDir() {
+                       return fs.SkipDir
                }
 
-               name := e.Name()
+               name := d.Name()
                if versionPattern.MatchString(name) ||
                        uuidMetadataPattern.MatchString(name) ||
                        name == "version-hint.text" {
-                       return true
+                       found_metadata = true
+
+                       return fs.SkipAll
                }
+
+               return nil
+       })
+
+       if err != nil && err != fs.SkipAll {

Review Comment:
   `LocalFS.WalkDir` delegates to `filepath.WalkDir`, which consumes 
`fs.SkipAll` and returns `nil`, so `&& err != fs.SkipAll` is unreachable. 
Simplify to `if err != nil { return false }`.



##########
catalog/hadoop/hadoop.go:
##########
@@ -723,3 +773,26 @@ func (c *Catalog) LoadNamespaceProperties(_ 
context.Context, ns table.Identifier
 func (c *Catalog) UpdateNamespaceProperties(_ context.Context, _ 
table.Identifier, _ []string, _ iceberg.Properties) 
(catalog.PropertiesUpdateSummary, error) {
        return catalog.PropertiesUpdateSummary{}, errors.New("hadoop catalog: 
UpdateNamespaceProperties not yet implemented")
 }
+
+// CheckedMkdirAll is a helper function that checkes

Review Comment:
   Typos: "checkes" → "checks", "creting" → "creating".



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to