laskoviymishka commented on code in PR #848:
URL: https://github.com/apache/iceberg-go/pull/848#discussion_r3034272603


##########
table/orphan_cleanup.go:
##########
@@ -240,7 +242,16 @@ func (t Table) getReferencedFiles(fs iceio.IO) 
(map[string]bool, error) {
        versionHintPath := filepath.Join(metadata.Location(), "metadata", 
"version-hint.text")
        referenced[versionHintPath] = true
 
-       // TODO: Add statistics files support once iceberg-go exposes 
statisticsFiles()
+       for sf := range metadata.Statistics() {
+               if sf.StatisticsPath != "" {

Review Comment:
   nit: `StatisticsPath` is a required field per the Iceberg spec — an empty 
string here would mean the metadata itself is malformed. 
   
   The guard is harmless as defensive programming, but you could add a short 
comment like:
   
   ```
   // guard against malformed metadata` to clarify intent.
   ```



##########
table/orphan_cleanup_test.go:
##########
@@ -468,3 +468,57 @@ func TestOrphanCleanup_EdgeCases(t *testing.T) {
                assert.Equal(t, "host", result)
        })
 }
+
+func TestGetReferencedFiles_IncludesStatisticsFiles(t *testing.T) {
+       const metaJSON = `{
+  "format-version": 2,
+  "table-uuid": "9c12d441-03fe-4693-9a96-a0705ddf69c1",
+  "location": "s3://bucket/test/location",
+  "last-sequence-number": 0,
+  "last-updated-ms": 1602638573590,
+  "last-column-id": 1,
+  "current-schema-id": 0,
+  "schemas": [
+    {"type": "struct", "schema-id": 0, "fields": [{"id": 1, "name": "x", 
"required": true, "type": "long"}]}
+  ],
+  "default-spec-id": 0,
+  "partition-specs": [{"spec-id": 0, "fields": []}],
+  "last-partition-id": 0,
+  "default-sort-order-id": 0,
+  "sort-orders": [{"order-id": 0, "fields": []}],
+  "metadata-log": [],
+  "snapshot-log": [],
+  "statistics": [
+    {
+      "snapshot-id": 1,
+      "statistics-path": "s3://bucket/stats/table-stats.puffin",
+      "file-size-in-bytes": 1024,
+      "file-footer-size-in-bytes": 512,
+      "blob-metadata": []
+    }
+  ],
+  "partition-statistics": [
+    {
+      "snapshot-id": 1,
+      "statistics-path": "s3://bucket/stats/part-stats.puffin",
+      "file-size-in-bytes": 512
+    }
+  ]
+}`
+
+       meta, err := ParseMetadataString(metaJSON)
+       require.NoError(t, err)
+
+       tbl := Table{
+               metadata:         meta,
+               metadataLocation: 
"s3://bucket/test/location/metadata/v1.metadata.json",
+       }
+
+       // No snapshots: FileIO is not used; statistics paths must still be 
referenced.
+       refs, err := tbl.getReferencedFiles(nil)
+       require.NoError(t, err)
+
+       assert.True(t, refs["s3://bucket/stats/table-stats.puffin"])
+       assert.True(t, refs["s3://bucket/stats/part-stats.puffin"])
+       assert.True(t, refs[tbl.metadataLocation])

Review Comment:
   Consider adding a negative assertion to confirm the referenced set isn't 
trivially returning true for everything:
   
   ```
     assert.False(t, refs["s3://bucket/stats/not-referenced.puffin"])
   ```
   
   Also optionally: an entry with "statistics-path": "" to exercise the 
empty-string guard in the production code, and assert refs[""] == false.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to