mzzz-zzm commented on issue #1001: URL: https://github.com/apache/iceberg-go/issues/1001#issuecomment-4402834186
Hi — flagging a finding before I open a PR. On `main` (`c62af3c`), `data_file.distinct_counts` is **already** absent from the on-disk Avro for every manifest version, not just v3. The reason is that the per-version record schemas in [`internal/avro_schemas.go`](https://github.com/apache/iceberg-go/blob/main/internal/avro_schemas.go) — `data_file_v1` (L238), `data_file_v2` (L276), `data_file_v3` (L360) — do not declare a `distinct_counts` field. The `hamba/avro` encoder writes only what the schema declares, so the `dataFile.DistinctCounts` Go field (tag `avro:"distinct_counts"`) is silently dropped on encode for v1/v2/v3 alike. A round-trip test on `main` confirms this: writing a v2 manifest with `DataFileBuilder.DistinctValueCounts(map[int]int64{1: 42})` and reading it back yields `nil` for `DataFile().DistinctValueCounts()` — no production change applied. Implications: 1. The intended v3 outcome (omit `distinct_counts`) is already in effect, but for the wrong reason. 2. **v2 is currently non-compliant**: the v2 spec still requires `distinct_counts` to be writable, and we drop it. That looks like a separate, pre-existing bug worth tracking. 3. A purely writer-side gate inside `v3writerImpl.prepareEntry` (clear `DistinctCounts`) would be a no-op against today's encoder — useful only as a defensive guard if v2 is later fixed by adding the field back to `data_file_v2`. Three ways forward — happy to take whichever direction you prefer: - **(a) Close as already-satisfied for v3**, file a separate issue for the v2 regression (add `distinct_counts` to `data_file_v2`, leave it out of `data_file_v3`). Cleanest split. - **(b) Single PR for both halves**: add `distinct_counts` to `data_file_v1` + `data_file_v2`, keep it out of `data_file_v3`, and add a `v3writerImpl.prepareEntry` guard so the Go-side `*[]colMap` cannot leak into a v3 schema if someone later widens `data_file_v3`. Larger scope than "good first issue", but matches Java parity end-to-end. - **(c) Defensive-only**: land the `prepareEntry` guard plus a v3-only round-trip test as a forward-compatibility insurance, and leave the v2 fix to a follow-up. Minimal diff, no functional change today. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
