mzzz-zzm commented on issue #1001:
URL: https://github.com/apache/iceberg-go/issues/1001#issuecomment-4402834186

   Hi — flagging a finding before I open a PR.
   
   On `main` (`c62af3c`), `data_file.distinct_counts` is **already** absent 
from the on-disk Avro for every manifest version, not just v3. The reason is 
that the per-version record schemas in 
[`internal/avro_schemas.go`](https://github.com/apache/iceberg-go/blob/main/internal/avro_schemas.go)
 — `data_file_v1` (L238), `data_file_v2` (L276), `data_file_v3` (L360) — do not 
declare a `distinct_counts` field. The `hamba/avro` encoder writes only what 
the schema declares, so the `dataFile.DistinctCounts` Go field (tag 
`avro:"distinct_counts"`) is silently dropped on encode for v1/v2/v3 alike.
   
   A round-trip test on `main` confirms this: writing a v2 manifest with 
`DataFileBuilder.DistinctValueCounts(map[int]int64{1: 42})` and reading it back 
yields `nil` for `DataFile().DistinctValueCounts()` — no production change 
applied.
   
   Implications:
   
   1. The intended v3 outcome (omit `distinct_counts`) is already in effect, 
but for the wrong reason.
   2. **v2 is currently non-compliant**: the v2 spec still requires 
`distinct_counts` to be writable, and we drop it. That looks like a separate, 
pre-existing bug worth tracking.
   3. A purely writer-side gate inside `v3writerImpl.prepareEntry` (clear 
`DistinctCounts`) would be a no-op against today's encoder — useful only as a 
defensive guard if v2 is later fixed by adding the field back to `data_file_v2`.
   
   Three ways forward — happy to take whichever direction you prefer:
   
   - **(a) Close as already-satisfied for v3**, file a separate issue for the 
v2 regression (add `distinct_counts` to `data_file_v2`, leave it out of 
`data_file_v3`). Cleanest split.
   - **(b) Single PR for both halves**: add `distinct_counts` to `data_file_v1` 
+ `data_file_v2`, keep it out of `data_file_v3`, and add a 
`v3writerImpl.prepareEntry` guard so the Go-side `*[]colMap` cannot leak into a 
v3 schema if someone later widens `data_file_v3`. Larger scope than "good first 
issue", but matches Java parity end-to-end.
   - **(c) Defensive-only**: land the `prepareEntry` guard plus a v3-only 
round-trip test as a forward-compatibility insurance, and leave the v2 fix to a 
follow-up. Minimal diff, no functional change today.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to