twuebi opened a new pull request, #1075:
URL: https://github.com/apache/iceberg-go/pull/1075

   Adds on-the-wire encoding for DataFile and FileScanTask to enable 
distributed compaction. #1033 decomposed compaction into a coordinator 
portion and a worker portion; this PR lets them be plumbed together over the 
wire. A FileScanTask needs to go coordinator -> worker (data file, delete 
files, scan range, and v3 row lineage), and DataFiles need to go 
worker -> coordinator, which collects them for the commit.
   
   iceberg.EncodeDataFile / DecodeDataFile reuse the manifest-entry Avro 
encoding so the bytes on the wire are the same bytes a manifest carries. The 
dataFile struct's avro tags remain the single source of truth. Adding a field 
to DataFile extends what the helpers transport, with no wire mirror to keep in 
sync. Encoding is non-mutating and thread-safe: a fresh *dataFile is cloned via 
reflection over avro tags, so the source is untouched. 
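
The clone-by-tag idea can be sketched roughly as follows. This is not the PR's code: the `dataFile` struct here is a hypothetical three-field stand-in, `cloneTagged` is an invented name, and the map-copying rule is an assumption about what "source untouched" requires.

```go
package main

import (
	"fmt"
	"reflect"
)

// dataFile is a hypothetical stand-in for the real struct.
// Only fields carrying an avro tag are considered part of the wire form.
type dataFile struct {
	Path        string        `avro:"file_path"`
	RecordCount int64         `avro:"record_count"`
	ColumnSizes map[int]int64 `avro:"column_sizes"`
	cache       string        // untagged: skipped by the clone
}

// cloneTagged returns a fresh *dataFile holding copies of every
// avro-tagged field of src. Maps are copied entry by entry so that
// later writes to the clone cannot reach the source.
func cloneTagged(src *dataFile) *dataFile {
	dst := &dataFile{}
	sv := reflect.ValueOf(src).Elem()
	dv := reflect.ValueOf(dst).Elem()
	for i := 0; i < sv.NumField(); i++ {
		if _, ok := sv.Type().Field(i).Tag.Lookup("avro"); !ok {
			continue // not transported
		}
		switch fv := sv.Field(i); fv.Kind() {
		case reflect.Map:
			if fv.IsNil() {
				continue
			}
			m := reflect.MakeMapWithSize(fv.Type(), fv.Len())
			for it := fv.MapRange(); it.Next(); {
				m.SetMapIndex(it.Key(), it.Value())
			}
			dv.Field(i).Set(m)
		default:
			dv.Field(i).Set(fv)
		}
	}
	return dst
}

func main() {
	src := &dataFile{Path: "a.parquet", RecordCount: 3,
		ColumnSizes: map[int]int64{1: 10}, cache: "local"}
	c := cloneTagged(src)
	c.ColumnSizes[1] = 99 // mutate the clone only
	fmt.Println(src.ColumnSizes[1], c.ColumnSizes[1], c.cache == "")
}
```

Because the loop is driven by the struct's own tags, a newly added tagged field would be picked up without touching the helper, which is the property the paragraph above relies on.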
   
   table.EncodeFileScanTask / DecodeFileScanTask layer on top: each embedded 
DataFile is iceberg-encoded, then wrapped alongside the scan range and v3 row 
lineage in a small Avro envelope.
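
For readers unfamiliar with the layering, here is a minimal sketch of what such an envelope could look like on the wire. The `envelope` type, its field set, and its field order are assumptions for illustration, not the PR's schema, and v3 row lineage is omitted. The sketch hand-writes Avro's binary encoding with the standard library only: a `long` is a zigzag varint (which is what `binary.AppendVarint` produces), `bytes` is a length followed by the payload, and an array is a counted block followed by a zero terminator.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// envelope is a hypothetical wire shape: opaque, already-encoded files
// plus the scan range.
type envelope struct {
	DataFile    []byte
	DeleteFiles [][]byte
	Start       int64
	Length      int64
}

func putLong(b []byte, v int64) []byte { return binary.AppendVarint(b, v) }

func putBytes(b, p []byte) []byte { return append(putLong(b, int64(len(p))), p...) }

// encode writes the record fields in declaration order, as Avro does.
func encode(e envelope) []byte {
	out := putBytes(nil, e.DataFile)
	if n := len(e.DeleteFiles); n > 0 {
		out = putLong(out, int64(n)) // a single array block
		for _, d := range e.DeleteFiles {
			out = putBytes(out, d)
		}
	}
	out = putLong(out, 0) // array terminator
	out = putLong(out, e.Start)
	return putLong(out, e.Length)
}

func getBytes(r *bytes.Reader) ([]byte, error) {
	n, err := binary.ReadVarint(r)
	if err != nil {
		return nil, err
	}
	if n < 0 || n > int64(r.Len()) {
		return nil, errors.New("invalid bytes length")
	}
	p := make([]byte, n)
	_, err = io.ReadFull(r, p)
	return p, err
}

func decode(b []byte) (envelope, error) {
	r := bytes.NewReader(b)
	var e envelope
	var err error
	if e.DataFile, err = getBytes(r); err != nil {
		return e, err
	}
	for {
		n, err := binary.ReadVarint(r)
		if err != nil {
			return e, err
		}
		if n == 0 {
			break
		}
		if n < 0 { // Avro's size-prefixed block form
			return e, errors.New("sized blocks not handled in this sketch")
		}
		for i := int64(0); i < n; i++ {
			d, err := getBytes(r)
			if err != nil {
				return e, err
			}
			e.DeleteFiles = append(e.DeleteFiles, d)
		}
	}
	if e.Start, err = binary.ReadVarint(r); err != nil {
		return e, err
	}
	e.Length, err = binary.ReadVarint(r)
	return e, err
}

func main() {
	in := envelope{DataFile: []byte("df"), DeleteFiles: [][]byte{[]byte("d1")}, Start: 4, Length: 128}
	out, err := decode(encode(in))
	fmt.Println(string(out.DataFile), len(out.DeleteFiles), out.Start, out.Length, err)
}
```

Treating each embedded file as opaque bytes keeps the envelope independent of the per-(specID, version) data file schema; only the inner decoder needs the out-of-band table metadata.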
   
   Design notes: 
   - The receiver supplies (spec, schema, version) out of band. Both sides in 
the distributed-compaction design already hold table metadata, and the 
per-(specID, version) avro schema is cached. Happy to switch to a 
self-describing payload if preferred.
   - distinct_counts is not transported because the iceberg-go manifest-entry 
schema doesn't carry it in any version. Callers that need it must transport it 
separately.
   - Reflection runs once per encode and is not in a hot path I can see. Happy 
to precompute the avro field index at init if preferred.
   - An anonymous `var _ = FileScanTask{...}` literal next to the codec is a 
compile-time drift guard: adding/retyping/reordering a FileScanTask field 
breaks the build, forcing a deliberate call on whether it must cross the wire.
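
The drift guard from the last design note can be illustrated with a trimmed example. The `FileScanTask` below is a hypothetical three-field stand-in rather than the real `table.FileScanTask`, and `wireTask` / `toWire` are invented names for the codec side.

```go
package main

import "fmt"

// FileScanTask is a hypothetical, trimmed stand-in; the real struct
// has different fields.
type FileScanTask struct {
	Path   string
	Start  int64
	Length int64
}

// Drift guard: an unkeyed literal must list every field, in order, with
// an assignable value. Adding a field stops this line from compiling
// ("too few values"), and retyping one fails with a type mismatch because
// the numeric values are explicitly typed.
// Caveat of this sketch: swapping two fields of the same type still compiles.
var _ = FileScanTask{"", int64(0), int64(0)}

// wireTask is the hypothetical on-the-wire shape the guard protects.
type wireTask struct {
	Path   string
	Start  int64
	Length int64
}

// toWire copies exactly the fields the codec has decided to transport.
func toWire(t FileScanTask) wireTask {
	return wireTask{Path: t.Path, Start: t.Start, Length: t.Length}
}

func main() {
	fmt.Printf("%+v\n", toWire(FileScanTask{"a.parquet", 0, 128}))
}
```

The guard costs nothing at run time since the value is discarded; its only job is to make the compiler point at the codec whenever the struct changes shape.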
   
   Format versions 1, 2, and 3 are supported. No change to existing manifest 
read/write paths.
   
   Tests: round-trip across v1/v2/v3 with fully populated DataFiles, 
foreign-impl rejection, partition-data idempotence, and FileScanTask shape 
including v3 row lineage.
   
   cc @laskoviymishka: this continues the building blocks for distributed 
compaction.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

