[I] Data file names always use hardcoded "00000" partition ID instead of tracking actual sequence [iceberg-go]

via GitHub Fri, 23 Jan 2026 14:19:28 -0800


akarpovspl opened a new issue, #697:
URL: https://github.com/apache/iceberg-go/issues/697


   ### Apache Iceberg version
   
   main (development)
   
   ### Please describe the bug 🐞
   
   ## Description
   
   ### Current Behavior
   
   The `GenerateDataFileName` function in `table/writer.go` hardcodes the 
partition ID to `00000`:
   
   ```go
   func (w WriteTask) GenerateDataFileName(extension string) string {
       // Mimics the behavior in the Java API:
       // https://github.com/apache/iceberg/blob/a582968975dd30ff4917fbbe999f1be903efac02/core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java#L92-L101
       return fmt.Sprintf("00000-%d-%s.%s", w.ID, w.Uuid, extension)
   }
   ```
   
   This results in all data files having the same `00000` prefix regardless of 
when they were written:
   
   ```
   00000-0-7cebe869-463c-47d6-8fd6-64f4cb58f49e.parquet
   00000-0-34a719df-d820-493e-871b-83ad8c2144d4.parquet
   00000-0-d3036e2e-3980-4271-877b-33c7a32ea43e.parquet
   ```
   
   ### Expected Behavior
   
   According to the [Iceberg Java implementation](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java), the file naming format is:
   
   ```
   {partitionId}-{taskId}-{operationId}-{fileCount}.{extension}
   ```
   
   Where:
   - `partitionId` - identifies the partition (should increment or be unique 
per partition)
   - `taskId` - identifies the task/writer
   - `operationId` - UUID for the operation
   - `fileCount` - file counter within the task
   
   Files written in different commits should have incrementing identifiers.
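
   The format above can be sketched in Go roughly as follows. This is only an illustration: `javaStyleDataFileName` is a hypothetical helper that does not exist in iceberg-go, and the five-digit zero padding for `partitionId` and `fileCount` is an assumption based on the `00000` prefix seen in the current file names.

   ```go
   package main

   import "fmt"

   // javaStyleDataFileName is a hypothetical helper, not part of iceberg-go.
   // It formats {partitionId}-{taskId}-{operationId}-{fileCount}.{extension},
   // assuming five-digit zero padding for partitionID and fileCount.
   func javaStyleDataFileName(partitionID, taskID int, operationID string, fileCount int, extension string) string {
   	return fmt.Sprintf("%05d-%d-%s-%05d.%s", partitionID, taskID, operationID, fileCount, extension)
   }

   func main() {
   	op := "7cebe869-463c-47d6-8fd6-64f4cb58f49e"
   	// Two files written by different tasks within the same operation.
   	fmt.Println(javaStyleDataFileName(0, 0, op, 1, "parquet"))
   	fmt.Println(javaStyleDataFileName(3, 1, op, 2, "parquet"))
   }
   ```

   With this shape, the first and last segments of the name vary with the partition and the per-task file count, rather than staying fixed.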
   
   ### Impact
   
   | Aspect | Affected? |
   |--------|-----------|
   | Data integrity | No - UUIDs ensure uniqueness |
   | Query correctness | No - Iceberg metadata tracks files properly |
   | Iceberg spec compliance | Partial deviation from Java naming convention |
   | Operational visibility | Yes - cannot determine write order from filenames |
   | Debugging | Yes - harder to trace which commit created which file |
   
   ### Environment
   
   - **iceberg-go version**: v0.4.0
   - **Go version**: 1.23.x
   
   ### Reproduction Steps
   
   1. Create an Iceberg table
   2. Append data multiple times using `table.Append()`
   3. List the data files in S3/storage
   4. Observe all files have `00000` prefix
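
   Step 4 can be checked mechanically once the file names have been listed. The sketch below assumes the names are already available as a slice of strings (how they are listed from storage is not shown), and `distinctPrefixes` is a hypothetical helper written for this report.

   ```go
   package main

   import (
   	"fmt"
   	"sort"
   	"strings"
   )

   // distinctPrefixes returns the sorted set of leading name segments,
   // i.e. the text before the first "-" in each file name.
   func distinctPrefixes(names []string) []string {
   	seen := map[string]struct{}{}
   	for _, name := range names {
   		prefix, _, _ := strings.Cut(name, "-")
   		seen[prefix] = struct{}{}
   	}
   	out := make([]string, 0, len(seen))
   	for p := range seen {
   		out = append(out, p)
   	}
   	sort.Strings(out)
   	return out
   }

   func main() {
   	// File names taken from the listing shown earlier in this report.
   	names := []string{
   		"00000-0-7cebe869-463c-47d6-8fd6-64f4cb58f49e.parquet",
   		"00000-0-34a719df-d820-493e-871b-83ad8c2144d4.parquet",
   		"00000-0-d3036e2e-3980-4271-877b-33c7a32ea43e.parquet",
   	}
   	fmt.Println(distinctPrefixes(names)) // a single entry means every file shares one prefix
   }
   ```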
   
   ### Suggested Fix
   
   The `OutputFileFactory` equivalent in iceberg-go should track:
   1. A counter that persists across commits (or derives from table metadata)
   2. Proper partition ID based on the partition being written to
   
   This would bring the Go implementation in line with the Java reference 
implementation.
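
   A minimal sketch of what such a factory might look like is given below. The type `fileNameFactory` and its fields are hypothetical and are not existing iceberg-go API. The counter here lives only in memory for the lifetime of one factory, so it covers the per-task file count but not a counter that persists across commits; deriving that from table metadata is left out.

   ```go
   package main

   import (
   	"fmt"
   	"sync/atomic"
   )

   // fileNameFactory is a hypothetical stand-in for an OutputFileFactory
   // equivalent. It must be used through a pointer because it holds an
   // atomic counter.
   type fileNameFactory struct {
   	partitionID int
   	taskID      int
   	operationID string
   	fileCount   atomic.Int64 // incremented once per generated name
   }

   // next returns a new data file name and advances the file counter.
   // The five-digit zero padding is an assumption.
   func (f *fileNameFactory) next(extension string) string {
   	n := f.fileCount.Add(1)
   	return fmt.Sprintf("%05d-%d-%s-%05d.%s", f.partitionID, f.taskID, f.operationID, n, extension)
   }

   func main() {
   	f := &fileNameFactory{
   		partitionID: 2,
   		taskID:      0,
   		operationID: "d3036e2e-3980-4271-877b-33c7a32ea43e",
   	}
   	fmt.Println(f.next("parquet"))
   	fmt.Println(f.next("parquet"))
   }
   ```

   Using an atomic counter keeps name generation safe if several goroutines share one factory, at the cost of requiring the factory to be passed by pointer.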
   
   ---
   
   **Labels:** `enhancement`, `writer`, `spec-compliance`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

