akarpovspl opened a new issue, #697:
URL: https://github.com/apache/iceberg-go/issues/697
### Apache Iceberg version
main (development)
### Please describe the bug 🐞
## Description
### Current Behavior
The `GenerateDataFileName` function in `table/writer.go` hardcodes the
partition ID to `00000`:
```go
func (w WriteTask) GenerateDataFileName(extension string) string {
	// Mimics the behavior in the Java API:
	// https://github.com/apache/iceberg/blob/a582968975dd30ff4917fbbe999f1be903efac02/core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java#L92-L101
	return fmt.Sprintf("00000-%d-%s.%s", w.ID, w.Uuid, extension)
}
```
This results in all data files having the same `00000` prefix regardless of
when they were written:
```
00000-0-7cebe869-463c-47d6-8fd6-64f4cb58f49e.parquet
00000-0-34a719df-d820-493e-871b-83ad8c2144d4.parquet
00000-0-d3036e2e-3980-4271-877b-33c7a32ea43e.parquet
```
### Expected Behavior
According to the [Iceberg Java
implementation](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java),
the file naming format is:
```
{partitionId}-{taskId}-{operationId}-{fileCount}.{extension}
```
Where:
- `partitionId` - identifies the partition (should increment or be unique
per partition)
- `taskId` - identifies the task/writer
- `operationId` - UUID for the operation
- `fileCount` - file counter within the task
Files written in different commits should have incrementing identifiers.
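
For illustration, the four-part format above could be produced in Go roughly as follows. This is a minimal sketch, not code from iceberg-go: `dataFileName` and its parameters are hypothetical names, and the zero-padding widths (`%05d` for the partition ID and the file count) are assumed from the Java `OutputFileFactory` format string and should be checked against the linked source.

```go
package main

import "fmt"

// dataFileName is a hypothetical helper (not part of iceberg-go) that builds
// a name of the form {partitionId}-{taskId}-{operationId}-{fileCount}.{extension}.
// The 5-digit zero padding is an assumption taken from the Java implementation.
func dataFileName(partitionID, taskID int, operationID string, fileCount int, extension string) string {
	return fmt.Sprintf("%05d-%d-%s-%05d.%s", partitionID, taskID, operationID, fileCount, extension)
}
```

With these assumptions, `dataFileName(3, 1, "7cebe869-463c-47d6-8fd6-64f4cb58f49e", 2, "parquet")` returns `00003-1-7cebe869-463c-47d6-8fd6-64f4cb58f49e-00002.parquet`.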
### Impact
| Aspect | Affected? |
|--------|-----------|
| Data integrity | No - UUIDs ensure uniqueness |
| Query correctness | No - Iceberg metadata tracks files properly |
| Iceberg spec compliance | Partial deviation from Java naming convention |
| Operational visibility | Yes - cannot determine write order from filenames |
| Debugging | Yes - harder to trace which commit created which file |
### Environment
- **iceberg-go version**: v0.4.0
- **Go version**: 1.23.x
### Reproduction Steps
1. Create an Iceberg table
2. Append data multiple times using `table.Append()`
3. List the data files in S3/storage
4. Observe all files have `00000` prefix
### Suggested Fix
The `OutputFileFactory` equivalent in iceberg-go should track:
1. A counter that persists across commits (or derives from table metadata)
2. Proper partition ID based on the partition being written to
This would bring the Go implementation in line with the Java reference
implementation.
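
One possible shape for such a factory is sketched below. Everything here is an assumption for discussion: `fileNameFactory`, its fields, and `next` are hypothetical and do not exist in iceberg-go. The counter lives only as long as the factory value, so it covers the per-task file count; deriving a value that persists across commits from table metadata is not shown.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// fileNameFactory is a hypothetical stand-in for an OutputFileFactory
// equivalent. It is not an existing iceberg-go type.
type fileNameFactory struct {
	partitionID int          // ID of the partition being written (assumed to be known by the caller)
	taskID      int          // ID of the task/writer
	operationID string       // UUID of the write operation
	fileCount   atomic.Int64 // number of names handed out so far
}

// next returns the next data file name and increments the file counter.
// It is safe to call from multiple goroutines.
func (f *fileNameFactory) next(extension string) string {
	n := f.fileCount.Add(1)
	return fmt.Sprintf("%05d-%d-%s-%05d.%s", f.partitionID, f.taskID, f.operationID, n, extension)
}
```

Under these assumptions, a factory created with `partitionID: 2`, `taskID: 7`, and `operationID: "op"` would return `00002-7-op-00001.parquet` on the first call to `next("parquet")` and `00002-7-op-00002.parquet` on the second.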
---
**Labels:** `enhancement`, `writer`, `spec-compliance`