0dunay0 opened a new pull request, #6503:
URL: https://github.com/apache/paimon/pull/6503
### Purpose
Linked issue: close #6502
This PR fixes Redshift Spectrum querying for Paimon tables with Iceberg
compatibility by populating optional snapshot summary fields that are required
by certain Iceberg query engines.
When Paimon generates Iceberg metadata, it currently includes only the
`operation` field in snapshot summaries. Although the Iceberg specification marks
most summary fields as "optional," some query engines (notably AWS Redshift
Spectrum) require fields such as `total-records` to parse and query tables
successfully.
As a result, Paimon tables with Iceberg compatibility can be queried in AWS
Athena but fail in Redshift Spectrum with the error
`Required field total-records missing`.
## Changes
### Added `computeSnapshotSummary()` Helper Method
Aggregates statistics from `IcebergManifestFileMeta` objects to compute
snapshot-level metrics including:
**Required fields (always present):**
- `total-records` - Total number of live records
- `total-data-files` - Total number of live data files
- `total-delete-files` - Total number of live delete files
- `total-position-deletes` - Total position delete records
- `total-equality-deletes` - Always "0" (Paimon doesn't use equality deletes)
**Optional fields (when non-zero):**
- `added-data-files`, `added-records`, `added-files-size`
- `deleted-data-files`, `deleted-records`, `deleted-files-size`
- `total-files-size`
- `changed-partition-count`
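
For readers who want a feel for the aggregation, here is a minimal, standalone sketch. It is not the code in this PR: `SnapshotSummarySketch`, `ManifestStats`, and `compute` are hypothetical names, and the counters are loosely modeled on the added/existing/deleted file and row counts that the Iceberg manifest list records, not on the actual `IcebergManifestFileMeta` API. It assumes all added and deleted counts belong to the snapshot being written, and it leaves out the size and partition-count fields.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SnapshotSummarySketch {

    // Hypothetical per-manifest counters; a stand-in, not Paimon's IcebergManifestFileMeta.
    static final class ManifestStats {
        final boolean deleteManifest; // true if the manifest tracks delete files
        final long addedFiles, existingFiles, deletedFiles;
        final long addedRows, existingRows, deletedRows;

        ManifestStats(boolean deleteManifest,
                      long addedFiles, long existingFiles, long deletedFiles,
                      long addedRows, long existingRows, long deletedRows) {
            this.deleteManifest = deleteManifest;
            this.addedFiles = addedFiles;
            this.existingFiles = existingFiles;
            this.deletedFiles = deletedFiles;
            this.addedRows = addedRows;
            this.existingRows = existingRows;
            this.deletedRows = deletedRows;
        }
    }

    // Sums manifest-level counters into snapshot-level summary entries.
    static Map<String, String> compute(String operation, List<ManifestStats> manifests) {
        long totalRecords = 0, totalDataFiles = 0, totalDeleteFiles = 0, totalPositionDeletes = 0;
        long addedDataFiles = 0, addedRecords = 0, deletedDataFiles = 0, deletedRecords = 0;

        for (ManifestStats m : manifests) {
            // "Live" means still part of the table: added or carried over, not deleted.
            long liveFiles = m.addedFiles + m.existingFiles;
            long liveRows = m.addedRows + m.existingRows;
            if (m.deleteManifest) {
                totalDeleteFiles += liveFiles;
                totalPositionDeletes += liveRows;
            } else {
                totalDataFiles += liveFiles;
                totalRecords += liveRows;
                addedDataFiles += m.addedFiles;
                addedRecords += m.addedRows;
                deletedDataFiles += m.deletedFiles;
                deletedRecords += m.deletedRows;
            }
        }

        Map<String, String> summary = new LinkedHashMap<>();
        summary.put("operation", operation);
        // Always written, even when zero.
        summary.put("total-records", Long.toString(totalRecords));
        summary.put("total-data-files", Long.toString(totalDataFiles));
        summary.put("total-delete-files", Long.toString(totalDeleteFiles));
        summary.put("total-position-deletes", Long.toString(totalPositionDeletes));
        summary.put("total-equality-deletes", "0");
        // Written only when non-zero, following the list above.
        putIfNonZero(summary, "added-data-files", addedDataFiles);
        putIfNonZero(summary, "added-records", addedRecords);
        putIfNonZero(summary, "deleted-data-files", deletedDataFiles);
        putIfNonZero(summary, "deleted-records", deletedRecords);
        return summary;
    }

    private static void putIfNonZero(Map<String, String> summary, String key, long value) {
        if (value != 0) {
            summary.put(key, Long.toString(value));
        }
    }

    public static void main(String[] args) {
        // One data manifest that adds 2 files holding 83282 rows.
        Map<String, String> summary = compute("append",
                Arrays.asList(new ManifestStats(false, 2, 0, 0, 83282, 0, 0)));
        summary.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Values are emitted as strings because Iceberg snapshot summaries are string-to-string maps.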
### Tests
Updated `IcebergMetadataTest.java`
### API and Format
N/A
### Documentation
Reintroduces a feature that was previously available.
```
aws s3 cp s3://some-bucket/paimon/warehouse/somedb.db/some_table/metadata/v190.metadata.json - \
  | jq '.snapshots[0].summary'
{
  "added-data-files": "2",
  "total-equality-deletes": "0",
  "added-records": "83282",
  "deleted-data-files": "0",
  "deleted-records": "0",
  "total-records": "83282",
  "deleted-files-size": "0",
  "changed-partition-count": "1",
  "total-position-deletes": "0",
  "added-files-size": "4683766",
  "total-delete-files": "0",
  "total-files-size": "4683766",
  "total-data-files": "2",
  "operation": "append"
}
```
Redshift Spectrum can now query the table.