0dunay0 opened a new pull request, #6503:
URL: https://github.com/apache/paimon/pull/6503

   ### Purpose
   
   Linked issue: close #6502 
   
   This PR fixes Redshift Spectrum querying for Paimon tables with Iceberg 
compatibility by populating optional snapshot summary fields that are required 
by certain Iceberg query engines.
   
   When Paimon generates Iceberg metadata, it currently only includes the 
`operation` field in snapshot summaries. While the Iceberg specification marks 
most summary fields as "optional," some query engines (notably AWS Redshift 
Spectrum) require fields like `total-records` to successfully parse and query 
tables.
   
   As a result, Paimon+Iceberg tables can be queried in AWS Athena but fail in 
Redshift Spectrum with the error `Required field total-records missing`.
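
To make the failure mode concrete, here is a minimal sketch of a check that reports which summary keys a strict reader would find missing. The class `SummaryFieldCheck`, the method `missingKeys`, and the choice of keys are hypothetical; treating these five keys as mandatory is an assumption modeled on the error above, not documented Redshift Spectrum behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class SummaryFieldCheck {

    // Keys assumed to be expected by a strict reader (hypothetical list,
    // taken from the fields this PR always writes).
    static final List<String> EXPECTED_KEYS = Arrays.asList(
            "total-records",
            "total-data-files",
            "total-delete-files",
            "total-position-deletes",
            "total-equality-deletes");

    /** Returns the expected keys that are absent from the given snapshot summary. */
    static List<String> missingKeys(Map<String, String> summary) {
        List<String> missing = new ArrayList<>();
        for (String key : EXPECTED_KEYS) {
            if (!summary.containsKey(key)) {
                missing.add(key);
            }
        }
        return missing;
    }
}
```

A summary that carries only `operation` would report all five keys as missing, with `total-records` first.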
   
   ### Changes
   
   #### Added `computeSnapshotSummary()` Helper Method
   
   Aggregates statistics from `IcebergManifestFileMeta` objects to compute 
snapshot-level metrics including:
   
   **Fields always written (required by some engines):**
   - `total-records` - Total number of live records
   - `total-data-files` - Total number of live data files  
   - `total-delete-files` - Total number of live delete files
   - `total-position-deletes` - Total position delete records
   - `total-equality-deletes` - Always "0" (Paimon doesn't use equality deletes)
   
   **Optional fields (when non-zero):**
   - `added-data-files`, `added-records`, `added-files-size`
   - `deleted-data-files`, `deleted-records`, `deleted-files-size`
   - `total-files-size`
   - `changed-partition-count`
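
The aggregation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `ManifestStats` is a hypothetical stand-in for the counters read from `IcebergManifestFileMeta`, `computeSummary` is a hypothetical name, and only a subset of the optional fields is shown.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SnapshotSummarySketch {

    /** Hypothetical per-manifest counters; a stand-in, not Paimon's IcebergManifestFileMeta. */
    static final class ManifestStats {
        final boolean deleteManifest; // true if the manifest tracks delete files
        final long liveFiles;         // files still live in this snapshot
        final long liveRows;          // rows in those live files
        final long addedFiles;        // files added by this snapshot
        final long addedRows;         // rows in those added files

        ManifestStats(boolean deleteManifest, long liveFiles, long liveRows,
                      long addedFiles, long addedRows) {
            this.deleteManifest = deleteManifest;
            this.liveFiles = liveFiles;
            this.liveRows = liveRows;
            this.addedFiles = addedFiles;
            this.addedRows = addedRows;
        }
    }

    /** Sums the per-manifest counters into an Iceberg-style string-to-string summary map. */
    static Map<String, String> computeSummary(String operation, List<ManifestStats> manifests) {
        long totalRecords = 0, totalDataFiles = 0;
        long totalDeleteFiles = 0, totalPositionDeletes = 0;
        long addedDataFiles = 0, addedRecords = 0;

        for (ManifestStats m : manifests) {
            if (m.deleteManifest) {
                totalDeleteFiles += m.liveFiles;
                totalPositionDeletes += m.liveRows;
            } else {
                totalDataFiles += m.liveFiles;
                totalRecords += m.liveRows;
                addedDataFiles += m.addedFiles;
                addedRecords += m.addedRows;
            }
        }

        Map<String, String> summary = new LinkedHashMap<>();
        summary.put("operation", operation);

        // Always written, even when the value is zero.
        summary.put("total-records", Long.toString(totalRecords));
        summary.put("total-data-files", Long.toString(totalDataFiles));
        summary.put("total-delete-files", Long.toString(totalDeleteFiles));
        summary.put("total-position-deletes", Long.toString(totalPositionDeletes));
        summary.put("total-equality-deletes", "0");

        // Written only when non-zero.
        putIfNonZero(summary, "added-data-files", addedDataFiles);
        putIfNonZero(summary, "added-records", addedRecords);
        return summary;
    }

    private static void putIfNonZero(Map<String, String> summary, String key, long value) {
        if (value != 0) {
            summary.put(key, Long.toString(value));
        }
    }
}
```

Values are stored as strings because Iceberg snapshot summaries are string-to-string maps. The `LinkedHashMap` only keeps the output order stable for inspection.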
   
   ### Tests
   
   Updated `IcebergMetadataTest.java`
   
   ### API and Format
   
   N/A
   
   ### Documentation
   
   Reintroduces a feature that was previously available.
   
   ```
   aws s3 cp s3://some-bucket/paimon/warehouse/somedb.db/some_table/metadata/v190.metadata.json - | jq '.snapshots[0].summary'
   
   {
     "added-data-files": "2",
     "total-equality-deletes": "0",
     "added-records": "83282",
     "deleted-data-files": "0",
     "deleted-records": "0",
     "total-records": "83282",
     "deleted-files-size": "0",
     "changed-partition-count": "1",
     "total-position-deletes": "0",
     "added-files-size": "4683766",
     "total-delete-files": "0",
     "total-files-size": "4683766",
     "total-data-files": "2",
     "operation": "append"
   }
   ```
   
   Redshift Spectrum can now query the table.

