rajagopal-ravikumar opened a new issue, #6772: URL: https://github.com/apache/paimon/issues/6772
### Search before asking

- [x] I searched in the [issues](https://github.com/apache/paimon/issues) and found nothing similar.

### Paimon version

master commit: `0e64d8f637dab78eb45426f30e7f31efefafa3fb`

### Compute Engine

Flink

### Minimal reproduce step

**Bug Description**

When data is ingested with flink-cdc into a Paimon table that has Iceberg compatibility enabled, data written after schema evolution is not visible to Iceberg readers (Athena, Spark, Trino, etc.), even though the schema evolution itself succeeds.

Root Cause: After schema evolution, Paimon:

- ✅ Creates new data files with the evolved schema
- ✅ Updates Iceberg snapshot metadata (increments record count)
- ✅ Updates Iceberg schema metadata (adds new columns)
- ❌ FAILS to add new data files to Iceberg manifest lists

This causes Iceberg readers to only see data written before schema evolution.

**Steps to reproduce**

1. Create a Paimon table with Iceberg compatibility enabled:

```
sink:
  type: paimon
  name: Paimon Sink
  table.properties.changelog-producer: input
  table.properties.write-only: false
  table.properties.write-buffer-size: 512mb
  table.properties.bucket: 2
  table.properties.consumer.expiration-time: 7d
  table.properties.snapshot.time-retained: 24h
  table.properties.sequence.field: __internal__op_ts
  catalog.properties.metastore: filesystem
  table.properties.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
  table.properties.metadata.iceberg.database: <database name>
  # catalog and table name must be set to unique values per test run to avoid conflicts
  catalog.properties.warehouse: s3a://<path>
  table.properties.metadata.iceberg.table: <table name>
  table.properties.metadata.iceberg.storage-location: table-location
  table.properties.metadata.iceberg.manifest-legacy-version: true
  table.properties.metadata.iceberg.manifest-compression: snappy
  table.properties.metadata.iceberg.uri: ''
  table.properties.metadata.iceberg.hive-conf-dir: ''
  table.properties.metadata.iceberg.hadoop-conf-dir: ''
  table.properties.metadata.iceberg.previous-versions-max: 24
  table.properties.metadata.iceberg.storage: hive-catalog
  table.properties.metadata.iceberg.hive-client-class: com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient
  table.properties.metadata.iceberg.glue.skip-archive: true
```

2. Insert initial data (e.g., 200 rows)
   - Paimon creates data files and Iceberg manifest
   - Data is queryable via Iceberg ✓

3. Perform schema evolution (add a column):

```
ALTER TABLE test_table ADD COLUMN new_col VARCHAR(100);
```

4. Insert new data (e.g., 3 rows) with the new column
   - Paimon creates new data files with evolved schema
   - New data is queryable via Paimon ✓

5. Query via Iceberg (Athena/Spark):

```
SELECT COUNT(*) FROM iceberg_table;
```

- Expected: 203 rows
- Actual: 0 rows ❌

```
SELECT COUNT(*) FROM iceberg_table where new_col like '%value%';
```

- Expected: 3 rows
- Actual: 0 rows ❌

**Evidence**

Using the reproduction case above, I inspected the S3 files:

1. Schema Evolution Succeeded: v4.metadata.json (latest metadata.json) shows two schemas

```
"schemas": [
  {"schema-id": 0, "fields": [...]},            # Original schema
  {"schema-id": 1, "fields": [..., "new_col"]}  # Evolved schema with new column
]
"current-schema-id": 1 ✓
```

2. Data Files Exist:

```
$ aws s3 ls s3://bucket/warehouse/db/table/bucket-0/
2025-12-06 22:47:48 data-d531adf4-...-1.parquet  # Original data (99 rows)
                    data-3d227847-...-1.parquet  # Original data (101 rows)
2025-12-06 22:48:12 data-e1fff451-...-1.parquet  # New data AFTER evolution (3 rows)
```

New parquet file contains the new column

```
$ python3 -c "import pyarrow.parquet as pq; print(pq.read_table('/tmp/new-data.parquet').schema)"
...
new_col: string
  -- field metadata --
  PARQUET:field_id: '6'
Number of rows: 3
```

3. Iceberg Manifest Missing New Data File:

```
$ # Check manifest referenced by latest snapshot (snap-4)
$ # Only shows 2 data files (the original ones):
data-d531adf4-...-1.parquet (99 records)   # Old file
data-3d227847-...-1.parquet (101 records)  # Old file
# Missing: data-e1fff451-...-1.parquet (3 records) ❌
```

4. Snapshot Inconsistency:

```
// v4.metadata.json snapshots
{
  "snapshot-id": 1,
  "summary": {
    "total-records": "200",
    "total-data-files": "2",
    "added-data-files": "2"   // ✓
  }
},
{
  "snapshot-id": 2,
  "summary": {
    "total-records": "203",   // Knows 3 records added!
    "total-data-files": "2",  // But still only 2 files
    "added-data-files": "0"   // ❌ New file not tracked
  }
}
```

### What doesn't meet your expectations?

Expected Behavior

After schema evolution, when Paimon writes new data files:

1. New data files should be added to a new Iceberg manifest file
2. New manifest should be referenced in the new snapshot
3. Iceberg readers should see all data (old + new)

---

Actual Behavior

After schema evolution:

1. Paimon creates new data files
2. New data files are never added to any Iceberg manifest
3. Iceberg readers can only see data written before schema evolution
4. Snapshot metadata incorrectly reports record counts without corresponding data files

### Anything else?

_No response_

### Are you willing to submit a PR?

- [x] I'm willing to submit a PR!
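
The snapshot inconsistency shown in the evidence (the record count grows while `added-data-files` stays at 0) can be detected mechanically. The following is a minimal Python sketch, not part of Paimon or Iceberg: `find_untracked_appends` is a hypothetical helper name, and it assumes that the `snapshots` array in `metadata.json` is in commit order and that the summary keys match the excerpt in this report. The sample values are copied from that v4.metadata.json excerpt.

```python
import json


def find_untracked_appends(metadata: dict) -> list:
    """Return ids of snapshots whose summary reports more total records
    than the previous snapshot while reporting zero added data files.

    Hypothetical helper; assumes snapshots are listed in commit order.
    """
    suspicious = []
    previous_total = 0
    for snapshot in metadata.get("snapshots", []):
        summary = snapshot.get("summary", {})
        # Summary values are stored as strings in the excerpt, so convert.
        total_records = int(summary.get("total-records", 0))
        added_files = int(summary.get("added-data-files", 0))
        if total_records > previous_total and added_files == 0:
            suspicious.append(snapshot["snapshot-id"])
        previous_total = total_records
    return suspicious


# Values copied from the snapshot excerpt in this report.
sample = json.loads("""
{
  "snapshots": [
    {"snapshot-id": 1,
     "summary": {"total-records": "200",
                 "total-data-files": "2",
                 "added-data-files": "2"}},
    {"snapshot-id": 2,
     "summary": {"total-records": "203",
                 "total-data-files": "2",
                 "added-data-files": "0"}}
  ]
}
""")

print(find_untracked_appends(sample))  # → [2]
```

Against a real table, the same function could be fed the parsed contents of the latest `metadata.json`; a non-empty result would point at snapshots whose new data files were never recorded in a manifest.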
