rajagopal-ravikumar opened a new issue, #6772:
URL: https://github.com/apache/paimon/issues/6772

   ### Search before asking
   
   - [x] I searched in the [issues](https://github.com/apache/paimon/issues) 
and found nothing similar.
   
   
   ### Paimon version
   
   master commit : `0e64d8f637dab78eb45426f30e7f31efefafa3fb`
   
   ### Compute Engine
   
   Flink
   
   ### Minimal reproduce step
   
   **Bug Description**
   When ingesting with flink-cdc into a Paimon table that has Iceberg compatibility enabled, data written after schema evolution is not visible to Iceberg readers (Athena, Spark, Trino, etc.), even though the schema evolution itself succeeds.
   
     Root Cause: After schema evolution, Paimon:
     - ✅ Creates new data files with the evolved schema
     - ✅ Updates Iceberg snapshot metadata (increments record count)
     - ✅ Updates Iceberg schema metadata (adds new columns)
     - ❌ FAILS to add new data files to Iceberg manifest lists
   
     This causes Iceberg readers to only see data written before schema 
evolution.
   
   **Steps to reproduce**
     1. Create a Paimon table with Iceberg compatibility enabled:
   ```yaml
   sink:
     type: paimon
     name: Paimon Sink
     table.properties.changelog-producer: input
     table.properties.write-only: false
     table.properties.write-buffer-size: 512mb
     table.properties.bucket: 2
     table.properties.consumer.expiration-time: 7d
     table.properties.snapshot.time-retained: 24h
     table.properties.sequence.field: __internal__op_ts
     catalog.properties.metastore: filesystem
     table.properties.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
     table.properties.metadata.iceberg.database: <database name>
     # catalog and table name must be set to unique values per test run to avoid conflicts
     catalog.properties.warehouse: s3a://<path>
     table.properties.metadata.iceberg.table: <table name>
     table.properties.metadata.iceberg.storage-location: table-location
     table.properties.metadata.iceberg.manifest-legacy-version: true
     table.properties.metadata.iceberg.manifest-compression: snappy
     table.properties.metadata.iceberg.uri: ''
     table.properties.metadata.iceberg.hive-conf-dir: ''
     table.properties.metadata.iceberg.hadoop-conf-dir: ''
     table.properties.metadata.iceberg.previous-versions-max: 24
     table.properties.metadata.iceberg.storage: hive-catalog
     table.properties.metadata.iceberg.hive-client-class: com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient
     table.properties.metadata.iceberg.glue.skip-archive: true
   ```
   
     2. Insert initial data (e.g., 200 rows)
       - Paimon creates data files and Iceberg manifest
       - Data is queryable via Iceberg ✓
       
     3. Perform schema evolution (add a column):
   ```sql
   ALTER TABLE test_table ADD COLUMN new_col VARCHAR(100);
   ```
     4. Insert new data (e.g., 3 rows) with the new column
       - Paimon creates new data files with evolved schema
       - New data is queryable via Paimon ✓
     5. Query via Iceberg (Athena/Spark):
   ```sql
   SELECT COUNT(*) FROM iceberg_table;
   ```
     - Expected: 203 rows
     - Actual: 0 rows ❌
   ```sql
   SELECT COUNT(*) FROM iceberg_table WHERE new_col LIKE '%value%';
   ```
     - Expected: 3 rows
     - Actual: 0 rows ❌
   
   **Evidence**
   
   Using the reproduction case above, I inspected the S3 files:
   
   1. Schema Evolution Succeeded:
   
   v4.metadata.json (the latest metadata file) shows both schemas:
   ```
   "schemas": [
     {"schema-id": 0, "fields": [...]},           # Original schema
     {"schema-id": 1, "fields": [..., "new_col"]} # Evolved schema with new column
   ]
   "current-schema-id": 1  ✓
   ```
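The schema check above can be scripted with only the stdlib `json` module. A minimal sketch, run against a stand-in dict that mirrors the v4.metadata.json excerpt (not the real file):

```python
import json  # in practice: metadata = json.load(open("v4.metadata.json"))

def schema_summary(metadata: dict):
    """Return the schema ids recorded in an Iceberg metadata.json
    together with the current-schema-id."""
    ids = [s["schema-id"] for s in metadata["schemas"]]
    return ids, metadata["current-schema-id"]

# Stand-in dict mirroring the excerpt above.
metadata = {
    "schemas": [
        {"schema-id": 0, "fields": ["..."]},
        {"schema-id": 1, "fields": ["...", "new_col"]},
    ],
    "current-schema-id": 1,
}
ids, current = schema_summary(metadata)
print(ids, current)  # both schemas present, evolved one current
```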
   
   2. Data Files Exist:
   ```
   $ aws s3 ls s3://bucket/warehouse/db/table/bucket-0/
   2025-12-06 22:47:48  data-d531adf4-...-1.parquet  # Original data (99 rows)
                        data-3d227847-...-1.parquet  # Original data (101 rows)
   2025-12-06 22:48:12  data-e1fff451-...-1.parquet  # New data AFTER evolution (3 rows)
   ```
   The new parquet file contains the new column:
   ```
   $ python3 -c "import pyarrow.parquet as pq; print(pq.read_table('/tmp/new-data.parquet').schema)"
   ...
   new_col: string
     -- field metadata --
     PARQUET:field_id: '6'
   Number of rows: 3
   ```
   
   3. Iceberg Manifest Missing New Data File:
   ```
   $ # Check the manifest referenced by the latest snapshot (snap-4)
   $ # Only the two original data files are listed:
   data-d531adf4-...-1.parquet  (99 records)   # Old file
   data-3d227847-...-1.parquet  (101 records)  # Old file
   # Missing: data-e1fff451-...-1.parquet  (3 records) ❌
   ```
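The missing-file check in step 3 boils down to a set difference between the data files actually in storage and the paths referenced by the snapshot's manifests. A sketch using the file names from the evidence above; in a real check, `referenced` would be collected by parsing the snapshot's Avro manifest-list and manifest files:

```python
def unreferenced_data_files(on_disk: set, referenced: set) -> set:
    """Data files that exist in storage but are absent from the
    current snapshot's manifests -- invisible to Iceberg readers."""
    return on_disk - referenced

# File names taken from the S3 listing and manifest dump above.
on_disk = {
    "data-d531adf4-...-1.parquet",
    "data-3d227847-...-1.parquet",
    "data-e1fff451-...-1.parquet",  # written after schema evolution
}
referenced = {
    "data-d531adf4-...-1.parquet",
    "data-3d227847-...-1.parquet",
}
print(unreferenced_data_files(on_disk, referenced))
# only the post-evolution file is unreferenced
```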
   
   4. Snapshot Inconsistency:
   ```
     // v4.metadata.json snapshots
     {
       "snapshot-id": 1,
       "summary": {
         "total-records": "200",
         "total-data-files": "2",
         "added-data-files": "2"  // ✓
       }
     },
     {
       "snapshot-id": 2,
       "summary": {
         "total-records": "203",     // Knows 3 records added!
         "total-data-files": "2",    // But still only 2 files
         "added-data-files": "0"     // ❌ New file not tracked
       }
     }
   ```
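The inconsistency in step 4 can be detected mechanically: a snapshot whose total-records grows while added-data-files stays at zero claims records with no backing files. A minimal sketch over the summaries above:

```python
def inconsistent_snapshots(snapshots: list) -> list:
    """Flag snapshot ids where total-records grew relative to the
    previous snapshot but no new data files were recorded."""
    flagged, prev_records = [], 0
    for snap in snapshots:
        summary = snap["summary"]
        records = int(summary["total-records"])
        added = int(summary["added-data-files"])
        if records > prev_records and added == 0:
            flagged.append(snap["snapshot-id"])
        prev_records = records
    return flagged

# Summaries copied from the v4.metadata.json excerpt above.
snapshots = [
    {"snapshot-id": 1, "summary": {"total-records": "200",
                                   "total-data-files": "2",
                                   "added-data-files": "2"}},
    {"snapshot-id": 2, "summary": {"total-records": "203",
                                   "total-data-files": "2",
                                   "added-data-files": "0"}},
]
print(inconsistent_snapshots(snapshots))  # snapshot 2 is flagged
```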
   
   ### What doesn't meet your expectations?
   
     Expected Behavior
   
     After schema evolution, when Paimon writes new data files:
     1. New data files should be added to a new Iceberg manifest file
     2. New manifest should be referenced in the new snapshot
     3. Iceberg readers should see all data (old + new)
   
     ---
     Actual Behavior
   
     After schema evolution:
     1. Paimon creates new data files
     2. New data files are never added to any Iceberg manifest
     3. Iceberg readers can only see data written before schema evolution
     4. Snapshot metadata incorrectly reports record counts without 
corresponding data files
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [x] I'm willing to submit a PR!

