[I] [Bug] RowId mismatch in file and metadata [paimon]

via GitHub Thu, 04 Dec 2025 02:04:48 -0800


Kkkaneki-k opened a new issue, #6747:
URL: https://github.com/apache/paimon/issues/6747


   ### Search before asking
   
   - [x] I searched in the [issues](https://github.com/apache/paimon/issues) and found nothing similar.
   
   
   ### Paimon version
   
   Master
   
   ### Compute Engine
   
   Spark
   
   ### Minimal reproduce step
   
   ```
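   // first part: create a row-tracking table and insert ids 1 to 3 through a single partition
   spark.sql("CREATE TABLE t (id INT, data INT) TBLPROPERTIES ('row-tracking.enabled' = 'true')")
   spark.sql("INSERT INTO t SELECT /*+ REPARTITION(1) */ id, id AS data FROM range(1, 4)")
   
   // second part: update one row, which rewrites the data into a new file
   spark.sql("UPDATE t SET data = 22 WHERE id = 2")
   
   // third part: insert two more rows and inspect the row-tracking columns
   spark.sql("INSERT INTO t VALUES (4, 4), (5, 5)")
   spark.sql("SELECT *, _ROW_ID, _SEQUENCE_NUMBER FROM t").show()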
   /* the result of select
   +---+----+-------+----------------+
   | id|data|_ROW_ID|_SEQUENCE_NUMBER|
   +---+----+-------+----------------+
   |  1|   1|      0|               1|
   |  2|  22|      1|               2|
   |  3|   3|      2|               1|
   |  4|   4|      6|               3|
   |  5|   5|      7|               3|
   +---+----+-------+----------------+
   */
   ```
   
   ### What doesn't meet your expectations?
   
   When the second part of the code above (the UPDATE) is executed, the original
   rows are read from the old file and written to a new file together with their
   _ROW_ID and _SEQUENCE_NUMBER values. At this point the new file stores both
   columns, but the firstRowId in its file metadata is null. Later, during the
   commit phase, that firstRowId is assigned from the nextRowId of the snapshot.
   The result is a mismatch between the row IDs stored in the file and the
   firstRowId in the metadata. Consequently, a query by row ID may miss records,
   because Paimon core skips files based on the firstRowId in the metadata when
   generating a scan plan.
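   
   The skipping described above can be illustrated with a minimal Scala sketch.
   It is a toy model, not Paimon's planner: `FileMeta`, `mayContain`, and
   `lookup` are hypothetical names, and `firstRowId = 3` for the rewritten file
   is inferred from the reproduction (three rows existed before the UPDATE), not
   read from real metadata.
   
   ```scala
   // Toy model of row-id based file skipping; none of these names are Paimon APIs.
   final case class FileMeta(
       name: String,
       firstRowId: Long,        // value recorded in the file metadata
       rowCount: Long,
       storedRowIds: Seq[Long]) // _ROW_ID values physically present in the file

   object RowIdLookup {
     // Planner-side check: uses the metadata range only, never opens the file.
     def mayContain(f: FileMeta, rowId: Long): Boolean =
       rowId >= f.firstRowId && rowId < f.firstRowId + f.rowCount

     // Names of files that survive pruning and really hold the requested row id.
     def lookup(files: Seq[FileMeta], rowId: Long): Seq[String] =
       files
         .filter(mayContain(_, rowId))
         .filter(_.storedRowIds.contains(rowId))
         .map(_.name)

     // Assumed state after the UPDATE: the file stores ids 0 to 2,
     // while its metadata claims the range starting at 3.
     val afterUpdate: Seq[FileMeta] =
       Seq(FileMeta("rewritten-file", 3L, 3L, Seq(0L, 1L, 2L)))
   }
   ```
   
   In this model, `RowIdLookup.lookup(RowIdLookup.afterUpdate, 1L)` finds no
   file, even though row ID 1 is stored in `rewritten-file`.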
   Additionally, when the third part of the code (the INSERT) is executed, the
   same problem causes the newly inserted rows to get unexpected _ROW_ID values
   (6 and 7 in the result above, although only the IDs 0 to 2 were in use).
   A visualization of this problem is provided below.
   <img width="1478" height="492" alt="Image" src="https://github.com/user-attachments/assets/94fcbec0-64aa-47ba-834d-58e5c8d14d8a" />
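   
   The three steps of the reproduction can also be replayed in a small Scala
   model to see where the numbers come from. This is a sketch under
   assumptions, not Paimon code: `DataFile`, `Snapshot`, `RowIdModel`, and the
   file names are hypothetical, the UPDATE is assumed to replace the original
   file, and files without stored row IDs are assumed to derive them from
   firstRowId plus the row position.
   
   ```scala
   // Hypothetical model of commit-phase firstRowId assignment (not Paimon classes).
   final case class DataFile(
       name: String,
       rowCount: Long,
       firstRowId: Option[Long],                // None = not assigned yet
       storedRowIds: Option[Seq[Long]] = None) { // ids written into the file, if any

     // Row ids a reader would return: stored ids win, otherwise they are
     // derived from firstRowId plus the row position.
     def visibleRowIds: Seq[Long] =
       storedRowIds.getOrElse(
         firstRowId.toSeq.flatMap(first => (0L until rowCount).map(first + _)))
   }

   final case class Snapshot(nextRowId: Long, files: Seq[DataFile])

   object RowIdModel {
     // Every new file whose firstRowId is unset receives the snapshot's
     // nextRowId, which then advances by the file's row count.
     def commit(
         snapshot: Snapshot,
         newFiles: Seq[DataFile],
         removed: Set[String] = Set.empty): Snapshot = {
       val kept = snapshot.files.filterNot(f => removed.contains(f.name))
       val (next, assigned) =
         newFiles.foldLeft((snapshot.nextRowId, Vector.empty[DataFile])) {
           case ((n, acc), f) =>
             f.firstRowId match {
               case Some(_) => (n, acc :+ f)
               case None    => (n + f.rowCount, acc :+ f.copy(firstRowId = Some(n)))
             }
         }
       Snapshot(next, kept ++ assigned)
     }
   }

   object Replay {
     // first part: three freshly inserted rows
     val s1: Snapshot =
       RowIdModel.commit(Snapshot(0L, Nil), Seq(DataFile("insert-1", 3L, None)))
     // second part: rewritten file carries ids 0 to 2, metadata still unset
     val s2: Snapshot = RowIdModel.commit(
       s1,
       Seq(DataFile("update", 3L, None, Some(Seq(0L, 1L, 2L)))),
       removed = Set("insert-1"))
     // third part: two more freshly inserted rows
     val s3: Snapshot =
       RowIdModel.commit(s2, Seq(DataFile("insert-2", 2L, None)))
   }
   ```
   
   In this model the rewritten file ends up with firstRowId 3 while it stores
   the IDs 0 to 2, and the rows of the last insert become visible with the IDs
   6 and 7, which matches the query result in the reproduction.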
   The same problem exists for the MERGE INTO operation (when only
   'row-tracking.enabled' = 'true' is set). To resolve it, it may be necessary
   to assign the firstRowId in the metadata during the write phase for UPDATE
   and MERGE INTO, rather than delaying it until the commit phase.
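   
   One possible shape of that suggestion, as a hedged Scala sketch rather than
   a patch: `RewrittenFile`, `assignAtWrite`, and `assignAtCommit` are
   hypothetical names. The sketch assumes that the carried-over row IDs of a
   rewritten file are contiguous and that the commit phase only fills in a
   firstRowId that is still unset.
   
   ```scala
   // Hypothetical sketch of write-phase assignment; not Paimon's writer or committer.
   final case class RewrittenFile(
       name: String,
       storedRowIds: Seq[Long],   // _ROW_ID values copied from the old file
       firstRowId: Option[Long])  // metadata value, None until assigned

   object WritePhaseAssignment {
     // Write phase: take firstRowId from the ids being carried over.
     // Only done when the ids are contiguous; otherwise it stays unset.
     def assignAtWrite(f: RewrittenFile): RewrittenFile = {
       val ids = f.storedRowIds
       val contiguous =
         ids.nonEmpty && ids.zip(ids.tail).forall { case (a, b) => b == a + 1 }
       if (contiguous) f.copy(firstRowId = Some(ids.head)) else f
     }

     // Commit phase: fill firstRowId only when it is still unset.
     // Returns the file together with the snapshot's new nextRowId.
     def assignAtCommit(f: RewrittenFile, nextRowId: Long): (RewrittenFile, Long) =
       f.firstRowId match {
         case Some(_) => (f, nextRowId)
         case None =>
           (f.copy(firstRowId = Some(nextRowId)), nextRowId + f.storedRowIds.size)
       }
   }
   ```
   
   With the file from the reproduction (IDs 0 to 2, nextRowId 3), calling
   `assignAtWrite` before `assignAtCommit` keeps firstRowId at 0 and leaves
   nextRowId at 3 in this model, so metadata and file contents agree.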
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] I'm willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

