vamshikrishnakyatham opened a new issue, #14016:
URL: https://github.com/apache/hudi/issues/14016

   ### Bug Description
   
   **What happened:**
   
   A CDC query returns a different number of rows when ORDER BY ts_ms is added. 
For example, 9 rows are returned without ORDER BY and 6 rows with it. Some 
records are missing from the ordered result.
   
   **What you expected:**
   
   The CDC query should return the same number of rows regardless of whether an 
ORDER BY clause is used. Ordering should not affect which records are included 
in the result set.
   
   **Steps to reproduce:**
   1. Create a Hudi table with CDC enabled and perform insert/update/delete 
operations
   2. Run CDC query without ORDER BY: SELECT op, ts_ms, ... FROM 
hudi_table_changes('table_path', 'cdc', 'earliest')
   3. Run the same CDC query with ORDER BY: SELECT op, ts_ms, ... FROM 
hudi_table_changes('table_path', 'cdc', 'earliest') ORDER BY ts_ms ASC
   4. Observe different row counts and results between the two queries
   
   Run results:
   
   ```
   spark-sql (default)> SELECT op, ts_ms, get_json_object(before, '$.ts') AS 
before_ts, get_json_object(before, '$.rider') as before_rider, 
get_json_object(after,  '$.rider') AS after_rider FROM 
hudi_table_changes('file:///tmp/hudi_test_table', 'cdc', 'earliest')
                      > ;
   i       20250924110448254       NULL    NULL    rider-E
   i       20250924110628340       NULL    NULL    rider-G
   u       20250924110905831       1695516137      rider-G rider-E
   i       20250924110448254       NULL    NULL    rider-C
   u       20250924110905831       1695091554      rider-C rider-E
   i       20250924110448254       NULL    NULL    rider-A
   i       20250924110448254       NULL    NULL    rider-F
   u       20250924110551107       1695516137      rider-F rider-E
   d       20250924110611165       1695516137      rider-E NULL
   Time taken: 0.141 seconds, Fetched 9 row(s)
   
   spark-sql (default)> SELECT op, ts_ms, get_json_object(before, '$.ts') AS 
before_ts, get_json_object(before, '$.rider') as before_rider, 
get_json_object(after,  '$.rider') AS after_rider FROM 
hudi_table_changes('file:///tmp/hudi_test_table', 'cdc', 'earliest') order by 
ts_ms asc;
   i       20250924110448254       NULL    NULL    rider-A
   u       20250924110551107       1695516137      rider-F rider-E
   d       20250924110611165       1695516137      rider-E NULL
   i       20250924110628340       NULL    NULL    rider-G
   u       20250924110905831       1695516137      rider-G rider-E
   u       20250924110905831       1695091554      rider-C rider-E
   Time taken: 0.214 seconds, Fetched 6 row(s)
   ```
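
To make the discrepancy easier to see, here is a minimal Python sketch that diffs the two result sets. The rows are copied by hand from the two runs; the helper name `missing_rows` and the tuple layout are illustrative assumptions and not part of Hudi or Spark.

```python
from collections import Counter

# Rows copied from the runs in this report, as
# (op, ts_ms, before_ts, before_rider, after_rider); NULL is written as None.
unordered = [
    ("i", "20250924110448254", None, None, "rider-E"),
    ("i", "20250924110628340", None, None, "rider-G"),
    ("u", "20250924110905831", "1695516137", "rider-G", "rider-E"),
    ("i", "20250924110448254", None, None, "rider-C"),
    ("u", "20250924110905831", "1695091554", "rider-C", "rider-E"),
    ("i", "20250924110448254", None, None, "rider-A"),
    ("i", "20250924110448254", None, None, "rider-F"),
    ("u", "20250924110551107", "1695516137", "rider-F", "rider-E"),
    ("d", "20250924110611165", "1695516137", "rider-E", None),
]
ordered = [
    ("i", "20250924110448254", None, None, "rider-A"),
    ("u", "20250924110551107", "1695516137", "rider-F", "rider-E"),
    ("d", "20250924110611165", "1695516137", "rider-E", None),
    ("i", "20250924110628340", None, None, "rider-G"),
    ("u", "20250924110905831", "1695516137", "rider-G", "rider-E"),
    ("u", "20250924110905831", "1695091554", "rider-C", "rider-E"),
]


def missing_rows(full, subset):
    """Hypothetical helper: multiset difference, rows in `full` absent from `subset`."""
    diff = Counter(full) - Counter(subset)
    # Sort by ts_ms, then after_rider, so the output is stable.
    return sorted(diff.elements(), key=lambda r: (r[1], r[4] or ""))


missing = missing_rows(unordered, ordered)
for row in missing:
    print(row)
```

On the data above, the three missing rows are all inserts with ts_ms 20250924110448254 (rider-C, rider-E and rider-F). Only the rider-A insert at that ts_ms survives in the ordered result, and the ordered result contains no rows that are absent from the unordered one.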
   
   ### Environment
   
   **Hudi version:** 1.1
   **Query engine:** Spark (the runs above use the spark-sql shell)
   **Relevant configs:**
   
   
   ### Logs and Stack Trace
   
   _No response_

