TheR1sing3un opened a new issue, #14365: URL: https://github.com/apache/hudi/issues/14365
### Bug Description

**What happened:**

Recently, we encountered a compaction in which some of the data written to one of the involved log files did not appear in the new base file. After investigation, we found that this log file was not included in the compaction plan, even though its completion time was significantly earlier than that of the compaction instant. In our scenario, there is an interval of nearly three days between writing and merging.

> How did we investigate

1. First, we found that the log file was filtered out here:

   <img width="1157" height="778" alt="Image" src="https://github.com/user-attachments/assets/670d83e6-2a45-40e3-aa40-138d98c5f5a8" />

2. It was filtered out while we were looking up the completion time of an instant from three days ago. This instant had already been archived. We also queried its completion time in the archive timeline directly and found that it was three days ago as well, which is correct.
3. However, the completion time was not obtained correctly here. Instead, null was returned, so the log file was judged to have been completed after the compaction instant and was therefore filtered out of the plan.

> Then why didn't the compaction task retrieve the completion time from the archive timeline?

Let's construct a case:

- There are 10 instants in total, numbered 1 to 10, and instants 1 to 6 have all been archived.
- Instants 1 to 2 are archived into the archive parquet file 1_2.
- Instants 3 to 6 are archived into the archive parquet file 3_6.

`archived: [1_2.parquet, 3_6.parquet] ; active: [7-10]`

1. Initialize the `CompletionTimeQueryViewV2`; the cursor is positioned at the first active instant:

   <img width="1395" height="294" alt="Image" src="https://github.com/user-attachments/assets/af566899-6b3e-4100-a5c3-053a4dd01b35" />

2. Now instants 7 to 10 are stored in memory.
3. We try to get the completion time for instant 5, which triggers a lazy load of the instants starting from 5:

   <img width="934" height="318" alt="Image" src="https://github.com/user-attachments/assets/67dd4aa0-1a0d-4c99-bf85-b213f91a1054" />

4. In the scanning and loading logic that follows, we scan file 3_6.parquet, read instants 5 and 6 from it, and store them in memory:

   <img width="1307" height="756" alt="Image" src="https://github.com/user-attachments/assets/65798860-a2f3-46f6-895c-46fd9b837a15" />

5. Next, we try to get the completion time for instant 4. This triggers a lazy load again, this time with the filter [4, 5).
6. This time, however, we cannot obtain the correct completion time, because reading 3_6.parquet is skipped, and instant 4 is stored in exactly this file:

   <img width="1343" height="796" alt="Image" src="https://github.com/user-attachments/assets/a39ad201-9f8e-49d7-b6a7-58d3b25bb8f4" />

7. The file is filtered out because the case where the filter's boundaries are entirely contained within the file's min/max range has not been taken into consideration:

   <img width="1048" height="230" alt="Image" src="https://github.com/user-attachments/assets/c4dab8a0-1aa3-45aa-9bbd-791f02941fca" />

**Steps to reproduce:**

You can reproduce this with the following simple unit test:

<img width="961" height="474" alt="Image" src="https://github.com/user-attachments/assets/046fc927-ce73-43a7-834e-f0dd05f8b020" />

<img width="1295" height="279" alt="Image" src="https://github.com/user-attachments/assets/7d247e12-0554-4ebb-aab7-0ec429de36d5" />

### Environment

**Hudi version:** 1.x

**Query engine:** (Spark/Flink/Trino etc)

**Relevant configs:**

### Logs and Stack Trace

_No response_
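
The file-skipping behaviour described in the walkthrough can be modelled with a small, self-contained Python sketch. It is not Hudi code: the names `ArchiveFile`, `endpoint_only_check`, `overlap_check` and `files_to_read` are hypothetical, instants are plain integers as in the example above, and the filter range [5, 7) for the first lazy load is an assumption. The "endpoint only" check is one plausible form of the faulty logic (keep a file only if its min or max instant falls inside the filter); the actual check is shown only in the screenshots.

```python
from typing import Callable, List, NamedTuple


class ArchiveFile(NamedTuple):
    """Hypothetical stand-in for an archive parquet file named <min>_<max>."""
    min_instant: int
    max_instant: int


def endpoint_only_check(f: ArchiveFile, start: int, end: int) -> bool:
    # Assumed faulty logic: the file is read only if one of its endpoints
    # lies inside the half-open filter range [start, end).
    return start <= f.min_instant < end or start <= f.max_instant < end


def overlap_check(f: ArchiveFile, start: int, end: int) -> bool:
    # Interval-overlap test between the closed file range [min, max] and the
    # half-open filter range [start, end). It also covers the case where the
    # filter lies strictly inside the file's range.
    return f.min_instant < end and f.max_instant >= start


# Archived files from the example: 1_2.parquet and 3_6.parquet.
ARCHIVED = [ArchiveFile(1, 2), ArchiveFile(3, 6)]


def files_to_read(check: Callable[[ArchiveFile, int, int], bool],
                  start: int, end: int) -> List[ArchiveFile]:
    """Return the archive files that the given check would select for reading."""
    return [f for f in ARCHIVED if check(f, start, end)]


if __name__ == "__main__":
    # First lazy load (assumed filter [5, 7)): both checks select 3_6.
    print(files_to_read(endpoint_only_check, 5, 7))
    # Second lazy load with filter [4, 5): the endpoint-only check selects
    # nothing, so instant 4 is never loaded from 3_6.
    print(files_to_read(endpoint_only_check, 4, 5))
    print(files_to_read(overlap_check, 4, 5))
```

Under these assumptions, the endpoint-only check misses 3_6.parquet for the filter [4, 5) because neither 3 nor 6 falls inside that range, while the overlap test still selects the file.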
