jonvex opened a new pull request, #11717: URL: https://github.com/apache/hudi/pull/11717
### Change Logs

With the timestamp keygen, the partition column can hold full timestamps while the keygen creates partitions at a coarser granularity, such as one per day. All records with a timestamp on 7-31-2024 then go to the same partition, even though the values in the partition column differ by hours, minutes, etc.

This causes a problem with partition pruning. Let's say you query "select * from table where partition < 7-31-2024 at 7am and partition > 7-31-2024 at 6am". Since the file structure only has the partition 7-31-2024, that value is interpreted as 7-31-2024 at 12am, which is not greater than 6am. So the partition is pruned from the search space, even though it can contain matching records.

This PR fixes the issue by rounding the query values based on the output format. The format here is year month day, so the values are rounded to the day. The query for partition pruning then becomes "select * from table where partition < 7-31-2024 and partition > 7-31-2024". This still does not yield any results, because it requires the partition to be both less than and greater than the same day. To fix that, we also replace any < or > with <= and >=. So now the query is "select * from table where partition <= 7-31-2024 and partition >= 7-31-2024". 7-31-2024 will no longer be pruned, and the original filter will still be applied by Spark. (We can replace all < and > because we are only looking at partition filters in a simple timestamp keygen scenario.)

This does not fix COW or MOR read-optimized (RO) queries, because we treat those as just plain parquet tables and Spark handles the partition pruning.

### Impact

Fixes a partition pruning bug that affects some query scenarios on tables using the timestamp keygen.

### Risk level (write none, low medium or high below)

low

### Documentation Update

N/A

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
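
The filter rewrite described in the Change Logs can be sketched roughly as below. This is only an illustration in Python, not the code in the PR: every name (`OUTPUT_FORMAT`, `relax_filter`, `partition_survives`) is hypothetical, filters are modeled as simple `(operator, timestamp)` pairs, and "rounding" is assumed to mean dropping every field the output format does not contain.

```python
import operator
from datetime import datetime

# Hypothetical names throughout; this is not Hudi's actual implementation.
OUTPUT_FORMAT = "%Y-%m-%d"  # assumed keygen output format: year, month, day

COMPARE = {"<": operator.lt, "<=": operator.le,
           ">": operator.gt, ">=": operator.ge, "=": operator.eq}

# Strict comparisons are widened to their inclusive counterparts.
RELAXED_OP = {"<": "<=", ">": ">=", "<=": "<=", ">=": ">=", "=": "="}


def relax_filter(op, value, fmt=OUTPUT_FORMAT):
    """Round a filter value to the partition granularity and widen the operator."""
    # Formatting and re-parsing drops every field the output format omits
    # (an assumption about what "rounding" means here).
    rounded = datetime.strptime(value.strftime(fmt), fmt)
    return RELAXED_OP[op], rounded


def partition_survives(partition_path, filters, fmt=OUTPUT_FORMAT):
    """Return True if the partition is kept after applying all filters."""
    # A day-level path such as "2024-07-31" parses to midnight of that day.
    partition = datetime.strptime(partition_path, fmt)
    return all(COMPARE[op](partition, value) for op, value in filters)


# partition < 7-31-2024 at 7am and partition > 7-31-2024 at 6am
original = [("<", datetime(2024, 7, 31, 7)), (">", datetime(2024, 7, 31, 6))]
relaxed = [relax_filter(op, value) for op, value in original]

pruned_before = not partition_survives("2024-07-31", original)
kept_after = partition_survives("2024-07-31", relaxed)
```

In this sketch the original filters prune the 2024-07-31 partition, while the relaxed filters keep it and still prune neighboring days. The relaxed filters only decide which partitions are read; the original, stricter predicate is still needed afterwards to filter individual records.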
