Hi, I have been working on Hudi for some time and have a suggestion for an improvement.
When building a CDC pipeline, the field used for partitioning is usually a date (created_at), typically in the format yyyy-MM-dd HH:mm:ss.S. If this field is formatted as yyyy/MM/dd for partitioning, Hive queries that fetch data between two dates, which is the usual case, become much more complex. For example:

1. If the partitions are in the format yyyy/MM/dd, the query to select data for all days between 2015-01-01 and 2015-03-01 looks like this:

   SELECT * FROM db.table WHERE year=2015 AND ((month=01 OR month=02) OR (month=03 AND day=01))

2. If the partitions are instead in the format yyyy-MM-dd or yyyyMMdd, the data can be queried directly. The same query then looks like this:

   SELECT * FROM db.table WHERE DateStamp BETWEEN '2015-01-01' AND '2015-03-01'

Reference: https://community.hortonworks.com/questions/29031/best-pratices-for-hive-partitioning-especially-by.html

The proposal is to make yyyy-MM-dd the default partition format, or at least to provide an option to change the format.

Please share your thoughts on the above. I have raised a JIRA for this: HUDI-206 <https://issues.apache.org/jira/browse/HUDI-206>.

Regards,
Pratyaksh
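
P.S. To make the comparison concrete, here is a rough sketch of the idea in Python. It is only an illustration and not Hudi code: partition_path is a hypothetical helper name, and the sample created_at values are made up. It derives a partition value from created_at in either format, and shows that with yyyy-MM-dd a plain string comparison is enough to select a date range.

```python
from datetime import datetime

# Assumed input format of created_at: yyyy-MM-dd HH:mm:ss.S
CREATED_AT_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def partition_path(created_at: str, fmt: str = "%Y-%m-%d") -> str:
    """Hypothetical helper: derive a partition value from created_at.

    fmt="%Y-%m-%d" gives a single-level partition (yyyy-MM-dd);
    fmt="%Y/%m/%d" gives the nested year/month/day layout.
    """
    ts = datetime.strptime(created_at, CREATED_AT_FORMAT)
    return ts.strftime(fmt)


# Made-up sample values, for illustration only.
rows = [
    "2014-12-31 23:59:59.0",
    "2015-01-01 00:00:00.0",
    "2015-02-14 10:30:00.0",
    "2015-03-01 08:00:00.0",
    "2015-03-02 00:00:00.0",
]

# Single-level partitions sort lexicographically in date order, so a
# range filter is one comparison, like BETWEEN in the Hive query.
single_level = [partition_path(r) for r in rows]
in_range = [p for p in single_level if "2015-01-01" <= p <= "2015-03-01"]

# The nested layout for the same row, for comparison.
nested = partition_path("2015-02-14 10:30:00.0", "%Y/%m/%d")

print(in_range)
print(nested)
```

Making the format a parameter, as in the sketch, is the kind of option the proposal asks for.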
