Hi, 

I have been working on Hudi for some time and have a suggestion for an 
improvement. 

When we build a CDC pipeline, the field used for partitioning is generally a 
date (created_at), and its usual format is yyyy-MM-dd HH:mm:ss.S. If we format 
this field as yyyy/MM/dd, Hive queries that fetch data between two dates, which 
is the usual case, become much more complex. For example, 

1. If the partitions are in the format yyyy/MM/dd, then the query to select 
data for all days between 2015-01-01 and 2015-03-01 would look like this: 

SELECT * FROM db.table where year=2015 and ((month=01 or month=02) or (month=03 
and day=01))

2. If instead the partitions are in the format yyyy-MM-dd or yyyyMMdd, the 
data can be queried directly with a range predicate. 
E.g., the above query would look like this: 

SELECT * from db.table where DateStamp between '2015-01-01' and '2015-03-01'


Reference - 
https://community.hortonworks.com/questions/29031/best-pratices-for-hive-partitioning-especially-by.html

The proposal is to make yyyy-MM-dd the default partition format, or at least 
to provide an option to change the format. 
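
To make the idea of a changeable format concrete, here is a minimal sketch in 
plain Java (java.time only). It is not Hudi's API: the class name 
PartitionPathSketch, the method toPartitionPath and the outputPattern 
parameter are hypothetical names used only for illustration. It assumes 
created_at arrives as a string in the yyyy-MM-dd HH:mm:ss.S format mentioned 
above. 

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Hudi: shows how a configurable output
// pattern could turn a created_at value into a partition path.
final class PartitionPathSketch {

    // Input pattern assumed from the created_at format described in this mail.
    private static final DateTimeFormatter INPUT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.S");

    private PartitionPathSketch() {
    }

    // outputPattern stands in for the proposed user-configurable setting,
    // e.g. "yyyy/MM/dd", "yyyy-MM-dd" or "yyyyMMdd".
    static String toPartitionPath(String createdAt, String outputPattern) {
        LocalDateTime timestamp = LocalDateTime.parse(createdAt, INPUT);
        return timestamp.format(DateTimeFormatter.ofPattern(outputPattern));
    }
}
```

With this sketch, toPartitionPath("2015-03-01 10:15:30.0", "yyyy-MM-dd") 
gives 2015-03-01, while the pattern "yyyy/MM/dd" gives 2015/03/01 for the same 
input. Note that the month is written as MM in these patterns, since 
lowercase mm means minutes. 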

Please share your thoughts on the above. The JIRA raised for this is here 
<https://issues.apache.org/jira/browse/HUDI-206> (HUDI-206).


Regards, 
Pratyaksh
