Hi Pratyaksh,
The partitioning format is pluggable in Hudi.
1. For Hudi writing, you can use one of the several implementations of 
org.apache.hudi.KeyGenerator, or write your own implementation, to control 
the partition path format. You can configure the key generator class using 
https://hudi.incubator.apache.org/configurations.html#KEYGENERATOR_CLASS_OPT_KEY
2. For Hive syncing, there are again some default implementations of 
org.apache.hudi.hive.PartitionValueExtractor. You can also write your own 
partition value extractor and configure it using 
https://hudi.incubator.apache.org/configurations.html#HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY
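
To illustrate the kind of logic a custom key generator would carry, here is a 
minimal sketch in plain Python. It does not use the Hudi API: the helper name 
partition_path, the constant SOURCE_FORMAT and the sample timestamp are all 
hypothetical, and the input format is assumed to be the yyyy-MM-dd HH:mm:ss.S 
layout mentioned in the original mail.

```python
from datetime import datetime

# Assumption: created_at arrives as yyyy-MM-dd HH:mm:ss.S (Java notation),
# which corresponds to this strptime pattern in Python.
SOURCE_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def partition_path(created_at: str, output_format: str = "%Y/%m/%d") -> str:
    """Hypothetical helper: derive a partition path from a created_at value."""
    parsed = datetime.strptime(created_at, SOURCE_FORMAT)
    return parsed.strftime(output_format)


# Slash-separated layout (year/month/day as separate path levels).
print(partition_path("2015-01-01 10:15:30.0"))
# Single date-valued layout, as proposed in the original mail.
print(partition_path("2015-01-01 10:15:30.0", "%Y-%m-%d"))
```

Only the output pattern differs between the two layouts, which is why making 
the format configurable is a small change on the writing side.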

Thanks,
Balaji.V

On Tuesday, August 13, 2019, 03:23:57 AM PDT, Pratyaksh Sharma <[email protected]> wrote:
 
 Hi, 

I have been working on Hudi for some time and have an improvement suggestion. 

When we build a CDC pipeline, the field used for partitioning is generally a 
date (created_at), and the usual format of created_at is yyyy-MM-dd HH:mm:ss.S. 
If this field is formatted as yyyy/MM/dd, then Hive queries that fetch data 
between two dates, which is the usual case, become much more complex. For 
example, 

1. If the partitions are in the format yyyy/MM/dd, then the query to select 
data for all days between 2015-01-01 and 2015-03-01 would look like, 

SELECT * FROM db.table where year=2015 and ((month=01 or month=02) or (month=03 
and day=01))

2. If instead the partitions are in the format yyyy-MM-dd or yyyyMMdd, the 
data can be queried directly. 
E.g., the above-mentioned query would look like, 

SELECT * from db.table where DateStamp between '2015-01-01' and '2015-03-01'
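
As a sanity check that the two predicates select the same days, here is a small 
sketch in plain Python. The function names are hypothetical and no Hive is 
involved; it relies only on the fact that yyyy-MM-dd strings sort in 
chronological order.

```python
from datetime import date, timedelta


def in_range_single_column(datestamp: str) -> bool:
    """Mirror of the BETWEEN predicate on one yyyy-MM-dd partition column."""
    return "2015-01-01" <= datestamp <= "2015-03-01"


def in_range_three_columns(year: int, month: int, day: int) -> bool:
    """Mirror of the predicate on separate year/month/day partition columns."""
    return year == 2015 and ((month == 1 or month == 2) or (month == 3 and day == 1))


# Candidate days from 2014-12-30 onwards, extending past both ends of the range.
days = [date(2014, 12, 30) + timedelta(days=i) for i in range(70)]

single = [d for d in days if in_range_single_column(d.isoformat())]
three = [d for d in days if in_range_three_columns(d.year, d.month, d.day)]

# Both predicates pick the same set of days.
assert single == three
print(len(single), "days selected by either predicate")
```

The three-column predicate has to be rewritten for every new date range, while 
the single-column one only needs its two bounds changed.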


Reference - 
https://community.hortonworks.com/questions/29031/best-pratices-for-hive-partitioning-especially-by.html
 

The proposal is to make yyyy-MM-dd the default partition format, or at least 
to provide a way to change the format. 

Please share your thoughts on the above. The JIRA raised for this is here 
<https://issues.apache.org/jira/browse/HUDI-206> (HUDI-206).


Regards, 
Pratyaksh  
