Hi,

Do these hooks seem sufficient to support what you are looking for?
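
In case it helps, here is a minimal sketch of the date handling that custom implementations of the two hooks quoted below might contain. It is plain Java and does not depend on Hudi. The class and method names (PartitionFormatSketch, toPartitionPath, toHivePartitionValue) are made up for illustration, and wiring this logic into an actual KeyGenerator or PartitionValueExtractor implementation is left out.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Hudi: it only illustrates the string
// conversions a custom key generator or partition value extractor could do.
class PartitionFormatSketch {
    // created_at format mentioned in the thread, e.g. "2015-01-01 10:15:30.0"
    private static final DateTimeFormatter CREATED_AT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.S");
    // Slash-separated partition path, e.g. "2015/03/01"
    private static final DateTimeFormatter SLASHED =
            DateTimeFormatter.ofPattern("yyyy/MM/dd");
    // Single-column date value that supports direct range queries
    private static final DateTimeFormatter DASHED =
            DateTimeFormatter.ofPattern("yyyy-MM-dd");

    // Writing side: derive a yyyy-MM-dd partition path from created_at.
    static String toPartitionPath(String createdAt) {
        return LocalDateTime.parse(createdAt, CREATED_AT).format(DASHED);
    }

    // Hive sync side: turn an existing yyyy/MM/dd path into a yyyy-MM-dd value.
    static String toHivePartitionValue(String partitionPath) {
        return LocalDate.parse(partitionPath, SLASHED).format(DASHED);
    }
}
```

With something like this, toPartitionPath would be called where the key generator builds the partition path, and toHivePartitionValue where the extractor maps a path to the Hive partition value.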

On Tue, Aug 13, 2019 at 8:16 PM [email protected] <[email protected]>
wrote:

>
> Hi Pratyaksh,
> The partitioning format is pluggable in Hudi.
> 1. For Hudi writing, you can simply use one of the several implementations
> of org.apache.hudi.KeyGenerator or write your own implementation to control
> the partition path format. You can configure the key generator class using
> https://hudi.incubator.apache.org/configurations.html#KEYGENERATOR_CLASS_OPT_KEY
> 2. For Hive syncing, there are again some default implementations of
> org.apache.hudi.hive.PartitionValueExtractor. You can also write your own
> partition value extractor and configure it using
> https://hudi.incubator.apache.org/configurations.html#HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY
>
> Thanks,
> Balaji.V
>
> On Tuesday, August 13, 2019, 03:23:57 AM PDT, Pratyaksh Sharma
> <[email protected]> wrote:
>
> Hi,
>
> I have been working on Hudi for some time and have an improvement
> suggestion.
>
> When we build a CDC pipeline, the field used for partitioning is generally
> a date (created_at), whose usual format is yyyy-MM-dd HH:mm:ss.S. If this
> field is formatted as yyyy/MM/dd, Hive queries that fetch data between two
> dates, which is the usual case, become much more complex. For example,
>
> 1. If the partitions are in the format yyyy/MM/dd, then the query to select
> data for all days between 2015-01-01 and 2015-03-01 would look like,
>
> SELECT * FROM db.table where year=2015 and ((month=01 or month=02) or
> (month=03 and day=01))
>
> 2. If instead the partitions are in the format yyyy-MM-dd or yyyyMMdd, the
> date range can be queried directly.
> e.g., the above-mentioned query would look like,
>
> SELECT * FROM db.table WHERE DateStamp BETWEEN '2015-01-01' AND
> '2015-03-01'
>
>
> Reference -
> https://community.hortonworks.com/questions/29031/best-pratices-for-hive-partitioning-especially-by.html
>
> The proposal is to make yyyy-MM-dd the default partition format, or at
> least to provide a way to change the format.
>
> Please share your thoughts on the above. I have raised a JIRA for this
> (HUDI-206): https://issues.apache.org/jira/browse/HUDI-206
>
>
> Regards,
> Pratyaksh
