Great!

On Wed, Aug 21, 2019 at 4:48 AM Pratyaksh Sharma <[email protected]> wrote:
> Hi Vinoth/Balaji,
>
> I am able to solve my use case using TimestampBasedKeyGenerator as the
> KeyGenerator. Thank you for suggesting the hook.
>
> On Sat, Aug 17, 2019 at 2:23 PM Pratyaksh Sharma <[email protected]>
> wrote:
>
> > Hi Vinoth,
> >
> > I am travelling right now with limited internet access. I will check and
> > update you on Monday.
> >
> > On Thu, Aug 15, 2019, 10:09 AM Vinoth Chandar <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> Do these hooks seem sufficient to support what you are looking for?
> >>
> >> On Tue, Aug 13, 2019 at 8:16 PM [email protected] <[email protected]>
> >> wrote:
> >>
> >> > Hi Pratyaksh,
> >> >
> >> > The partitioning format is pluggable in Hudi.
> >> >
> >> > 1. For Hudi writing, you can simply use one of the several
> >> >    implementations of org.apache.hudi.KeyGenerator, or write your own
> >> >    implementation, to control the partition path format. You can
> >> >    configure the partition path using
> >> >    https://hudi.incubator.apache.org/configurations.html#KEYGENERATOR_CLASS_OPT_KEY
> >> > 2. For Hive syncing, there are again some default implementations of
> >> >    org.apache.hudi.hive.PartitionValueExtractor. You can also write
> >> >    your own custom partition value extractor and configure it using
> >> >    https://hudi.incubator.apache.org/configurations.html#HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY
> >> >
> >> > Thanks,
> >> > Balaji.V
> >> >
> >> > On Tuesday, August 13, 2019, 03:23:57 AM PDT, Pratyaksh Sharma
> >> > <[email protected]> wrote:
> >> >
> >> > Hi,
> >> >
> >> > I have been working on Hudi for some time and have an improvement
> >> > suggestion.
> >> >
> >> > When we build a CDC pipeline, the field used for partitioning is
> >> > generally a date (created_at), and the general format of created_at
> >> > is yyyy-MM-dd HH:mm:ss.S. If this field is formatted as yyyy/MM/dd,
> >> > Hive queries that fetch data between two dates, which is the usual
> >> > case, become much more complex. For example:
> >> >
> >> > 1. If the partitions are in the format yyyy/MM/dd, then the query to
> >> >    select data for all days between 2015-01-01 and 2015-03-01 would
> >> >    look like:
> >> >
> >> >    SELECT * FROM db.table WHERE year=2015 AND ((month=01 OR month=02) OR
> >> >    (month=03 AND day=01))
> >> >
> >> > 2. If the partitions are instead in the format yyyy-MM-dd or yyyyMMdd,
> >> >    the table supports direct range queries on the data. The query
> >> >    above would then look like:
> >> >
> >> >    SELECT * FROM db.table WHERE DateStamp BETWEEN '2015-01-01' AND
> >> >    '2015-03-01'
> >> >
> >> > Reference:
> >> > https://community.hortonworks.com/questions/29031/best-pratices-for-hive-partitioning-especially-by.html
> >> >
> >> > The proposal is to make yyyy-MM-dd the default partitioning format, or
> >> > at least to provide a provision to change the format.
> >> >
> >> > Please share your thoughts on the above. The JIRA raised for this is
> >> > here: <https://issues.apache.org/jira/browse/HUDI-206> (HUDI-206).
> >> >
> >> > Regards,
> >> > Pratyaksh
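
For anyone reading this thread later, here is a minimal sketch of the formatting difference discussed above. It is plain Python, not Hudi code, and it does not show how TimestampBasedKeyGenerator works internally. The helper names `partition_path` and `in_range` and the sample rows are hypothetical, made up only for illustration. It assumes created_at arrives as a string in the yyyy-MM-dd HH:mm:ss.S format mentioned in the original mail.

```python
from datetime import datetime

# Assumed input format, mirroring yyyy-MM-dd HH:mm:ss.S from the thread.
CREATED_AT_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def partition_path(created_at: str, fmt: str = "%Y-%m-%d") -> str:
    """Hypothetical helper: turn a created_at string into a partition value."""
    ts = datetime.strptime(created_at, CREATED_AT_FORMAT)
    return ts.strftime(fmt)


def in_range(partition: str, start: str, end: str) -> bool:
    """Zero-padded yyyy-MM-dd strings sort in date order, so a plain
    string comparison is enough for an inclusive range filter."""
    return start <= partition <= end


# Made-up sample rows, for illustration only.
rows = [
    "2014-12-31 23:59:59.0",
    "2015-01-01 00:00:00.0",
    "2015-02-14 08:30:00.5",
    "2015-03-01 12:00:00.0",
    "2015-03-02 00:00:00.0",
]

# Single-column layout (yyyy-MM-dd) versus nested layout (yyyy/MM/dd).
single = [partition_path(r) for r in rows]
nested = [partition_path(r, "%Y/%m/%d") for r in rows]

# With the single-column layout, the range filter is one comparison,
# analogous to the BETWEEN query in the thread.
selected = [p for p in single if in_range(p, "2015-01-01", "2015-03-01")]
print(selected)
```

With the nested layout, the same filter would have to reason about the year, month, and day components separately, which is the complexity the first query in the thread illustrates.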
