Re: [Hudi Improvement]: Modification of partition path format to support simplified queries

Vinoth Chandar Wed, 21 Aug 2019 05:35:52 -0700

Great!

On Wed, Aug 21, 2019 at 4:48 AM Pratyaksh Sharma <[email protected]>
wrote:


> Hi Vinoth/Balaji,
>
> I am able to solve my use case using TimestampBasedKeyGenerator as the
> KeyGenerator. Thank you for suggesting the hook.
>
> On Sat, Aug 17, 2019 at 2:23 PM Pratyaksh Sharma <[email protected]>
> wrote:
>
> > Hi Vinoth,
> >
> > I am travelling right now with limited access to internet. Will check and
> > update you on Monday.
> >
> > On Thu, Aug 15, 2019, 10:09 AM Vinoth Chandar <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> Do these hooks seem sufficient to support what you are looking for?
> >>
> >> On Tue, Aug 13, 2019 at 8:16 PM [email protected] <[email protected]>
> >> wrote:
> >>
> >> >
> >> > Hi Pratyaksh,
> >> > The partitioning format is pluggable in Hudi.
> >> > 1. For Hudi Writing, you can simply use one of the several
> >> > implementations of org.apache.hudi.KeyGenerator or write your own
> >> > implementation to control partition path format. You can configure
> >> > partition-path using
> >> > https://hudi.incubator.apache.org/configurations.html#KEYGENERATOR_CLASS_OPT_KEY
> >> > 2. For Hive Syncing, there are again some default implementations for
> >> > org.apache.hudi.hive.PartitionValueExtractor. You can also write your
> >> > custom partition value extractor and configure using
> >> > https://hudi.incubator.apache.org/configurations.html#HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY
> >> >
> >> > Thanks,
> >> > Balaji.V
> >> >
> >> > On Tuesday, August 13, 2019, 03:23:57 AM PDT, Pratyaksh Sharma
> >> > <[email protected]> wrote:
> >> >
> >> >  Hi,
> >> >
> >> > I have been working on Hudi for some time and have an improvement
> >> > suggestion.
> >> >
> >> > When we build a CDC pipeline, the field used for partitioning is
> >> > generally a date (created_at), and the usual format of created_at is
> >> > yyyy-MM-dd HH:mm:ss.S. If this field is formatted as yyyy/MM/dd, then
> >> > Hive queries that fetch data between any two dates, which is the usual
> >> > case, become much more complex. For example,
> >> >
> >> > 1. If the partitions are in the format yyyy/MM/dd, then a query to
> >> > select data for all days between 2015-01-01 and 2015-03-01 would look
> >> > like,
> >> >
> >> > SELECT * FROM db.table where year=2015 and ((month=01 or month=02) or
> >> > (month=03 and day=01))
> >> >
> >> > 2. If instead the partitions are in the format yyyy-MM-dd or yyyyMMdd,
> >> > direct range queries on the data are supported.
> >> > E.g. the above-mentioned query would look like,
> >> >
> >> > SELECT * from db.table where DateStamp between '2015-01-01' and
> >> > '2015-03-01'
> >> >
> >> >
> >> > Reference -
> >> > https://community.hortonworks.com/questions/29031/best-pratices-for-hive-partitioning-especially-by.html
> >> >
> >> > The proposal is to make the default partitioning format yyyy-MM-dd, or
> >> > at least to provide a provision to change the format.
> >> >
> >> > Please share your suggestions on the above. The JIRA raised for this is
> >> > HUDI-206: https://issues.apache.org/jira/browse/HUDI-206
> >> >
> >> >
> >> > Regards,
> >> > Pratyaksh
> >>
> >
>
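
The format comparison in the original proposal can be illustrated with a
small sketch. It is plain Python using only the standard library and is not
Hudi code: partition_path and select_between are hypothetical helpers, and
the list of partition values is made up for illustration. It assumes
created_at arrives as a string in the yyyy-MM-dd HH:mm:ss.S layout mentioned
in the thread, and shows that yyyy-MM-dd partition values sort in
chronological order, so a BETWEEN-style range check reduces to a plain
string comparison.

```python
from datetime import date, datetime, timedelta

# Python equivalent of the Java-style pattern yyyy-MM-dd HH:mm:ss.S
INPUT_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def partition_path(created_at: str, output_format: str = "%Y-%m-%d") -> str:
    """Hypothetical helper: derive a partition value from a created_at string."""
    return datetime.strptime(created_at, INPUT_FORMAT).strftime(output_format)


def select_between(partitions, start, end):
    """Hypothetical helper: keep partition values in [start, end], compared as strings."""
    return [p for p in partitions if start <= p <= end]


# Made-up partition values: one per day from 2014-12-30 through 2015-03-03.
days = [date(2014, 12, 30) + timedelta(days=i) for i in range(64)]
partitions = [d.strftime("%Y-%m-%d") for d in days]

# String comparison alone picks out 2015-01-01 through 2015-03-01 inclusive.
selected = select_between(partitions, "2015-01-01", "2015-03-01")

if __name__ == "__main__":
    print(partition_path("2015-02-14 10:30:00.0"))              # yyyy-MM-dd style
    print(partition_path("2015-02-14 10:30:00.0", "%Y/%m/%d"))  # yyyy/MM/dd style
    print(selected[0], selected[-1], len(selected))
```

With the yyyy/MM/dd layout the same range needs separate conditions on
year, month and day, which is the complexity the proposal describes.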
