Hi Tanu,

If your primary key is an integer, you can add one more field as a hash of
that integer and partition based on the hash field. It will add some
complexity to reads and writes because the hash has to be computed prior to
each read or write. I'm not sure whether the overhead of doing this exceeds
the performance gains from having fewer partitions. I wonder why Hudi
doesn't directly support hash-based partitioning?
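
For concreteness, here is a rough sketch of what I mean, assuming PySpark
with the Hudi Spark datasource. The column names (id, ts), the bucket count
of 256, the table name, and the S3 paths are all made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumes Spark was launched with the Hudi Spark bundle on the classpath.
spark = SparkSession.builder.appName("hash-bucket-sketch").getOrCreate()

# Hypothetical input: rows keyed by an integer primary key "id", with an
# update timestamp "ts" used as the precombine field.
df = spark.read.parquet("s3://some-bucket/raw/records/")  # placeholder path

NUM_BUCKETS = 256  # assumed bucket count; tune so partition count stays small

# Derive a coarse partition column by hashing the integer key into a fixed
# number of buckets, instead of partitioning on the raw key values.
bucketed = df.withColumn("id_bucket", F.abs(F.hash(F.col("id"))) % NUM_BUCKETS)

hudi_options = {
    "hoodie.table.name": "records",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    # Partition on the derived bucket rather than on millions of key values.
    "hoodie.datasource.write.partitionpath.field": "id_bucket",
    "hoodie.datasource.write.operation": "upsert",
}

(bucketed.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://some-bucket/hudi/records/"))  # placeholder path

# On the read side, filters on "id" must recompute the same bucket to get
# partition pruning, e.g.
#   key = 12345
#   spark.read.format("hudi").load("s3://some-bucket/hudi/records/") \
#       .where((F.col("id_bucket") == F.abs(F.hash(F.lit(key))) % NUM_BUCKETS)
#              & (F.col("id") == key))

The trade-off is exactly the one above: any query that filters on the raw
key has to recompute the same bucket expression to benefit from partition
pruning.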

Thanks
Jaimin

On Wed, 3 Jun 2020 at 10:07, tanu dua <tanu.dua...@gmail.com> wrote:

> Thanks Vinoth for the detailed explanation. I was thinking along the same
> lines as well, and I will take another look. We can reduce the 2nd and 3rd
> partitions, but it's very difficult to reduce the 1st partition as that is
> the basic primary key of our domain model, which analysts and developers
> need to query almost 90% of the time, and it's an integer primary key that
> can't be decomposed further.
>
> On Wed, 3 Jun 2020 at 9:23 AM, Vinoth Chandar <vin...@apache.org> wrote:
>
> > Hi tanu,
> >
> > For good query performance, it's recommended to write optimally sized
> > files.
> > Hudi already ensures that.
> >
> > Generally speaking, if you have too many partitions, then it also means
> > too many files. Most people limit their datasets to thousands of
> > partitions, since queries typically crunch data based on time or a
> > business domain (e.g. city for Uber). Partitioning too granularly - say,
> > based on user_id - is not very useful unless your queries only crunch
> > per user. If you are using the Hive metastore, then 25M partitions mean
> > 25M rows in your backing MySQL metastore db as well - not very scalable.
> >
> > What I am trying to say is: even outside of Hudi, if analytics is your
> > use case, it might be worth partitioning at a lower granularity and
> > increasing the rows per parquet file.
> >
> > Thanks
> > Vinoth
> >
> > On Tue, Jun 2, 2020 at 3:18 AM Tanuj <tanu.dua...@gmail.com> wrote:
> >
> > > Hi,
> > > We have a requirement to ingest 30M records into S3 backed by Hudi. I
> > > am figuring out the partition strategy and am ending up with a lot of
> > > partitions - 25M partitions (primary partition) --> 2.5M (secondary
> > > partition) --> 2.5M (third partition) - and each parquet file will
> > > have fewer than 10 rows of data.
> > >
> > > Our dataset will be ingested all at once in full, and then it will be
> > > incremental daily with fewer than 1k updates. So it's more read-heavy
> > > than write-heavy.
> > >
> > > So what is the suggestion in terms of Hudi performance - go ahead
> > > with the above partition strategy, or should I reduce my partitions
> > > and increase the number of rows in each parquet file?
> > >
> >
>
